```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(aiDIF)
```
When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties differently for different demographic groups?
Classical DIF methods test whether an item performs differently across groups
within a single scoring condition. aiDIF extends this to a paired design: the
same items are calibrated for the same groups under both human and AI scoring,
and the question becomes whether the group DIF pattern shifts between the two
conditions, which is what the DASB test evaluates.
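To make the paired logic concrete, here is a toy illustration of the kind of contrast involved, written in plain R on made-up difficulty values. This is only a sketch of the idea; the exact statistic, standard errors, and anchoring used by aiDIF's DASB test may differ.

```r
# Hypothetical item difficulties (logits) for one item, two groups,
# under human scoring and under AI scoring.
b_human <- c(ref = 0.10, focal = 0.15)   # small group gap under human scoring
b_ai    <- c(ref = 0.12, focal = 0.55)   # larger group gap under AI scoring

dif_human <- b_human["focal"] - b_human["ref"]   # DIF contrast, human scoring
dif_ai    <- b_ai["focal"]    - b_ai["ref"]      # DIF contrast, AI scoring

# Difference-in-differences: how much the group gap changes when the
# AI engine does the scoring instead of humans.
dasb_like <- unname(dif_ai - dif_human)
dasb_like
```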
make_aidif_eg() returns a built-in example with item parameter MLEs for
6 items in two groups under both scoring conditions. The example has a planted
structure, including a group-dependent AI scoring bias on item 3 (tested below):
```r
eg <- make_aidif_eg()
str(eg, max.level = 2)
```
fit_aidif() runs the robust IRLS engine under each scoring condition and
performs the DASB test.
```r
mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle    = eg$ai,
  alpha     = 0.05
)
print(mod)
summary(mod)
```
scoring_bias_test() can also be called directly.
```r
sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
```
Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
ai_effect_summary() classifies each item by comparing its DIF result under human scoring with its DIF result under AI scoring:

```r
eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
```
| Status | Meaning |
|---|---|
| introduced | AI scoring creates DIF not present under human scoring |
| masked | AI scoring hides DIF that existed under human scoring |
| stable_dif | DIF detected in both conditions |
| stable_clean | No DIF in either condition |
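For instance, assuming ai_effect_summary() returns a data frame and stores these labels in a column named status (an assumption here; check names(eff) in your session), the items of most concern can be pulled out directly:

```r
# Items where AI scoring introduces DIF that was absent under human scoring.
# "status" as the column name is an assumption, not documented API.
subset(eff, status == "introduced")
```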
The fitted object has a plot method with three types:

```r
plot(mod, type = "dif_forest")  # human vs AI DIF side by side
plot(mod, type = "dasb")        # DASB bar chart with error bars
plot(mod, type = "weights")     # bi-square anchor weights
```
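The "weights" plot relates to the robust IRLS engine mentioned earlier: anchor items whose contrasts look like outliers are downweighted rather than dropped. As a standalone sketch of how Tukey bi-square weighting behaves (the tuning constant and the exact residuals that aiDIF weights are assumptions, not the package's documented internals):

```r
# Tukey bi-square weight function: full weight near zero, smoothly
# decreasing, and exactly zero beyond the tuning constant k.
bisquare_w <- function(u, k = 4.685) {
  ifelse(abs(u) < k, (1 - (u / k)^2)^2, 0)
}

resid_grid <- seq(-8, 8, by = 0.1)
plot(resid_grid, bisquare_w(resid_grid), type = "l",
     xlab = "standardized residual", ylab = "bi-square weight")
```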
simulate_aidif_data() generates synthetic paired data with known planted DIF and DASB effects, which is useful for checking what the method can recover:

```r
dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0.5,
  dasb_items = 5,
  dasb_mag   = 0.4,
  seed       = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
```
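As a quick sanity check, the same design can be run with the planted magnitudes set to zero, in which case flagged items should be rare. This is a sketch; whether simulate_aidif_data() accepts zero magnitudes in this way is an assumption.

```r
# Same design, but planted DIF/DASB magnitudes of zero: any flagged
# items would be false positives.
null_dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0,
  dasb_items = 5,
  dasb_mag   = 0,
  seed       = 456
)
print(fit_aidif(null_dat$human, null_dat$ai))
```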