```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(aiDIF)
```
When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties differently for different demographic groups?
Classical DIF methods test whether an item performs differently across groups
within a single scoring condition. aiDIF extends this to a paired design: the
same items are calibrated for the same groups under both human and AI scoring,
and the question becomes whether the group DIF pattern shifts between the two
conditions, which is what the DASB test evaluates.
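To make the paired logic concrete, here is a toy illustration of the kind of contrast involved, written in plain R on made-up difficulty values. This is only a sketch of the idea; the exact statistic, standard errors, and anchoring used by aiDIF's DASB test may differ.

```r
# Hypothetical item difficulties (logits) for one item, two groups,
# under human scoring and under AI scoring.
b_human <- c(ref = 0.10, focal = 0.15)   # small group gap under human scoring
b_ai    <- c(ref = 0.12, focal = 0.55)   # larger group gap under AI scoring

dif_human <- b_human["focal"] - b_human["ref"]   # DIF contrast, human scoring
dif_ai    <- b_ai["focal"]    - b_ai["ref"]      # DIF contrast, AI scoring

# Difference-in-differences: how much the group gap changes when the
# AI engine does the scoring instead of humans.
dasb_like <- unname(dif_ai - dif_human)
dasb_like
```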
make_aidif_eg() returns a built-in example with item parameter MLEs for
6 items in two groups under both scoring conditions. The example has a planted
structure, including a group-dependent AI scoring bias on item 3 (tested below):
```r
eg <- make_aidif_eg()
str(eg, max.level = 2)
```
fit_aidif() runs the robust IRLS engine under each scoring condition and
performs the DASB test.
```r
mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle    = eg$ai,
  alpha     = 0.05
)
print(mod)
summary(mod)
```
scoring_bias_test() can also be called directly.
```r
sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
```
Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
ai_effect_summary() classifies each item by comparing its DIF result under human scoring with its DIF result under AI scoring:

```r
eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
```
| Status | Meaning |
|---|---|
| introduced | AI scoring creates DIF not present under human scoring |
| masked | AI scoring hides DIF that existed under human scoring |
| stable_dif | DIF detected in both conditions |
| stable_clean | No DIF in either condition |
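For instance, assuming ai_effect_summary() returns a data frame and stores these labels in a column named status (an assumption here; check names(eff) in your session), the items of most concern can be pulled out directly:

```r
# Items where AI scoring introduces DIF that was absent under human scoring.
# "status" as the column name is an assumption, not documented API.
subset(eff, status == "introduced")
```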
The fitted object has a plot method with three types:

```r
plot(mod, type = "dif_forest")  # human vs AI DIF side by side
plot(mod, type = "dasb")        # DASB bar chart with error bars
plot(mod, type = "weights")     # bi-square anchor weights
```
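The "weights" plot relates to the robust IRLS engine mentioned earlier: anchor items whose contrasts look like outliers are downweighted rather than dropped. As a standalone sketch of how Tukey bi-square weighting behaves (the tuning constant and the exact residuals that aiDIF weights are assumptions, not the package's documented internals):

```r
# Tukey bi-square weight function: full weight near zero, smoothly
# decreasing, and exactly zero beyond the tuning constant k.
bisquare_w <- function(u, k = 4.685) {
  ifelse(abs(u) < k, (1 - (u / k)^2)^2, 0)
}

resid_grid <- seq(-8, 8, by = 0.1)
plot(resid_grid, bisquare_w(resid_grid), type = "l",
     xlab = "standardized residual", ylab = "bi-square weight")
```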
simulate_aidif_data() generates synthetic paired data with known planted DIF and DASB effects, which is useful for checking what the method can recover:

```r
dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0.5,
  dasb_items = 5,
  dasb_mag   = 0.4,
  seed       = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
```
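As a quick sanity check, the same design can be run with the planted magnitudes set to zero, in which case flagged items should be rare. This is a sketch; whether simulate_aidif_data() accepts zero magnitudes in this way is an assumption.

```r
# Same design, but planted DIF/DASB magnitudes of zero: any flagged
# items would be false positives.
null_dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0,
  dasb_items = 5,
  dasb_mag   = 0,
  seed       = 456
)
print(fit_aidif(null_dat$human, null_dat$ai))
```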