The DisImpact R package contains functions that help in determining disproportionate impact (DI) based on three methodologies: the percentage point gap (PPG) method, the proportionality index, and the 80% index.
# From CRAN (Official)
install.packages('DisImpact')
# From github (Development)
devtools::install_github('vinhdizzo/DisImpact')
library(DisImpact)
library(dplyr) # Ease in manipulations with data frames
To illustrate the functionality of the package, let's load a toy data set:
# Load fake data set
data(student_equity)
# Print first few observations
head(student_equity)
# For description of data set
## ?student_equity
For a description of the student_equity data set, type ?student_equity in the R console.
The toy data set can be summarized as follows:
# Summarize toy data
dim(student_equity)
dSumm <- student_equity %>%
  group_by(Cohort, Ethnicity) %>%
  summarize(n=n(), Transfer_Rate=mean(Transfer))
dSumm ## This is a summarized version of the data set
di_ppg is the main work function for the percentage point gap (PPG) method. It accepts either vectors or column names (the tidy way):
# Vector
di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>%
  as.data.frame
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame
For a description of the di_ppg function, including both function arguments and returned results, type ?di_ppg in the R console.
Sometimes, one might want to break out the DI calculation by cohort:
# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
di_ppg is also applicable to summarized data: pass the success counts to success and the group sizes to weight. For example, the following uses the summarized data set dSumm, with the sample size n as the weight and Transfer_Rate*n as the success count:
di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>% as.data.frame
By default, di_ppg uses the overall success rate as the reference rate for comparison (default: reference='overall'). The reference argument also accepts 'hpg' (highest performing group success rate as the reference rate), 'all but current' (success rate of all groups combined excluding the comparison group), or a group value from group.
# Reference: Highest performing group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='hpg', data=student_equity) %>%
  as.data.frame
# Reference: All but current (PPG minus 1)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='all but current', data=student_equity) %>%
  as.data.frame
# Reference: custom group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='White', data=student_equity) %>%
  as.data.frame
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='Asian', data=student_equity) %>%
  as.data.frame
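To make the three built-in reference options concrete, the following base R sketch computes each reference rate by hand on a small made-up data set (the vectors success and group here are hypothetical, not student_equity). It is an illustration of the definitions given above, not the package's internal code.

```r
# Hypothetical toy data: 10 students in 3 groups
success <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
group   <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')

# Success rate of each group (named numeric vector)
rate <- sapply(split(success, group), mean)

# 'overall': success rate across all students
ref_overall <- mean(success)

# 'hpg': success rate of the highest performing group
ref_hpg <- max(rate)

# 'all but current': for each group, the rate among everyone outside that group
ref_all_but_current <- sapply(names(rate), function(g) mean(success[group != g]))

# Percentage point gap of each group against each reference
gaps <- data.frame(
  group = names(rate),
  rate = rate,
  gap_overall = rate - ref_overall,
  gap_hpg = rate - ref_hpg,
  gap_all_but_current = rate - ref_all_but_current
)
gaps
```

Note that 'all but current' yields a different reference rate for every group, whereas 'overall' and 'hpg' yield a single rate shared by all groups.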
The user could also pass in custom reference points for comparison (e.g., a state-wide rate). di_ppg accepts either a single reference point to be used or a vector of reference points, one for each cohort. For the latter, the vector of reference points will be taken to correspond to the cohort variable, alphabetically ordered.
# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>%
  as.data.frame
# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>%
  as.data.frame
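Because the reference vector is matched to cohorts by sorted order rather than by order of appearance, it can help to check the pairing before calling di_ppg. A minimal base R sketch, using hypothetical cohort labels and the same reference values as above:

```r
# Hypothetical cohort labels, deliberately not in sorted order
cohort <- c('2018', '2017', '2018', '2017', '2017')
reference <- c(0.5, 0.55)

# Sorted unique cohort values pair up with the reference vector position by position
ref_lookup <- setNames(reference, sort(unique(cohort)))
ref_lookup

# Reference rate that each observation would be compared against
ref_by_row <- unname(ref_lookup[cohort])
ref_by_row
```

Under this pairing the first reference value goes with the cohort that sorts first, regardless of which cohort appears first in the data.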
Disproportionate impact using the PPG relies on calculating the margin of error (MOE) around the success rate. The MOE calculated in di_ppg has 2 underlying assumptions (defaults):
1. A minimum MOE is enforced (controlled by min_moe).
2. The observed proportion is not used in the MOE formula (use_prop_in_moe=FALSE); a fixed proportion that maximizes the MOE is used instead.
To override 1, the user could specify min_moe in di_ppg. To override 2, the user could specify use_prop_in_moe=TRUE in di_ppg.
# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>%
  as.data.frame
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>%
  as.data.frame
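To see how the two settings interact, here is a hand-rolled sketch of a margin of error with a floor. It assumes the usual 95% normal-approximation formula for a proportion, 1.96 * sqrt(p * (1 - p) / n); the function name moe_sketch, the 1.96 multiplier, and the default floor of 0.03 are assumptions made for illustration, so consult ?ppg_moe for the package's actual behavior.

```r
# Hypothetical helper (not part of DisImpact): MOE for a proportion with a floor
moe_sketch <- function(n, proportion = 0.5, min_moe = 0.03) {
  # Normal-approximation MOE at roughly 95% confidence (assumed formula)
  moe <- 1.96 * sqrt(proportion * (1 - proportion) / n)
  # Never return less than the floor
  pmax(moe, min_moe)
}

# Fixed proportion of 0.5: small groups get a wide MOE, large groups hit the floor
moe_sketch(n = c(50, 400, 5000))

# Using an observed proportion of 0.9 and a lower floor shrinks the MOE
moe_sketch(n = 400, proportion = 0.9, min_moe = 0.02)
```

A smaller MOE makes it easier for a group's gap to be flagged, which is why lowering min_moe or using the observed proportion changes the results.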
In cases where the proportion is used in calculating MOE, an observed proportion of 0 or 1 would lead to a zero MOE. To account for these scenarios, the user could leverage the prop_sub_0 and prop_sub_1 parameters in di_ppg and ppg_moe as substitutes. These parameters default to 0.5, which maximizes the MOE (making it more difficult to declare disproportionate impact).
# Set Native American to have zero transfers and see the results
di_ppg(success=Transfer, group=Ethnicity, data=student_equity %>% mutate(Transfer=ifelse(Ethnicity=='Native American', 0, Transfer)), use_prop_in_moe=TRUE, prop_sub_0=0.1, prop_sub_1=0.9) %>%
  as.data.frame
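The effect of the substitution can be sketched in a few lines of base R. As before, the formula 1.96 * sqrt(p * (1 - p) / n) and the helper name moe_with_sub are assumptions for illustration, not the package's code; any minimum-MOE floor is left out to isolate the substitution.

```r
# Hypothetical helper: swap in a substitute when the observed proportion is 0 or 1
moe_with_sub <- function(n, proportion, prop_sub_0 = 0.5, prop_sub_1 = 0.5) {
  p <- ifelse(proportion == 0, prop_sub_0,
              ifelse(proportion == 1, prop_sub_1, proportion))
  1.96 * sqrt(p * (1 - p) / n)
}

# Without substitution, a proportion of 0 gives a zero MOE
1.96 * sqrt(0 * (1 - 0) / 100)

# Default substitute of 0.5 gives the widest MOE for this group size
moe_with_sub(n = 100, proportion = 0)

# A substitute of 0.1 gives a narrower MOE
moe_with_sub(n = 100, proportion = 0, prop_sub_0 = 0.1)
```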
di_prop_index is the main work function for the proportionality index method. It accepts either vectors or column names (the tidy way):
# Without cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>%
  as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame
# With cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>%
  as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
For a description of the di_prop_index function, including both function arguments and returned results, type ?di_prop_index in the R console.
Note that the referenced document describing this method does not recommend a threshold on the proportionality index for declaring disproportionate impact. The di_prop_index function uses di_prop_index_cutoff=0.8 as the default threshold, which the user could change.
# Changing threshold for DI
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_prop_index_cutoff=0.5) %>%
  as.data.frame
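For readers unfamiliar with the proportionality index, the sketch below computes it by hand on made-up data, taking the index to be a group's share of all successes divided by its share of the cohort. The data are hypothetical, and flagging with a strict "less than" comparison against the cutoff is an assumption about boundary handling.

```r
# Hypothetical toy data: 10 students in 3 groups
success <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
group   <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')

# Each group's share of the cohort and share of the successes
share_of_cohort  <- sapply(split(success, group), length) / length(success)
share_of_success <- sapply(split(success, group), sum) / sum(success)

# Proportionality index: values below 1 mean the group is under-represented
# among successes relative to its size
prop_index <- share_of_success / share_of_cohort
prop_index

# Flag groups falling below a chosen cutoff (assumed strict comparison)
cutoff <- 0.8
di_flag <- prop_index < cutoff
di_flag
```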
di_80_index is the main work function for the 80% index method. It accepts either vectors or column names (the tidy way):
# Without cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>%
  as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame
# With cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>%
  as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
For a description of the di_80_index function, including both function arguments and returned results, type ?di_80_index in the R console.
By default, di_80_index uses the group with the highest success rate as the reference in calculating the index. One could specify the comparison group using the reference_group argument (a value from group).
# Changing reference group
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, reference_group='White') %>%
  as.data.frame
di_80_index uses 80% (di_80_index_cutoff=0.80) as the default threshold for declaring disproportionate impact. One could override this with another threshold via the di_80_index_cutoff argument.
# Changing threshold for DI
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_80_index_cutoff=0.50) %>%
  as.data.frame
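The 80% index itself is a simple ratio, which the following base R sketch computes by hand: each group's success rate divided by the reference group's rate. The data are hypothetical, and the strict "less than" comparison against the cutoff is an assumption about boundary handling rather than confirmed package behavior.

```r
# Hypothetical toy data: 10 students in 3 groups
success <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
group   <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')

# Success rate of each group (named numeric vector)
rate <- sapply(split(success, group), mean)

# Default reference: the group with the highest success rate
index_80 <- rate / max(rate)
index_80

# Flag groups whose rate is under 80% of the reference rate
di_flag <- index_80 < 0.80
di_flag

# Custom reference group: divide by that group's rate instead
index_vs_B <- rate / rate[['B']]
index_vs_B
```

With a custom reference group, groups that outperform the reference get an index above 1.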
All methods and functions implemented in the DisImpact package treat outcomes as positive: 1 is desired over 0 (higher rate is better, lower rate indicates disparity). The choice of the name success in the functions' arguments is intentional to remind the user of this.
Suppose we have a variable that indicates something negative (e.g., a flag for students on academic probation). We could calculate DI on the converse of it by using the ! (logical negation) operator:
## di_ppg(success=!Probation, group=Ethnicity, data=student_equity) %>%
##   as.data.frame ## If there were a Probation variable
di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame ## Illustrating the point with `!`
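As a quick check of what the negation does, this base R sketch applies ! to a hypothetical 0/1 probation flag and confirms that each group's rate becomes one minus its original rate. The vectors probation and group are made up for illustration.

```r
# Hypothetical negative outcome: 1 = on academic probation
probation <- c(1, 0, 0, 1, 1, 0)
group     <- c('A', 'A', 'A', 'B', 'B', 'B')

# Rate of the negative outcome per group
rate_probation <- sapply(split(probation, group), mean)

# `!` turns 0/1 into TRUE/FALSE with the meaning flipped (TRUE = not on probation)
not_probation <- !probation
rate_not_probation <- sapply(split(not_probation, group), mean)

rate_probation
rate_not_probation
```

After negation, the group with the higher probation rate has the lower "success" rate, which is the direction the DI functions expect.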
We can compute the success, group, and cohort variables on the fly:
# Transform success
a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05))
mean(a)
di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>%
  as.data.frame
# Collapse Black and Hispanic
di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>%
  as.data.frame
It is often the case that the user desires to calculate disproportionate impact across many outcome variables and many disaggregation/group variables. The function di_iterate allows the user to specify a data set and the various variables to iterate across:
# Multiple group variables
di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') %>%
  as.data.frame
# Multiple group variables and different reference groups
bind_rows(
  di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall')
  , di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups=c('White', 'Male'), include_non_disagg_results=FALSE)
  # include_non_disagg_results = FALSE: Already have this scenario in Overall run
)
There is a separate vignette that explains how one might leverage di_iterate for rapid dashboard development and deployment with disaggregation and disproportionate impact features.
This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.
sessionInfo()