Introduction

The DisImpact R package contains functions that help in determining disproportionate impact (DI) based on the following methodologies:

  1. percentage point gap (PPG) method,
  2. proportionality index method (method #1 in reference), and
  3. 80% index method (method #2 in reference).

Install Package

# From CRAN (Official)
install.packages('DisImpact')

# From github (Development)
devtools::install_github('vinhdizzo/DisImpact')

Load Packages

library(DisImpact)
library(dplyr) # Ease in manipulations with data frames

Load toy student equity data

To illustrate the functionality of the package, let's load a toy data set:

# Load fake data set
data(student_equity)

# Print first few observations
head(student_equity)

# For description of data set
## ?student_equity

For a description of the student_equity data set, type ?student_equity in the R console.

The toy data set can be summarized as follows:

# Summarize toy data
dim(student_equity)
dSumm <- student_equity %>%
  group_by(Cohort, Ethnicity) %>%
  summarize(n=n(), Transfer_Rate=mean(Transfer))
dSumm ## This is a summarized version of the data set

Percentage point gap (PPG) method

di_ppg is the main work function, and it can take on vectors or column names the tidy way:

# Vector
di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

For a description of the di_ppg function, including both function arguments and returned results, type ?di_ppg in the R console.

Sometimes, one might want to break out the DI calculation by cohort:

# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

di_ppg is also applicable to summarized data; just pass the counts to success and group size to weight. For example, we use the summarized data set, dSumm, and sample size n, in the following:

di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>%
  as.data.frame

By default, di_ppg uses the overall success rate as the reference rate for comparison (default: reference='overall'). The reference argument also accepts 'hpg' (highest performing group success rate as the reference rate), 'all but current' (success rate of all groups combined excluding the comparison group), or a group value from group.

# Reference: Highest performing group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='hpg', data=student_equity) %>%
  as.data.frame

# Reference: All but current (PPG minus 1)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='all but current', data=student_equity) %>%
  as.data.frame

# Reference: custom group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='White', data=student_equity) %>%
  as.data.frame
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='Asian', data=student_equity) %>%
  as.data.frame

The user could also pass in custom reference points for comparison (e.g., a state-wide rate). di_ppg accepts either a single reference point to be used or a vector of reference points, one for each cohort. For the latter, the vector of reference points will be taken to correspond to the cohort variable, alphabetically ordered.

# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>%
  as.data.frame

# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>%
  as.data.frame

Disproportionate impact using the PPG relies on calculating the margine margin of error (MOE) pertaining around the success rate. The MOE calculated in di_ppg has 2 underlying assumptions (defaults):

  1. the minimum MOE returned is 0.03, and
  2. using 0.50 as the proportion in the margin of error formula, $1.96 \times \sqrt{\hat{p} (1-\hat{p}) / n}$.

To override 1, the user could specify min_moe in di_ppg. To override 2, the user could specify use_prop_in_moe=TRUE in di_ppg.

# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>%
  as.data.frame
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>%
  as.data.frame

In cases where the proportion is used in calculating MOE, an observed proportion of 0 or 1 would lead to a zero MOE. To account for these scenarios, the user could leverage the prop_sub_0 and prop_sub_1 parameters in di_ppg and ppg_moe as substitutes. These parameters default to 0.5, which maximizes the MOE (making it more difficult to declare disproportionate impact).

# Set Native American to have have zero transfers and see what the results
di_ppg(success=Transfer, group=Ethnicity, data=student_equity %>% mutate(Transfer=ifelse(Ethnicity=='Native American', 0, Transfer)), use_prop_in_moe=TRUE, prop_sub_0=0.1, prop_sub_1=0.9) %>%
  as.data.frame

Proportionality index method

di_prop_index is the main work function for this method, and it can take on vectors or column names the tidy way:

# Without cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

For a description of the di_prop_index function, including both function arguments and returned results, type ?di_prop_index in the R console.

Note that the referenced document describing this method does not recommend a threshold on the proportionality index for declaring disproportionate impact. The di_prop_index function uses di_prop_index_cutoff=0.8 as the default threshold, which the user could change.

# Changing threshold for DI
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_prop_index_cutoff=0.5) %>% as.data.frame

80% index method

di_80_index is the main work function for this method, and it can take on vectors or column names the tidy way:

# Without cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

For a description of the di_80_index function, including both function arguments and returned results, type ?di_80_index in the R console.

By default, di_80_index uses the group with the highest success rate as reference in calculating the index. One could specify the the comparison group using the reference_group argument (a value from group).

# Changing reference group
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, reference_group='White') %>% as.data.frame

By default, di_80_index uses 80% (di_80_index_cutoff=0.80) as the default threshold for declaring disproportionate impact. One could override this using another threshold via the di_80_index_cutoff argument.

# Changing threshold for DI
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_80_index_cutoff=0.50) %>% as.data.frame

When dealing with a non-success variable like drop-out or probation

All methods and functions implemented in the DisImpact package treat outcomes as positive: 1 is desired over 0 (higher rate is better, lower rate indicates disparity). The choice of the name success in the functions' arguments is intentional to remind the user of this.

Suppose we have a variable that indicates something negative (e.g., a flag for students on academic probation). We could calculate DI on the converse of it by using the ! (logical negation) operator:

## di_ppg(success=!Probation, group=Ethnicity, data=student_equity) %>%
##   as.data.frame ## If there were a Probation variable
di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame ## Illustrating the point with `!`

Transformations on the fly

We can compute the success, group, and cohort variables on the fly:

# Transform success
a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05))
mean(a)
di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>%
  as.data.frame

# Collapse Black and Hispanic
di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>% as.data.frame

Calculate DI for many variables and groups

It is often the case that the user desires to calculate disproportionate impact across many outcome variables and many disaggregation/group variables. The function di_iterate allows the user to specify a data set and the various variables to iterate across:

# Multiple group variables
di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') %>% as.data.frame

# Multiple group variables and different reference groups

bind_rows(
  di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall')
  , di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups=c('White', 'Male'), include_non_disagg_results=FALSE) # include_non_disagg_results = FALSE: Already have this scenario in Overall run
)

There is a separate vignette that explains how one might leverage di_iterate for rapid dashboard development and deployment with disaggregation and disproportionate impact features.

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.

sessionInfo()


vinhdizzo/DisImpact documentation built on Oct. 15, 2022, 4:16 p.m.