DisImpact Tutorial"

Introduction

The DisImpact R package contains functions that help in determining disproportionate impact (DI) based on the following methodologies:

  1. percentage point gap (PPG) method,
  2. proportionality index method (method #1 in reference), and
  3. 80% index method (method #2 in reference).

Install Package

# From CRAN (Official)
install.packages('DisImpact')

# From github (Development)
devtools::install_github('vinhdizzo/DisImpact')

Load Packages

library(DisImpact)
library(dplyr) # Ease in manipulations with data frames

Load toy student equity data

To illustrate the functionality of the package, let's load a toy data set:

# Load fake data set
data(student_equity)

# Print first few observations
head(student_equity)

# For description of data set
## ?student_equity

For a description of the student_equity data set, type ?student_equity in the R console.

The toy data set can be summarized as follows:

# Summarize toy data
dim(student_equity)
dSumm <- student_equity %>%
  group_by(Cohort, Ethnicity) %>%
  summarize(n=n(), Transfer_Rate=mean(Transfer))
dSumm ## This is a summarized version of the data set

Percentage point gap (PPG) method

di_ppg is the main work function, and it can take on vectors or column names the tidy way:

# Vector
di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

For a description of the di_ppg function, including both function arguments and returned results, type ?di_ppg in the R console.

Sometimes, one might want to break out the DI calculation by cohort:

# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

di_ppg is also applicable to summarized data; just pass the counts to success and group size to weight. For example, we use the summarized data set, dSumm, and sample size n, in the following:

di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>%
  as.data.frame

By default, di_ppg uses the overall success rate as the reference rate for comparison (default: reference='overall'). The reference argument also accepts 'hpg' (highest performing group success rate as the reference rate), 'all but current' (success rate of all groups combined excluding the comparison group), or a group value from group.

# Reference: Highest performing group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='hpg', data=student_equity) %>%
  as.data.frame

# Reference: All but current (PPG minus 1)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='all but current', data=student_equity) %>%
  as.data.frame

# Reference: custom group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='White', data=student_equity) %>%
  as.data.frame
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='Asian', data=student_equity) %>%
  as.data.frame

The user could also pass in custom reference points for comparison (e.g., a state-wide rate). di_ppg accepts either a single reference point to be used or a vector of reference points, one for each cohort. For the latter, the vector of reference points will be taken to correspond to the cohort variable, alphabetically ordered.

# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>%
  as.data.frame

# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>%
  as.data.frame

Disproportionate impact using the PPG relies on calculating the margine margin of error (MOE) pertaining around the success rate. The MOE calculated in di_ppg has 2 underlying assumptions (defaults):

  1. the minimum MOE returned is 0.03, and
  2. using 0.50 as the proportion in the margin of error formula, $1.96 \times \sqrt{\hat{p} (1-\hat{p}) / n}$.

To override 1, the user could specify min_moe in di_ppg. To override 2, the user could specify use_prop_in_moe=TRUE in di_ppg.

# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>%
  as.data.frame
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>%
  as.data.frame

In cases where the proportion is used in calculating MOE, an observed proportion of 0 or 1 would lead to a zero MOE. To account for these scenarios, the user could leverage the prop_sub_0 and prop_sub_1 parameters in di_ppg and ppg_moe as substitutes. These parameters default to 0.5, which maximizes the MOE (making it more difficult to declare disproportionate impact).

# Set Native American to have have zero transfers and see what the results
di_ppg(success=Transfer, group=Ethnicity, data=student_equity %>% mutate(Transfer=ifelse(Ethnicity=='Native American', 0, Transfer)), use_prop_in_moe=TRUE, prop_sub_0=0.1, prop_sub_1=0.9) %>%
  as.data.frame

Proportionality index method

di_prop_index is the main work function for this method, and it can take on vectors or column names the tidy way:

# Without cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

For a description of the di_prop_index function, including both function arguments and returned results, type ?di_prop_index in the R console.

Note that the referenced document describing this method does not recommend a threshold on the proportionality index for declaring disproportionate impact. The di_prop_index function uses di_prop_index_cutoff=0.8 as the default threshold, which the user could change.

# Changing threshold for DI
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_prop_index_cutoff=0.5) %>% as.data.frame

80% index method

di_80_index is the main work function for this method, and it can take on vectors or column names the tidy way:

# Without cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame

For a description of the di_80_index function, including both function arguments and returned results, type ?di_80_index in the R console.

By default, di_80_index uses the group with the highest success rate as reference in calculating the index. One could specify the the comparison group using the reference_group argument (a value from group).

# Changing reference group
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, reference_group='White') %>% as.data.frame

By default, di_80_index uses 80% (di_80_index_cutoff=0.80) as the default threshold for declaring disproportionate impact. One could override this using another threshold via the di_80_index_cutoff argument.

# Changing threshold for DI
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_80_index_cutoff=0.50) %>% as.data.frame

When dealing with a non-success variable like drop-out or probation

All methods and functions implemented in the DisImpact package treat outcomes as positive: 1 is desired over 0 (higher rate is better, lower rate indicates disparity). The choice of the name success in the functions' arguments is intentional to remind the user of this.

Suppose we have a variable that indicates something negative (e.g., a flag for students on academic probation). We could calculate DI on the converse of it by using the ! (logical negation) operator:

## di_ppg(success=!Probation, group=Ethnicity, data=student_equity) %>%
##   as.data.frame ## If there were a Probation variable
di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame ## Illustrating the point with `!`

Transformations on the fly

We can compute the success, group, and cohort variables on the fly:

# Transform success
a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05))
mean(a)
di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>%
  as.data.frame

# Collapse Black and Hispanic
di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>% as.data.frame

Calculate DI for many variables and groups

It is often the case that the user desires to calculate disproportionate impact across many outcome variables and many disaggregation/group variables. The function di_iterate allows the user to specify a data set and the various variables to iterate across:

# Multiple group variables
di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') %>% as.data.frame

# Multiple group variables and different reference groups

bind_rows(
  di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall')
  , di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups=c('White', 'Male'), include_non_disagg_results=FALSE) # include_non_disagg_results = FALSE: Already have this scenario in Overall run
)

There is a separate vignette that explains how one might leverage di_iterate for rapid dashboard development and deployment with disaggregation and disproportionate impact features.

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.

sessionInfo()


Try the DisImpact package in your browser

Any scripts or data that you put into this service are public.

DisImpact documentation built on Oct. 11, 2022, 1:06 a.m.