#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-060-")
Before we can treat "starters", we have to introduce "FYE proxies"---estimates of the degree-granting engineering programs First-Year Engineering (FYE) students would have declared had they not been required to enroll in FYE.
Users of midfielddata practice data are not required to reproduce this vignette---the results are included with midfieldr in the fye_proxy
data set.
This vignette in the MIDFIELD workflow.
At some US institutions, engineering students are required to complete a First-Year Engineering (FYE) program as a prerequisite for declaring an engineering major. Administratively, degree-granting engineering programs such as Electrical Engineering or Mechanical Engineering treat their incoming post-FYE students as their "starting" cohorts. However, when computing a metric such as graduation rate that requires a count of starters, FYE records must be treated with special care to avoid a miscount.
To illustrate the potential for miscounting starters, suppose we wish to calculate a Mechanical Engineering (ME) graduation rate. Students starting in ME constitute the starting pool and the fraction of that pool graduating in ME is the graduation rate.
At FYE institutions, an ME program would typically define their starting pool as the post-FYE cohort entering their program. This may be the best information available, but it invariably undercounts starters by failing to account for FYE students who leave the institution or switch to non-engineering majors. In the absence of the FYE requirement some of these students would have been ME starters. By neglecting these students, the count of ME starters is artificially low resulting in an ME graduation rate that is artificially high. The same is true for every degree-granting engineering major in an FYE institution.
Because of the special nature of FYE programs, we cannot address starter miscounts by grouping FYE students with those admitted with "undecided" or "unknown" CIP codes---FYE students are neither. They were admitted as Engineering majors (2-digit CIP 14). We simply don't know to which degree-granting program (6-digit CIP) they intended to transition.
Therefore, to avoid miscounting starters at FYE institutions, we estimate the 6-digit CIP codes of the degree-granting engineering programs that FYE students would have declared had they not been required to enroll in FYE.
We apply prep_fye_mice()
to the student
and term
source files to construct a data frame suitable for imputation using the mice R package. The procedure has four steps:
Use prep_fye_mice()
from the midfieldr package to estimate some of the FYE proxy CIPs, treat the remainder as missing values, and structure the data frame for imputation.
Optional. If the default predictor variables (institution, race/ethnicity, and sex) do not meet the needs of your study, you can define your own.
Use mice()
from the mice package to impute the 6-digit CIP missing values.
Post-processing to convert the results to useful form and to remove migrators.
Three outcomes are possible, depending on your goals and available data:
Use midfielddata practice data to recreate the fye_proxy
data set included with midfieldr---as we do in this vignette.
Use midfielddata practice data to create an alternate set of FYE proxies based on a different random number seed or different predictor variables. The result would have the same IDs as fye_proxy
but different ID-proxy pairings.
Use MIDFIELD research data and construct your own FYE proxies.
For a given set of source files, FYE proxies need be created only once and written to file. The result can be used as needed unless the source files change.
Start. If you are writing your own script to follow along, we use these packages in this article:
library("midfieldr") library("midfielddata") library("data.table") library("ggplot2") library("mice")
Load. Practice datasets. View data dictionaries via ?student
, ?term
.
# Load practice data data(student, term)
Loads with midfieldr. Prepared data, derived in Programs. View data dictionary via ?study_programs
.
study_programs
Unlike the initial processing in previous articles, we do not filter for data sufficiency and degree seeking.
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables source_student <- copy(student) source_term <- copy(term) # Optional. Select variables required by midfieldr functions student <- select_required(source_student) term <- select_required(source_term)
prep_fye_mice()
The purpose of prep_fye_mice()
is preparing a data frame for the mice R package. Operates on the complete, unfiltered student
and term
source data to create a data frame with three predictor variables and an FYE proxy variable. The values in proxy
are determined by a student's first post-FYE program code, as follows:
Post-FYE in Engineering. The student completes FYE and enrolls in an engineering major. For this outcome, we know that at the student's first opportunity, they enrolled in an engineering major of their choosing. The CIP code of that program is returned as the student's FYE proxy.
Post-FYE not in Engineering. The student migrates to a non-engineering major or has no post-FYE records in the database. The data provide no indication of the student's preferred degree-granting engineering major. Thus their FYE proxy value is returned as NA, to be treated as missing data to be imputed.
Arguments.
midfield_student
Data frame of student observations, keyed by student ID. Default is student
. Required variables are mcid
, race
, and sex
. Use all rows of your source student
data table.
midfield_term
Data frame of term observations keyed by student ID. Default is term
. Required variables are mcid
, institution
, term
, and cip6
. Use all rows of your source term
data table.
fye_codes
Optional character vector of 6-digit CIP codes assigned to FYE programs. Default is "140102". Argument to be used by name.
Implicit arguments. The following implementations yield identical results.
# Required arguments in order and explicitly named x <- prep_fye_mice(midfield_student = student, midfield_term = term) # Required arguments in order, but not named y <- prep_fye_mice(student, term) # Using the implicit defaults z <- prep_fye_mice() # Demonstrate equivalence check_equiv_frames(x, y) check_equiv_frames(x, z)
Output. The function returns one row per FYE student keyed by student ID. All variables except ID are returned as factors to meet the requirements of mice()
.
op <- options() options(datatable.print.class = TRUE) # Working data frame DT <- prep_fye_mice(student, term) DT options(op)
#| echo: false N_ever_fye <- length(unique(DT$mcid)) N_to_impute <- sum(is.na(DT$proxy))
The output of prep_fye_mive()
should contain missing values in the proxy column only. Other variables are complete. A race/ethnicity or sex value of "unknown" is treated as an observed value, not missing data. And while no values of ID or institution are unknown or missing in this example, such observations (if they existed) would have to be removed.
Checking that all variables except proxy
are complete.
# Number of unique IDs x <- length(unique(DT$mcid)) # Number of complete cases on four variables y <- sum(complete.cases(DT[, .(mcid, race, sex, institution)])) # Demonstrate equivalence all.equal(x, y)
Number of missing observations in proxy
.
# Number NAs in proxy sum(is.na(DT$proxy)) # Percentage NAs in proxy 100 * round(sum(is.na(DT$proxy)) / nrow(DT), 3)
Missing at random (MAR). These missing proxy
data are caused by a student's decision to migrate to a non-engineering major or to leave the database. At the time of making that decision, the FYE student would not yet have enrolled in a degree-granting engineering major, thus their decision is unlikely to be related to any specific engineering major.
That a CIP is missing, therefore, is unlikely to be related to a specific CIP value---but may be related to other observations such as institution, race/ethnicity, or sex. Missing data of this type are classified as "missing at random" (MAR) which are suitable for multiple imputation and yield unbiased results [@GraceMartin:2012].
Multiple imputation. Lastly, while 5--10 imputations are generally considered adequate for unbiasedness, Bodner [-@Bodner:2008] recommends having as many imputations as the percentage of missing data.
# Number of proxies to be imputed (N_impute <- sum(is.na(DT$proxy))) # Number of observations with complete predictor information (N_complete <- sum(complete.cases(DT[, .(mcid, race, sex, institution)]))) # Percent missing proxies (percent_missing <- round(100 * N_impute / N_complete, 3))
As shown here, the overall percentage of missing data is r percent_missing
%, suggesting we set the number of imputations to r round(percent_missing, 0)
.
# For the "m" argument in mice() (m_imputations <- round(percent_missing, 0))
Chart. The chart displays the percent missing data by category. The institution
category isn't used because the practice data contain FYE students in one institution only. The vertical dashed line indicates the r m_imputations
% percent missing data overall.
#| echo: false #| label: fig01 #| fig.asp: 0.35 #| fig-cap: "Figure 1: Percent missing data by category." # New memory location x <- copy(DT) # Convert factors to characters x <- x[, lapply(.SD, as.character)] # Overall percentage missing proxies overall_miss <- 100 * round(nrow(x[is.na(proxy)]) / nrow(x), 3) # Function for determining percent missing proxies by category missing_fraction <- function(DT, cat_level) { DT[, group := fcase( is.na(proxy), "missing", default = "not_missing" )] DT <- DT[, .N, by = c("group", "cat_level")] DT <- dcast(DT, cat_level ~ group, value.var = "N") DT[, pct_miss := 100 * round(missing / (missing + not_missing), 3)] DT <- DT[, .(cat_level, pct_miss)] } # Apply to institution category y <- x[, .(cat_level = institution, proxy)] inst <- missing_fraction(y, cat_level = "institution") inst[, category := "institution"] # Apply to race/ethnicity category y <- x[, .(cat_level = race, proxy)] race <- missing_fraction(y, cat_level = "race") race[, category := "race"] # Apply to sex category y <- x[, .(cat_level = sex, proxy)] sex <- missing_fraction(y, cat_level = "sex") sex[, category := "sex"] # Gather results and plot # Use race and sex only, only one institution df <- rbindlist(list(race, sex)) ggplot(df, aes(x = pct_miss, y = reorder(cat_level, pct_miss))) + geom_vline(xintercept = overall_miss, linetype = 2, color = "gray35") + geom_point(size = 1.5) + facet_grid( rows = vars(reorder(category, pct_miss)), as.table = TRUE, scales = "free_y", space = "free_y" ) + labs(x = "FYE proxy missing values (%)", y = "") + scale_x_continuous(breaks = seq(0, 100, 5), limits = c(30, 55)) + theme(panel.grid.minor.x = element_blank())
mice()
The mice package [@vanBuuren+Oudshoorn:2011] implements multiple imputation by chained equations (MICE). MICE is also known as "fully conditional specification" or "sequential regression multiple imputation" and is suitable for categorical variables such as ours [@azur2011]. Our computational procedure follows the approach suggested by Dhana [-@dhana2017].
Framework. Our first use of mice()
is to examine the imputation framework by calling the function with zero iterations on the DT
data frame. mice()
produces a "multiply-imputed data set", an R object of class "mids".
# Imputation framework framework <- mice(DT, maxit = 0) framework
Logged events warning. The printout above includes a warning about two "logged events"---an indication that two variables will not be used as predictors. We can isolate the warning for a closer look,
# Examine the warning framework$loggedEvents
The two variables are mcid
and institution
.
mcid
was never intended to be a predictor variable. We retain the ID column so that imputed CIP values are assigned to specific IDs.
institution
usually is a predictor. In this case, however, the FYE students are all at the same institution---a characteristic of the midfielddata practice data only.
Imputation methods. We look more closely at two elements of this framework. The first is the imputation method vector.
# Imputation method method_vector <- framework[["method"]] method_vector
The "polyreg" imputation method (polytomous logistic regression) is appropriate for data, like ours, comprising unordered categorical variables. Variable proxy
is imputed using the polyreg method; the other variables, being predictors, are not imputed, thus their methods are empty.
Had the method not been correctly assigned, we would assign it as follows,
# Manually assign the variable(s) being imputed method_vector[c("proxy")] <- "polyreg" # Manually assign the variable(s) not being imputed method_vector[c("mcid", "institution", "race", "sex")] <- "" method_vector
Predictor matrix. The second element to review is the predictor matrix. A row label identifies the variable being predicted; the columns indicate the predictor variables.
# Imputation predictor matrix predictor_matrix <- framework[["predictorMatrix"]] predictor_matrix
However, only those variables assigned a method are imputed. In our case, the only variable to be imputed is proxy
, so the only row of this matrix that gets used is the last row.
# Predictor row for this example predictor_matrix["proxy", ]
The zeros and ones tell us that proxy
is going to be predicted by race and sex. Again, the institution variable is not a predictor because the practice data contain one FYE institution only. (This would not be the case if one were using the MIDFIELD research database.)
Had the default setting been incorrect, we can set them manually. Again, note that the bottom row is the only row we need because only the proxy
variable is being imputed.
# Manually assign zero columns predictor_matrix[, c("mcid", "proxy", "institution")] <- 0 # Manually assign predictor columns predictor_matrix[, c("race", "sex")] <- c(0, 0, 0, 0, 1) predictor_matrix
If the data included more than one FYE institution, the manual assignment would be,
#| eval: false # Not run predictor_matrix[, c("mcid", "proxy")] <- 0 predictor_matrix[, c("race", "sex", "institution")] <- c(0, 0, 0, 0, 1)
The default predictors set up by prep_fye_mice()
are institution (required), race/ethnicity, and sex. If these are acceptable, you can skip to the next section, [Imputing missing values].
Predictors can be edited or added before invoking mice()
. As before, ensure that the only missing values are in the proxy column. Other variables are expected to be complete (no NA values). A value of "unknown" in a predictor column, e.g., race/ethnicity or sex, is an acceptable value, not missing data. Observations with missing or unknown values in the ID or institution columns should be removed.
For example, suppose we wish to replace race/ethnicity and sex with a people
variable that has four possible values (Domestic Female
, Domestic Male
, International Female
, and International Male
) where "domestic" means a US citizen; and we want to add a variable that encodes the year
of a student's first term in FYE.
Creating variables. Remove any unknown observations of race/ethnicity and sex to create the desired people
variable.
# Data frame to illustrate optional predictors opt_DT <- copy(DT) # Factor to character cols_to_edit <- c("race", "sex") opt_DT[, (cols_to_edit) := lapply(.SD, as.character), .SDcols = cols_to_edit] # Filter unknown race and sex opt_DT <- opt_DT[sex != "Unknown"] opt_DT <- opt_DT[race != "Other/Unknown"] # Create origin variable opt_DT[, origin := fcase( race != "International", "Domestic", race == "International", "International", default = NA_character_ )] opt_DT <- opt_DT[!is.na(origin)] # Create people variable opt_DT[, people := paste(origin, sex)] opt_DT[, people := as.factor(people)] opt_DT[, c("race", "sex", "origin") := NULL] # Display result setcolorder(opt_DT, c("mcid", "people", "institution", "proxy")) opt_DT
Check the unique values.
# Display unique people sort(unique(opt_DT$people))
Adding a variable. Obtain the student's first term in the data set from the term
data table using a left-outer join.
# Add all term variables by ID cols_to_join <- term[, .(mcid, term)] opt_DT <- cols_to_join[opt_DT, on = c("mcid")] # Filter for first term setkeyv(opt_DT, c("mcid", "term")) opt_DT <- opt_DT[, .SD[1], by = c("mcid")] # Create year variable opt_DT[, year := substr(term, 1, 4)] opt_DT[, year := as.factor(year)] opt_DT[, term := NULL] # Display result setcolorder(opt_DT, c("mcid", "people", "institution", "year", "proxy")) opt_DT
Filtering. Ensure complete cases except in proxy
.
# Identify complete cases in predictor variables rows_we_want <- complete.cases(opt_DT[, .(mcid, people, institution, year)]) # Filter for complete predictors opt_DT <- opt_DT[rows_we_want] opt_DT
Framework for optional predictors.
# Imputation framework opt_framework <- mice(opt_DT, maxit = 0) opt_framework
Imputation method for optional predictors.
# Imputation framework opt_method_vector <- opt_framework[["method"]] opt_method_vector
Predictor matrix for optional predictors.
# Imputation predictor matrix opt_predictor_matrix <- opt_framework[["predictorMatrix"]] opt_predictor_matrix
Percent missing data for setting the number of multiple imputations.
N_impute <- sum(is.na(opt_DT$proxy)) N_fye <- nrow(opt_DT) # Percent missing data round(100 * N_impute / N_fye, 0)
The three essential arguments for mice()
are the DT
data frame, the method_vector
, and the predictor_matrix
. The number of multiple imputations m
is set to r m_imputations
as discussed in [Missing data]. The default seed
argument is NULL, but by setting the seed as shown the vignette results are reproducible. Setting printFlag = TRUE
displays progress in the console.
For the practice data, 5 iterations of r m_imputations
imputations takes about 3 minutes (depending on your machine). For MIDFIELD research data, however, imputation runs significantly longer.
#| echo: false # load DT_mids, don't have to repeatedly run mice() load(here::here("R", "sysdata.rda"))
#| eval: false # Impute missing proxy data DT_mids <- mice( data = DT, m = m_imputations, maxit = 5, # default method = method_vector, predictorMatrix = predictor_matrix, seed = 20180624, printFlag = TRUE )
# output in console with printFlag = TRUE # > iter imp variable # > 1 1 proxy # > 1 2 proxy # > 1 3 proxy # > 1 4 proxy # > 1 5 proxy # > --- # > 5 33 proxy # > 5 34 proxy # > 5 35 proxy # > 5 36 proxy # > 5 37 proxy
Extracting the result. We apply mice::complete()
to extract the data from the mids
object. The missing data have been replaced by imputed values.
# Revert to default random number generation set.seed(NULL) # Extract data from the mids object DT <- mice::complete(DT_mids) # Convert to data.table structure setDT(DT) DT <- DT[order(mcid)] DT
Selecting columns. To use the result, we need only two columns: IDs and the the predicted starting programs.
# Subset the data DT <- DT[, .(mcid, proxy)] DT
Recoding. We convert the CIP codes from factor to character.
# Convert factors DT[, proxy := as.character(proxy)] DT
Filtering. Proxies are substitutes for students starting in FYE. Thus we filter to remove migrators, retaining the proxies of first-term FYE students only.
# Order term data by ID and term ordered_term <- term[, .(mcid, term, cip6)] setorderv(ordered_term, cols = c("mcid", "term")) # Obtain first term of all students first_term <- ordered_term[, .SD[1], by = c("mcid")] # Reduce to first term in FYE first_term_fye_mcid <- first_term[cip6 == "140102", .(mcid)] # Inner join to remove migrators from working data frame DT <- first_term_fye_mcid[DT, on = c("mcid"), nomatch = NULL] setkey(DT, NULL) DT
Verify prepared data. To avoid deriving this data frame each time it is needed in other vignettes, the same information is provided in the fye_proxy
data frame included with midfieldr. Here we verify that the two data frames have the same content.
# Demonstrate equivalence check_equiv_frames(DT, fye_proxy)
#| eval: false #| echo: false # Run manually if necessary for reproducibility # Writing internal file sysdata.rda to save DT_mids # Writing external file fye_proxy # Internal files usethis::use_data( DT_mids, subset_student, subset_course, subset_term, subset_degree, internal = TRUE, overwrite = TRUE ) # External file fye_proxy <- copy(DT) usethis::use_data(fye_proxy, overwrite = TRUE)
Here we summarize the FYE proxy data set to see how many students our algorithm assigned to which engineering majors. Start by extracting the unique set of CIP codes from the proxy data set.
# Identify unique CIP codes in the proxy data proxy_cips <- sort(unique(fye_proxy$proxy)) proxy_cips
Obtain the program names from the cip
data set (provided with midfieldr). We use the 4-digit names that in engineering generally represent department-level programs.
op <- options() options(datatable.print.topn = 13) # Obtain the 4-digit program names corresponding to these codes proxy_program_names <- filter_cip(keep_text = proxy_cips) proxy_program_names <- proxy_program_names[, .(cip6, program = cip4name)] proxy_program_names
Join these names to the proxy data set, summarize by program, and order the rows by descending N.
# Join these program names to the proxy data proxy_programs <- proxy_program_names[fye_proxy[, .(cip6 = proxy)], .(program), on = c("cip6")] # Count by program and order rows in descending magnitude proxy_programs <- proxy_programs[, .N, by = c("program")] setorderv(proxy_programs, order = -1, cols = c("N")) proxy_programs options(op)
For comparison, the National Science Foundation (NSF) reports that in 2012, the top seven US engineering majors ranked by enrollment were [@NSF:2014:online]:
In Table 1, we show the FYE proxy programs and indicate the equivalent NSF ranking cited above. The assignment of proxies is fairly consistent with the NSF results, though the practice data have a higher frequency of aerospace proxies than expected. Recall that the practice data contain only three institutions while the NSF information is based on nearly 3000 US undergraduate institutions [@NSF:2014:ch2].
#| echo: false library(gt) x <- copy(proxy_programs) x$nsf <- c("1", "2", "3", "", "6", "5", "4", "", "", "7", "", "", "") setcolorder(x, c("program", "nsf", "N")) setnames(x, old = c("program", "nsf"), new = c("Program", "NSF ranking") ) x |> gt() |> tab_caption("Table 1: Frequency of FYE proxies using the practice data") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
We conclude that the imputation is credible at least to the extent that the ranking of the majors is generally consistent with expectations.
The main goal of estimating FYE proxies is to prevent starter miscounts. Here, we assess the potential for miscounts if FYE records are not treated as recommended.
We start with the first_term
data frame created earlier (in [Post-processing]) containing the initial term information of all students in the practice data.
# First term of all students
first_term
Identify starters, including FYE proxies, in the four case study programs. (This procedure is more fully developed in the Starters vignette.)
# Join proxies by ID (left join) to first-term data start <- fye_proxy[first_term, .(mcid, cip6, proxy), on = c("mcid")] # Distinguish FYE from direct matriculants start[, matric := fcase( is.na(proxy), "direct", !is.na(proxy), "fye" )] # Create start variable start[, start := fcase( matric == "fye", proxy, matric == "direct", cip6 )] # Filter to retain case study program starters join_labels <- copy(study_programs) setnames(join_labels, old = "cip6", new = "start") start <- join_labels[start, on = c("start")] start <- start[!is.na(program)] # Display result start[order(matric, start)]
This data frame contains all direct-matriculation starters in the case study programs plus the FYE students with one of these programs as their estimated proxy.
Grouping by program and type of matriculation, we can determine the FYE percentage of all starters.
# Summarize start <- start[, .N, by = c("matric", "program")] # Transform to row-record form start <- dcast(start, program ~ matric, value.var = "N") # Compute FYE as fraction of total start[, N_starters := direct + fye] start[, fye_pct := round(100 * fye / N_starters, 1)] start
The results indicate (for the case study data) a potential under-count of 45% to 97% if FYE proxies are excluded when counting starters.
Given the number of lines of code and the number of case-specific parameters involved, a reusable code section is not provided.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.