In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-060-")

Before we can treat "starters", we have to introduce "FYE proxies"---estimates of the degree-granting engineering programs First-Year Engineering (FYE) students would have declared had they not been required to enroll in FYE.

Users of midfielddata practice data are not required to reproduce this vignette---the results are included with midfieldr in the fye_proxy data set.

This vignette in the MIDFIELD workflow.

Planning
Initial processing
[Blocs]{.accent}
- Ever-enrolled
- [FYE proxies]{.accent}
- Starters
- Graduates
Groupings
Metrics
Displays

Potential for starter miscounts

At some US institutions, engineering students are required to complete a First-Year Engineering (FYE) program as a prerequisite for declaring an engineering major. Administratively, degree-granting engineering programs such as Electrical Engineering or Mechanical Engineering treat their incoming post-FYE students as their "starting" cohorts. However, when computing a metric such as graduation rate that requires a count of starters, FYE records must be treated with special care to avoid a miscount.

To illustrate the potential for miscounting starters, suppose we wish to calculate a Mechanical Engineering (ME) graduation rate. Students starting in ME constitute the starting pool and the fraction of that pool graduating in ME is the graduation rate.

At FYE institutions, an ME program would typically define their starting pool as the post-FYE cohort entering their program. This may be the best information available, but it invariably undercounts starters by failing to account for FYE students who leave the institution or switch to non-engineering majors. In the absence of the FYE requirement some of these students would have been ME starters. By neglecting these students, the count of ME starters is artificially low resulting in an ME graduation rate that is artificially high. The same is true for every degree-granting engineering major in an FYE institution.

Because of the special nature of FYE programs, we cannot address starter miscounts by grouping FYE students with those admitted with "undecided" or "unknown" CIP codes---FYE students are neither. They were admitted as Engineering majors (2-digit CIP 14). We simply don't know to which degree-granting program (6-digit CIP) they intended to transition.

Therefore, to avoid miscounting starters at FYE institutions, we estimate the 6-digit CIP codes of the degree-granting engineering programs that FYE students would have declared had they not been required to enroll in FYE.

Definitions

Method

We apply prep_fye_mice() to the student and term source files to construct a data frame suitable for imputation using the mice R package. The procedure has four steps:

Use prep_fye_mice() from the midfieldr package to estimate some of the FYE proxy CIPs, treat the remainder as missing values, and structure the data frame for imputation.
Optional. If the default predictor variables (institution, race/ethnicity, and sex) do not meet the needs of your study, you can define your own.
Use mice() from the mice package to impute the 6-digit CIP missing values.
Post-processing to convert the results to useful form and to remove migrators.

Three outcomes are possible, depending on your goals and available data:

Use midfielddata practice data to recreate the fye_proxy data set included with midfieldr---as we do in this vignette.
Use midfielddata practice data to create an alternate set of FYE proxies based on a different random number seed or different predictor variables. The result would have the same IDs as fye_proxy but different ID-proxy pairings.
Use MIDFIELD research data and construct your own FYE proxies.

For a given set of source files, FYE proxies need be created only once and written to file. The result can be used as needed unless the source files change.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library("midfieldr")
library("midfielddata")
library("data.table")
library("ggplot2")
library("mice")

Load. Practice datasets. View data dictionaries via ?student, ?term.

# Load practice data
data(student, term)

Loads with midfieldr. Prepared data, derived in Programs. View data dictionary via ?study_programs.

study_programs

Initial processing

Unlike the initial processing in previous articles, we do not filter for data sufficiency and degree seeking.

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)

`prep_fye_mice()`

The purpose of prep_fye_mice() is preparing a data frame for the mice R package. Operates on the complete, unfiltered student and term source data to create a data frame with three predictor variables and an FYE proxy variable. The values in proxy are determined by a student's first post-FYE program code, as follows:

Post-FYE in Engineering. The student completes FYE and enrolls in an engineering major. For this outcome, we know that at the student's first opportunity, they enrolled in an engineering major of their choosing. The CIP code of that program is returned as the student's FYE proxy.
Post-FYE not in Engineering. The student migrates to a non-engineering major or has no post-FYE records in the database. The data provide no indication of the student's preferred degree-granting engineering major. Thus their FYE proxy value is returned as NA, to be treated as missing data to be imputed.

Arguments.

midfield_student Data frame of student observations, keyed by student ID. Default is student. Required variables are mcid, race, and sex. Use all rows of your source student data table.
midfield_term Data frame of term observations keyed by student ID. Default is term. Required variables are mcid, institution, term, and cip6. Use all rows of your source term data table.
fye_codes Optional character vector of 6-digit CIP codes assigned to FYE programs. Default is "140102". Argument to be used by name.

Implicit arguments. The following implementations yield identical results.

# Required arguments in order and explicitly named
x <- prep_fye_mice(midfield_student = student, midfield_term = term)

# Required arguments in order, but not named
y <- prep_fye_mice(student, term)

# Using the implicit defaults
z <- prep_fye_mice()

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output. The function returns one row per FYE student keyed by student ID. All variables except ID are returned as factors to meet the requirements of mice().

op <- options()
options(datatable.print.class = TRUE)

# Working data frame
DT <- prep_fye_mice(student, term)
DT

options(op)

Missing data

#| echo: false
N_ever_fye <- length(unique(DT$mcid))
N_to_impute <- sum(is.na(DT$proxy))

The output of prep_fye_mive() should contain missing values in the proxy column only. Other variables are complete. A race/ethnicity or sex value of "unknown" is treated as an observed value, not missing data. And while no values of ID or institution are unknown or missing in this example, such observations (if they existed) would have to be removed.

Checking that all variables except proxy are complete.

# Number of unique IDs
x <- length(unique(DT$mcid))

# Number of complete cases on four variables
y <- sum(complete.cases(DT[, .(mcid, race, sex, institution)]))

# Demonstrate equivalence
all.equal(x, y)

Number of missing observations in proxy.

# Number NAs in proxy
sum(is.na(DT$proxy))

# Percentage NAs in proxy
100 * round(sum(is.na(DT$proxy)) / nrow(DT), 3)

Missing at random (MAR). These missing proxy data are caused by a student's decision to migrate to a non-engineering major or to leave the database. At the time of making that decision, the FYE student would not yet have enrolled in a degree-granting engineering major, thus their decision is unlikely to be related to any specific engineering major.

That a CIP is missing, therefore, is unlikely to be related to a specific CIP value---but may be related to other observations such as institution, race/ethnicity, or sex. Missing data of this type are classified as "missing at random" (MAR) which are suitable for multiple imputation and yield unbiased results [@GraceMartin:2012].

Multiple imputation. Lastly, while 5--10 imputations are generally considered adequate for unbiasedness, Bodner [-@Bodner:2008] recommends having as many imputations as the percentage of missing data.

# Number of proxies to be imputed
(N_impute <- sum(is.na(DT$proxy)))

# Number of observations with complete predictor information
(N_complete <- sum(complete.cases(DT[, .(mcid, race, sex, institution)])))

# Percent missing proxies
(percent_missing <- round(100 * N_impute / N_complete, 3))

As shown here, the overall percentage of missing data is r percent_missing%, suggesting we set the number of imputations to r round(percent_missing, 0).

# For the "m" argument in mice()
(m_imputations <- round(percent_missing, 0))

Chart. The chart displays the percent missing data by category. The institution category isn't used because the practice data contain FYE students in one institution only. The vertical dashed line indicates the r m_imputations% percent missing data overall.

#| echo: false
#| label: fig01
#| fig.asp: 0.35
#| fig-cap: "Figure 1: Percent missing data by category."

# New memory location
x <- copy(DT)

# Convert factors to characters
x <- x[, lapply(.SD, as.character)]

# Overall percentage missing proxies
overall_miss <- 100 * round(nrow(x[is.na(proxy)]) / nrow(x), 3)

# Function for determining percent missing proxies by category
missing_fraction <- function(DT, cat_level) {
  DT[, group := fcase(
    is.na(proxy), "missing",
    default = "not_missing"
  )]
  DT <- DT[, .N, by = c("group", "cat_level")]
  DT <- dcast(DT, cat_level ~ group, value.var = "N")
  DT[, pct_miss := 100 * round(missing / (missing + not_missing), 3)]
  DT <- DT[, .(cat_level, pct_miss)]
}

# Apply to institution category
y <- x[, .(cat_level = institution, proxy)]
inst <- missing_fraction(y, cat_level = "institution")
inst[, category := "institution"]

# Apply to race/ethnicity category
y <- x[, .(cat_level = race, proxy)]
race <- missing_fraction(y, cat_level = "race")
race[, category := "race"]

# Apply to sex category
y <- x[, .(cat_level = sex, proxy)]
sex <- missing_fraction(y, cat_level = "sex")
sex[, category := "sex"]

# Gather results and plot
# Use race and sex only, only one institution
df <- rbindlist(list(race, sex))
ggplot(df, aes(x = pct_miss, y = reorder(cat_level, pct_miss))) +
  geom_vline(xintercept = overall_miss, linetype = 2, color = "gray35") +
  geom_point(size = 1.5) +
  facet_grid(
    rows = vars(reorder(category, pct_miss)),
    as.table = TRUE,
    scales = "free_y",
    space = "free_y"
  ) +
  labs(x = "FYE proxy missing values (%)", y = "") +
  scale_x_continuous(breaks = seq(0, 100, 5), limits = c(30, 55)) +
  theme(panel.grid.minor.x = element_blank())

Setting up `mice()`

The mice package [@vanBuuren+Oudshoorn:2011] implements multiple imputation by chained equations (MICE). MICE is also known as "fully conditional specification" or "sequential regression multiple imputation" and is suitable for categorical variables such as ours [@azur2011]. Our computational procedure follows the approach suggested by Dhana [-@dhana2017].

Standard predictors

Framework. Our first use of mice() is to examine the imputation framework by calling the function with zero iterations on the DT data frame. mice() produces a "multiply-imputed data set", an R object of class "mids".

# Imputation framework
framework <- mice(DT, maxit = 0)
framework

Logged events warning. The printout above includes a warning about two "logged events"---an indication that two variables will not be used as predictors. We can isolate the warning for a closer look,

# Examine the warning
framework$loggedEvents

The two variables are mcid and institution.

mcid was never intended to be a predictor variable. We retain the ID column so that imputed CIP values are assigned to specific IDs.
institution usually is a predictor. In this case, however, the FYE students are all at the same institution---a characteristic of the midfielddata practice data only.

Imputation methods. We look more closely at two elements of this framework. The first is the imputation method vector.

# Imputation method
method_vector <- framework[["method"]]
method_vector

The "polyreg" imputation method (polytomous logistic regression) is appropriate for data, like ours, comprising unordered categorical variables. Variable proxy is imputed using the polyreg method; the other variables, being predictors, are not imputed, thus their methods are empty.

Had the method not been correctly assigned, we would assign it as follows,

# Manually assign the variable(s) being imputed
method_vector[c("proxy")] <- "polyreg"

# Manually assign the variable(s) not being imputed
method_vector[c("mcid", "institution", "race", "sex")] <- ""
method_vector

Predictor matrix. The second element to review is the predictor matrix. A row label identifies the variable being predicted; the columns indicate the predictor variables.

# Imputation predictor matrix
predictor_matrix <- framework[["predictorMatrix"]]
predictor_matrix

However, only those variables assigned a method are imputed. In our case, the only variable to be imputed is proxy, so the only row of this matrix that gets used is the last row.

# Predictor row for this example
predictor_matrix["proxy", ]

The zeros and ones tell us that proxy is going to be predicted by race and sex. Again, the institution variable is not a predictor because the practice data contain one FYE institution only. (This would not be the case if one were using the MIDFIELD research database.)

Had the default setting been incorrect, we can set them manually. Again, note that the bottom row is the only row we need because only the proxy variable is being imputed.

# Manually assign zero columns
predictor_matrix[, c("mcid", "proxy", "institution")] <- 0

# Manually assign predictor columns
predictor_matrix[, c("race", "sex")] <- c(0, 0, 0, 0, 1)
predictor_matrix

If the data included more than one FYE institution, the manual assignment would be,

#| eval: false
# Not run
predictor_matrix[, c("mcid", "proxy")] <- 0
predictor_matrix[, c("race", "sex", "institution")] <- c(0, 0, 0, 0, 1)

Optional predictors

The default predictors set up by prep_fye_mice() are institution (required), race/ethnicity, and sex. If these are acceptable, you can skip to the next section, [Imputing missing values].

Predictors can be edited or added before invoking mice(). As before, ensure that the only missing values are in the proxy column. Other variables are expected to be complete (no NA values). A value of "unknown" in a predictor column, e.g., race/ethnicity or sex, is an acceptable value, not missing data. Observations with missing or unknown values in the ID or institution columns should be removed.

For example, suppose we wish to replace race/ethnicity and sex with a people variable that has four possible values (Domestic Female, Domestic Male, International Female, and International Male) where "domestic" means a US citizen; and we want to add a variable that encodes the year of a student's first term in FYE.

Creating variables. Remove any unknown observations of race/ethnicity and sex to create the desired people variable.

# Data frame to illustrate optional predictors
opt_DT <- copy(DT)

# Factor to character
cols_to_edit <- c("race", "sex")
opt_DT[, (cols_to_edit) := lapply(.SD, as.character), .SDcols = cols_to_edit]

# Filter unknown race and sex
opt_DT <- opt_DT[sex != "Unknown"]
opt_DT <- opt_DT[race != "Other/Unknown"]

# Create origin variable
opt_DT[, origin := fcase(
  race != "International", "Domestic",
  race == "International", "International",
  default = NA_character_
)]
opt_DT <- opt_DT[!is.na(origin)]

# Create people variable
opt_DT[, people := paste(origin, sex)]
opt_DT[, people := as.factor(people)]
opt_DT[, c("race", "sex", "origin") := NULL]

# Display result
setcolorder(opt_DT, c("mcid", "people", "institution", "proxy"))
opt_DT

Check the unique values.

# Display unique people
sort(unique(opt_DT$people))

Adding a variable. Obtain the student's first term in the data set from the term data table using a left-outer join.

# Add all term variables by ID
cols_to_join <- term[, .(mcid, term)]
opt_DT <- cols_to_join[opt_DT, on = c("mcid")]

# Filter for first term
setkeyv(opt_DT, c("mcid", "term"))
opt_DT <- opt_DT[, .SD[1], by = c("mcid")]

# Create year variable
opt_DT[, year := substr(term, 1, 4)]
opt_DT[, year := as.factor(year)]
opt_DT[, term := NULL]

# Display result
setcolorder(opt_DT, c("mcid", "people", "institution", "year", "proxy"))
opt_DT

Filtering. Ensure complete cases except in proxy.

# Identify complete cases in predictor variables
rows_we_want <- complete.cases(opt_DT[, .(mcid, people, institution, year)])

# Filter for complete predictors
opt_DT <- opt_DT[rows_we_want]
opt_DT

Framework for optional predictors.

# Imputation framework
opt_framework <- mice(opt_DT, maxit = 0)
opt_framework

Imputation method for optional predictors.

# Imputation framework
opt_method_vector <- opt_framework[["method"]]
opt_method_vector

Predictor matrix for optional predictors.

# Imputation predictor matrix
opt_predictor_matrix <- opt_framework[["predictorMatrix"]]
opt_predictor_matrix

Percent missing data for setting the number of multiple imputations.

N_impute <- sum(is.na(opt_DT$proxy))
N_fye <- nrow(opt_DT)

# Percent missing data
round(100 * N_impute / N_fye, 0)

Imputing missing values

The three essential arguments for mice() are the DT data frame, the method_vector, and the predictor_matrix. The number of multiple imputations m is set to r m_imputations as discussed in [Missing data]. The default seed argument is NULL, but by setting the seed as shown the vignette results are reproducible. Setting printFlag = TRUE displays progress in the console.

For the practice data, 5 iterations of r m_imputations imputations takes about 3 minutes (depending on your machine). For MIDFIELD research data, however, imputation runs significantly longer.

#| echo: false

# load DT_mids, don't have to repeatedly run mice()
load(here::here("R", "sysdata.rda"))

#| eval: false

# Impute missing proxy data
DT_mids <- mice(
  data = DT,
  m = m_imputations,
  maxit = 5, # default
  method = method_vector,
  predictorMatrix = predictor_matrix,
  seed = 20180624,
  printFlag = TRUE
)

# output in console with printFlag = TRUE
# >  iter imp variable
# >   1   1  proxy
# >   1   2  proxy
# >   1   3  proxy
# >   1   4  proxy
# >   1   5  proxy
# >   ---
# >   5  33  proxy
# >   5  34  proxy
# >   5  35  proxy
# >   5  36  proxy
# >   5  37  proxy

Post-processing

Extracting the result. We apply mice::complete() to extract the data from the mids object. The missing data have been replaced by imputed values.

# Revert to default random number generation
set.seed(NULL)

# Extract data from the mids object
DT <- mice::complete(DT_mids)

# Convert to data.table structure
setDT(DT)
DT <- DT[order(mcid)]
DT

Selecting columns. To use the result, we need only two columns: IDs and the the predicted starting programs.

# Subset the data
DT <- DT[, .(mcid, proxy)]
DT

Recoding. We convert the CIP codes from factor to character.

# Convert factors
DT[, proxy := as.character(proxy)]
DT

Filtering. Proxies are substitutes for students starting in FYE. Thus we filter to remove migrators, retaining the proxies of first-term FYE students only.

# Order term data by ID and term
ordered_term <- term[, .(mcid, term, cip6)]
setorderv(ordered_term, cols = c("mcid", "term"))

# Obtain first term of all students
first_term <- ordered_term[, .SD[1], by = c("mcid")]

# Reduce to first term in FYE
first_term_fye_mcid <- first_term[cip6 == "140102", .(mcid)]

# Inner join to remove migrators from working data frame
DT <- first_term_fye_mcid[DT, on = c("mcid"), nomatch = NULL]
setkey(DT, NULL)
DT

Verify prepared data. To avoid deriving this data frame each time it is needed in other vignettes, the same information is provided in the fye_proxy data frame included with midfieldr. Here we verify that the two data frames have the same content.

# Demonstrate equivalence
check_equiv_frames(DT, fye_proxy)

#| eval: false
#| echo: false

# Run manually if necessary for reproducibility
# Writing internal file sysdata.rda to save DT_mids
# Writing external file fye_proxy

# Internal files
usethis::use_data(
  DT_mids,
  subset_student,
  subset_course,
  subset_term,
  subset_degree,
  internal  = TRUE,
  overwrite = TRUE
)

# External file
fye_proxy <- copy(DT)
usethis::use_data(fye_proxy, overwrite = TRUE)

Assessing FYE proxies

Credibility

Here we summarize the FYE proxy data set to see how many students our algorithm assigned to which engineering majors. Start by extracting the unique set of CIP codes from the proxy data set.

# Identify unique CIP codes in the proxy data
proxy_cips <- sort(unique(fye_proxy$proxy))
proxy_cips

Obtain the program names from the cip data set (provided with midfieldr). We use the 4-digit names that in engineering generally represent department-level programs.

op <- options()
options(datatable.print.topn = 13)
# Obtain the 4-digit program names corresponding to these codes
proxy_program_names <- filter_cip(keep_text = proxy_cips)
proxy_program_names <- proxy_program_names[, .(cip6, program = cip4name)]
proxy_program_names

Join these names to the proxy data set, summarize by program, and order the rows by descending N.

# Join these program names to the proxy data
proxy_programs <- proxy_program_names[fye_proxy[, .(cip6 = proxy)], .(program), on = c("cip6")]

# Count by program and order rows in descending magnitude
proxy_programs <- proxy_programs[, .N, by = c("program")]
setorderv(proxy_programs, order = -1, cols = c("N"))
proxy_programs

options(op)

For comparison, the National Science Foundation (NSF) reports that in 2012, the top seven US engineering majors ranked by enrollment were [@NSF:2014:online]:

Mechanical
Electrical
Civil
Chemical
Industrial
Aerospace
Materials

In Table 1, we show the FYE proxy programs and indicate the equivalent NSF ranking cited above. The assignment of proxies is fairly consistent with the NSF results, though the practice data have a higher frequency of aerospace proxies than expected. Recall that the practice data contain only three institutions while the NSF information is based on nearly 3000 US undergraduate institutions [@NSF:2014:ch2].

#| echo: false
library(gt)
x <- copy(proxy_programs)
x$nsf <- c("1", "2", "3", "", "6", "5", "4", "", "", "7", "", "", "")
setcolorder(x, c("program", "nsf", "N"))
setnames(x,
  old = c("program", "nsf"),
  new = c("Program", "NSF ranking")
)
x |>
  gt() |>
  tab_caption("Table 1: Frequency of FYE proxies using the practice data") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

We conclude that the imputation is credible at least to the extent that the ranking of the majors is generally consistent with expectations.

Quantifying potential miscounts

The main goal of estimating FYE proxies is to prevent starter miscounts. Here, we assess the potential for miscounts if FYE records are not treated as recommended.

We start with the first_term data frame created earlier (in [Post-processing]) containing the initial term information of all students in the practice data.

# First term of all students
first_term

Identify starters, including FYE proxies, in the four case study programs. (This procedure is more fully developed in the Starters vignette.)

# Join proxies by ID (left join) to first-term data
start <- fye_proxy[first_term, .(mcid, cip6, proxy), on = c("mcid")]

# Distinguish FYE from direct matriculants
start[, matric := fcase(
  is.na(proxy), "direct",
  !is.na(proxy), "fye"
)]

# Create start variable
start[, start := fcase(
  matric == "fye", proxy,
  matric == "direct", cip6
)]

# Filter to retain case study program starters
join_labels <- copy(study_programs)
setnames(join_labels, old = "cip6", new = "start")
start <- join_labels[start, on = c("start")]
start <- start[!is.na(program)]

# Display result
start[order(matric, start)]

This data frame contains all direct-matriculation starters in the case study programs plus the FYE students with one of these programs as their estimated proxy.

Grouping by program and type of matriculation, we can determine the FYE percentage of all starters.

# Summarize
start <- start[, .N, by = c("matric", "program")]

# Transform to row-record form
start <- dcast(start, program ~ matric, value.var = "N")

# Compute FYE as fraction of total
start[, N_starters := direct + fye]
start[, fye_pct := round(100 * fye / N_starters, 1)]
start

The results indicate (for the case study data) a potential under-count of 45% to 97% if FYE proxies are excluded when counting starters.

Reusable code

Given the number of lines of code and the number of case-specific parameters involved, a reusable code section is not provided.

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Potential for starter miscounts

Definitions

Method

Load data

Initial processing

`prep_fye_mice()`

Missing data

Setting up `mice()`

Standard predictors

Optional predictors

Imputing missing values

Post-processing

Assessing FYE proxies

Credibility

Quantifying potential miscounts

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Potential for starter miscounts

Definitions

Method

Load data

Initial processing

prep_fye_mice()

Missing data

Setting up mice()

Standard predictors

Optional predictors

Imputing missing values

Post-processing

Assessing FYE proxies

Credibility

Quantifying potential miscounts

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

`prep_fye_mice()`

Setting up `mice()`