#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-100-") library("ggplot2")
Graduation rate is a widely used, though flawed, measure of academic achievement.
The American Council on Education estimates that the conventional definition of graduation rate may exclude up to 60% of students at 4-year institutions [@Cook+Hartle:2011]. Nevertheless, as Cook and Hartle explain,
... in the eyes of the public, policy makers, and the media, graduation rate is a clear, simple, and logical---if often misleading---number.
Recognizing that graduation rate is a popular metric, we propose a definition of graduation rate that includes all conventionally excluded students except migrators. You can skip the FYE content in this vignette if your study includes no FYE-style Engineering programs.
This vignette in the MIDFIELD workflow.
[ G=\frac{N_{sg}}{N_s} ]
starter-graduates
: Subset of the starters bloc who are graduates (timely completers) from their starting programs.
As they pertain to the graduation rate metric, relationships among starters, migrators, and graduates (timely completers) of a given program P are illustrated in Figure 1.
The overall rectangle represents the set of students ever enrolled in program P.
The interior rectangle represents the set of graduates (timely completers) of program P.
Region 1 (shaded) represents the graduation rate denominator $(N_s)$, the set of starters in program P.
Region 2 (shaded) represents the graduation rate numerator $(N_{sg})$, the subset of starters who are also graduates of program P.
Region 3 (unshaded) represents the set of students excluded from the graduation rate metric, depending on how "program" is defined as discussed below.
#| echo: false #| label: fig01 #| fig-width: 12 #| fig-asp: 0.7 #| fig-cap: "Figure 1. Graduation rate metric. Starters, migrators, and timely completers." df_tile <- data.frame( x = rep(c(2, 4), 2), # centerline of rectangle y = rep(c(1), each = 2), # centerline z = factor(rep(1:2)) ) delta <- 0.02 # x-position, center of circled numbers c1 <- 2.57 c2 <- 1.75 # 1.55 c3 <- 4 # 3.43 c4 <- 4.25 df_box1 <- data.frame( x = c(1, 5, 5) + delta * c(-1, 1, 1), y = c(0.5, 0.5, 1.5) + delta / 2 * c(-1, -1, 1) ) df_box2 <- data.frame( x = c(1, 1, 5) + delta * c(-1, -1, 1), y = c(0.5, 1.5, 1.5) + delta / 2 * c(-1, 1, 1) ) df_circ <- data.frame(x = c(c1, c2, c3, c4), y = c(1, 1, 1, 1.35)) df_hash <- data.frame( x = c(3, 3), xend = c(5, 5), y = c(0.5, 1.5), yend = c(1.5, 0.5) ) df_circ2 <- data.frame(x = 4, y = 0.99) ggplot(df_tile, aes(x, y)) + geom_tile(aes(fill = z)) + scale_x_continuous(breaks = seq(0, 16, 2)) + scale_fill_manual( values = c("#80cdc1", "white"), # "#dfc27d" "#80cdc1" aesthetics = c("colour", "fill") ) + geom_segment( data = df_hash, aes(x = x, xend = xend, y = y, yend = yend), linewidth = 0.5, color = "gray70" ) + # interior rectangle geom_point( data = data.frame(x = 3, y = 0.95), aes(x = x, y = y), shape = 22, size = 190, color = "gray30", fill = "white", alpha = 0.4 ) + theme_void() + theme(legend.position = "none") + geom_line(data = df_box1, aes(x = x, y = y), linewidth = 1, linetype = 2) + geom_line(data = df_box2, aes(x = x, y = y), linewidth = 1, linetype = 2) + scale_y_continuous(limits = c(0.4, 1.6)) + annotate("text", x = c(c1, c2, c3, c4, 3, 3), y = c(0.925, 1.45, 0.925, 1.45, 1.27, 1.55), # 1.37 label = c( "", # starter-completers "starters in program P", "", # migrator-completers "migrators into program P", "timely completers of program P", "ever enrolled in program P" ), hjust = 0.5, vjust = 0.5, size = 6 ) + geom_point( data = df_circ[1:3, ], aes(x = x, y = y), shape = 21, size = 18, fill = c("transparent", "transparent", "white") ) + annotate("text", x = df_circ$x[1:3], y = df_circ$y[1:3], label = c("2", "1", "3"), hjust = 0.5, vjust = 0.5, size = 6 )
When calculating graduation rate, whether migrator-graduates are included in the count of graduates depends how a program is defined in terms of CIP codes.
Institution level. Graduation rate computed at the institution level includes all migrators within the institution. For example, starters in Engineering (CIP 14) who graduate in Business (CIP 52) are both starters and timely completers at the institution level. IPEDS defines this rate as the institution completion rate.
2-digit CIP. Graduation rate includes migrator graduates within the same 2-digit CIP. For example, starters in Engineering (CIP 14) graduating in Business (CIP 52) are excluded from the count of Business graduates, but migrators within Engineering (all 6-digit CIP codes starting with 14) are both starters and timely completers in Engineering.
4-digit CIP. Similar to the 2-digit case. For example, starters in Electrical Engineering (CIP 1410) graduating in Mechanical Engineering (CIP 1419) are excluded from the count of Mechanical Engineering graduates, but migrators within Electrical Engineering (all 6-digit CIP codes starting with 1410) are both starters and timely completers in Electrical Engineering.
6-digit CIP. Rarely used. Graduation rate at this CIP level excludes all migrators from the count of graduates.
Multiple CIPs. In some cases, a single program or major includes different 4-digit CIPs. For example, migrators between Systems Engineering (CIP 1427), Industrial Engineering (CIP 1435), Manufacturing Engineering (CIP 1436), and Operations Research (CIP 1437) might be considered both starters and timely completers in a general program of Industrial & Systems Engineering.
In the US, the predominant definition of graduation rate is that established by the US Department of Education, Integrated Postsecondary Education Data System (IPEDS). The IPEDS definition underlies the finding cited earlier that a graduation rate metric may exclude up to 60% of students.
Many of the IPEDS exclusions relate to how starters are defined. By expanding the starters definition, MIDFIELD proposes a graduation rate definition that includes all conventionally excluded students except migrators.
graduation rate (IPEDS)
: The fraction of a cohort of full-time, first-time, degree-seeking undergraduates who complete their program within a percentage (100%, 150%, or 200%) of the "normal" time (typically 4 years) as defined by the institution. IPEDS excludes students who attend college part-time, who transfer between institutions, and who start in Winter or Spring terms [@IPEDS:2020].
graduation rate (MIDFIELD)
: The fraction of a cohort of degree-seeking undergraduates who complete their program in a timely manner (typically 6 years). MIDFIELD includes students who attend college part-time, who transfer between institutions, and who start in any term. Table 1 summarizes the comparison between the IPEDS and MIDFIELD graduation rate definitions.
#| echo: false library("data.table") library(gt) wrapr::build_frame( "Item", "IPEDS", "MIDFIELD", "MIDFIELD notes" | "completion span:", "4, 6, or 8 years", "4, 6, or 8 years", "Typical usage is 6 years" | "students admitted in:", "Summer/Fall only", "any term", "" | "part-time students are:", "excluded", "included", "Timely completion same as full-time students" | "transfer students are:", "excluded", "included", "Timely completion span adjusted for level at entry" ) |> gt() |> tab_caption("Table 1. Comparing graduation rate definitions") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
First-Year Engineering (FYE) starters
: We estimate the degree-granting engineering program in which an FYE student would have enrolled had they not been required to enroll in FYE. The FYE proxy, a 6-digit CIP code, denotes the program of which the FYE student can be considered a starter. For additional details, see the vignette FYE proxies.
Demonstrating the following elements of a MIDFIELD workflow.
Planning. The metric is graduation rate. Required blocs are starters and the subset of starters who graduate in their starting major. Grouping variables are program, race/ethnicity, and sex. Programs are the four Engineering programs used throughout.
Initial processing. Filter the student-level records for data sufficiency and degree-seeking.
Blocs. Gather starters, filter by program. Gather graduates, filter by program, filter by starters' IDs and programs.
Groupings. Add grouping variables.
Metrics Summarize by grouping variables and compute graduation rate.
Displays Create multiway chart and results table.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(midfielddata) library(data.table) library(ggplot2)
Load. Practice datasets. View data dictionaries via ?student
, ?term
, ?degree
.
# Load practice data data(student, term, degree)
Loads with midfieldr. Prepared data. View data dictionaries via ?study_programs
, ?baseline_mcid
, ?fye_proxy
.
study_programs
(derived in Programs).
baseline_mcid
(derived in Blocs).
fye_proxy
(derived in FYE proxies).
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables source_student <- copy(student) source_term <- copy(term) source_degree <- copy(degree) # Optional. Select variables required by midfieldr functions student <- select_required(source_student) term <- select_required(source_term) degree <- select_required(source_degree)
# Working data frame DT <- copy(baseline_mcid) DT
Starters. The summary code chunk from Starters
# Isolate starting term DT <- term[DT, .(mcid, term, cip6), on = c("mcid")] DT <- DT[!cip6 %like% "999999"] setorderv(DT, cols = c("mcid", "term")) DT <- DT[, .SD[which.min(term)], by = "mcid"] DT <- DT[, .(mcid, cip6)] DT <- unique(DT) # Continue for starters with FYE DT <- fye_proxy[DT, .(mcid, cip6, proxy), on = c("mcid")] DT[, start := fcase( cip6 == "140102", proxy, cip6 != "140102", cip6 )] DT <- DT[, .(mcid, start)] # Filter by program on start join_labels <- copy(study_programs) join_labels <- join_labels[, .(program, start = cip6)] DT <- join_labels[DT, on = c("start"), nomatch = NULL] DT[, start := NULL] DT <- unique(DT)
Copy. To prepare for joining with graduates.
# Prepare for joining setcolorder(DT, c("mcid")) starters <- copy(DT) starters
Initialize. The data frame of baseline IDs is the intake for this section.
# Working data frame DT <- copy(baseline_mcid)
Graduates The summary code chunk from Graduates
# Gather graduates, degree CIPs and terms DT <- add_timely_term(DT, term) DT <- add_completion_status(DT, degree) DT <- DT[completion_status == "timely"] DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")] # Filter by programs and first degree terms DT <- study_programs[DT, on = c("cip6"), nomatch = NULL] DT <- DT[, .SD[which.min(term_degree)], by = "mcid"] DT[, c("cip6", "term_degree") := NULL] DT <- unique(DT) DT
#| echo: false #| eval: false # finding the closer look IDs # example 1 DT[starters, on = "mcid", nomatch = NULL][program == i.program] # example 2 DT[starters, on = "mcid", nomatch = NULL][program != i.program] # example 3 DT[starters, on = "mcid"][is.na(program)]
This section introduces new material---not adapted from the reusable code sections of other vignettes.
For a graduation rate metric, a timely completer is counted among the graduates only if they start and complete the same program.
Filter. Use an inner join to filter the graduates by ID and program to match the IDs and programs of starters.
# Starter-graduates DT <- starters[DT, on = c("mcid", "program"), nomatch = NULL]
Copy. To prepare for joining with starters.
# Prepare for joining setcolorder(DT, c("mcid")) graduates <- copy(DT) graduates
Examining the records of selected students in detail.
Example 1. The student is a starter and a timely completer in Industrial/Systems Engineering (ISE). They appear in both blocs.
# Same ID in different blocs mcid_we_want <- "MCID3111150194" starters[mcid == mcid_we_want] graduates[mcid == mcid_we_want]
Example 2. The student is a starter in Electrical Engineering (EE). They are excluded from the graduation rate starter-graduate bloc because they did not complete EE. From degree
we find that they completed CIP 143501 (ISE), one of the study programs. They are also excluded from a count of ISE graduates because they weren't a ISE starter.
# Same ID in different blocs mcid_we_want <- "MCID3111235261" starters[mcid == mcid_we_want] graduates[mcid == mcid_we_want] degree[mcid == mcid_we_want, .(mcid, cip6)]
Example 3. The student is a starter in Civil Engineering (CE). They are excluded from the graduation rate starter-graduate bloc because they did not complete CE. From degree
we find that they completed CIP 521401 (Marketing). They would also be excluded from a count of Marketing graduates because they weren't a Marketing starter.
#| collapse: true # Same ID in different blocs mcid_we_want <- "MCID3111158691" starters[mcid == mcid_we_want] graduates[mcid == mcid_we_want] degree[mcid == mcid_we_want, .(mcid, cip6)]
One of our grouping variables (program
) is already included in the data frames. The next grouping variable is bloc
to distinguish starters from graduates when the two data frames are combined.
Add a variable. Label starters and graduates.
# For grouping by bloc starters[, bloc := "starters"] graduates[, bloc := "graduates"]
Join. Combine the two blocs to prepare for summarizing. A student starting and graduating in the same program now has two observations in these data: one as a starter and one as a graduate.
# Prepare for summarizing DT <- rbindlist(list(starters, graduates)) DT
Add variables. Demographics from Groupings
# Join race/ethnicity and sex cols_we_want <- student[, .(mcid, race, sex)] DT <- cols_we_want[DT, on = c("mcid")] DT
Summarize. Count the numbers of observations for each combination of the grouping variables.
# Count observations by group grouping_variables <- c("bloc", "program", "race", "sex") DT <- DT[, .N, by = grouping_variables] setorderv(DT, grouping_variables) DT
Reshape. Transform to row-record form to set up the graduation rate calculation. Transform the N column into two columns, one for starters and one for graduates.
# Prepare to compute metric DT <- dcast(DT, program + race + sex ~ bloc, value.var = "N", fill = 0) DT
Create a variable. Compute the metric.
# Compute metric DT[, rate := round(100 * graduates / starters, 1)] DT
Filter. To preserve the anonymity of the people involved, we remove observations with fewer than N_threshold
graduates. With the research data, we typically set this threshold to 10; with the practice data, we demonstrate the procedure using a threshold of 5.
# Preserve anonymity N_threshold <- 5 # 10 for research data DT <- DT[graduates >= N_threshold] DT
Recode. Readers can more readily interpret our charts and tables if the programs are unabbreviated.
# Recode values for chart and table readability DT[, program := fcase( program %like% "CE", "Civil", program %like% "EE", "Electrical", program %like% "ME", "Mechanical", program %like% "ISE", "Industrial/Systems" )] DT
Add a variable. We combine race/ethnicity and sex to create a combined grouping variable.
# Create a combined category DT[, people := paste(race, sex)] DT[, `:=`(race = NULL, sex = NULL)] setcolorder(DT, c("program", "people")) DT
Order factors. Order the levels of the categories. Code adapted from Multiway data and charts.
# Order the categories DT <- order_multiway(DT, quantity = "rate", categories = c("program", "people"), method = "percent", ratio_of = c("graduates", "starters") ) DT
Multiway chart. Code adapted from Multiway data and charts.
The vertical reference line is the aggregate graduation rate of the program, independent of race/ethnicity and sex. A missing data marker or missing group indicates the number of graduates was below the threshold set to preserve anonymity---largely an artifact of applying these groupings to practice data.
#| label: fig02 #| fig-asp: 1.1 #| fig-cap: "Figure 2: Graduation rates of four Engineering majors." ggplot(DT, aes(x = rate, y = people)) + facet_wrap(vars(program), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = program_rate), linetype = 2, color = "gray60") + geom_point() + labs(x = "Graduation rate (%)", y = "") + scale_x_continuous(limits = c(20, 90), breaks = seq(0, 100, 10))
Results table. Code adapted from Multiway data and charts.
# Select variables and remove factors display_table <- copy(DT) display_table <- display_table[, .(program, people, rate)] display_table[, people := as.character(people)] display_table[, program := as.character(program)] # Construct table display_table <- dcast(display_table, people ~ program, value.var = "rate") setnames(display_table, old = c("people"), new = c("People"), skip_absent = TRUE ) display_table
(Optional) Format the table nearer to publication quality. Here I use the 'gt' package.
library(gt) display_table |> gt() |> tab_caption("Table 2: Graduation rates (%) of four Engineering majors") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
A value of NA indicates a group removed because the number of graduates was below the threshold set to preserve anonymity. As noted earlier, these are largely an artifact of applying these groupings to practice data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.