#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-100-")
library("ggplot2")

Graduation rate is a widely used, though flawed, measure of academic achievement.

The American Council on Education estimates that the conventional definition of graduation rate may exclude up to 60% of students at 4-year institutions [@Cook+Hartle:2011]. Nevertheless, as Cook and Hartle explain,

> ... in the eyes of the public, policy makers, and the media, graduation rate is a clear, simple, and logical---if often misleading---number.

Recognizing that graduation rate is a popular metric, we propose a definition of graduation rate that includes all conventionally excluded students except migrators. You can skip the FYE content in this vignette if your study includes no FYE-style Engineering programs.

Where this vignette fits in the MIDFIELD workflow:

  1. Planning
  2. Initial processing
  3. Blocs
  4. Groupings
  5. [Metrics]{.accent}
    • [Graduation rate]{.accent}
    • Stickiness
  6. Displays

Definitions


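$$
G = \frac{N_{sg}}{N_s}
$$

where $G$ is the graduation rate of a program, $N_{sg}$ is the number of starter-graduates, and $N_s$ is the number of starters.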



starter-graduates

: Subset of the starters bloc who are graduates (timely completers) from their starting programs.
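
As a quick illustration of the ratio of starter-graduates to starters, here is a minimal sketch in base R. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from MIDFIELD data.

```r
# Hypothetical counts for a single program (not MIDFIELD data)
n_starters <- 200          # students who start in the program
n_starter_graduates <- 120 # starters who complete that same program on time

# Graduation rate as a fraction and as a percentage
grad_rate <- n_starter_graduates / n_starters
grad_rate_pct <- round(100 * grad_rate, 1)
grad_rate_pct
```

Students who complete the program after migrating into it would not appear in either count.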




Starters and migrators

As they pertain to the graduation rate metric, relationships among starters, migrators, and graduates (timely completers) of a given program P are illustrated in Figure 1.

#| echo: false
#| label: fig01
#| fig-width: 12
#| fig-asp: 0.7
#| fig-cap: "Figure 1. Graduation rate metric. Starters, migrators, and timely completers."

df_tile <- data.frame(
  x = rep(c(2, 4), 2), # centerline of rectangle
  y = rep(c(1), each = 2), # centerline
  z = factor(rep(1:2))
)

delta <- 0.02

# x-position, center of circled numbers
c1 <- 2.57
c2 <- 1.75 # 1.55
c3 <- 4 # 3.43
c4 <- 4.25

df_box1 <- data.frame(
  x = c(1, 5, 5) + delta * c(-1, 1, 1),
  y = c(0.5, 0.5, 1.5) + delta / 2 * c(-1, -1, 1)
)
df_box2 <- data.frame(
  x = c(1, 1, 5) + delta * c(-1, -1, 1),
  y = c(0.5, 1.5, 1.5) + delta / 2 * c(-1, 1, 1)
)
df_circ <- data.frame(x = c(c1, c2, c3, c4), y = c(1, 1, 1, 1.35))
df_hash <- data.frame(
  x = c(3, 3), xend = c(5, 5),
  y = c(0.5, 1.5), yend = c(1.5, 0.5)
)
df_circ2 <- data.frame(x = 4, y = 0.99)

ggplot(df_tile, aes(x, y)) +
  geom_tile(aes(fill = z)) +
  scale_x_continuous(breaks = seq(0, 16, 2)) +
  scale_fill_manual(
    values = c("#80cdc1", "white"), # "#dfc27d"  "#80cdc1"
    aesthetics = c("colour", "fill")
  ) +
  geom_segment(
    data = df_hash,
    aes(x = x, xend = xend, y = y, yend = yend),
    linewidth = 0.5,
    color = "gray70"
  ) +
  # interior rectangle
  geom_point(
    data = data.frame(x = 3, y = 0.95),
    aes(x = x, y = y),
    shape = 22,
    size = 190,
    color = "gray30",
    fill = "white",
    alpha = 0.4
  ) +
  theme_void() +
  theme(legend.position = "none") +
  geom_line(data = df_box1, aes(x = x, y = y), linewidth = 1, linetype = 2) +
  geom_line(data = df_box2, aes(x = x, y = y), linewidth = 1, linetype = 2) +
  scale_y_continuous(limits = c(0.4, 1.6)) +
  annotate("text",
    x = c(c1, c2, c3, c4, 3, 3),
    y = c(0.925, 1.45, 0.925, 1.45, 1.27, 1.55), # 1.37
    label = c(
      "", # starter-completers
      "starters in program P",
      "", #  migrator-completers
      "migrators into program P",
      "timely completers of program P",
      "ever enrolled in program P"
    ),
    hjust = 0.5,
    vjust = 0.5,
    size = 6
  ) +
  geom_point(
    data = df_circ[1:3, ],
    aes(x = x, y = y),
    shape = 21,
    size = 18,
    fill = c("transparent", "transparent", "white")
  ) +
  annotate("text",
    x = df_circ$x[1:3],
    y = df_circ$y[1:3],
    label = c("2", "1", "3"),
    hjust = 0.5,
    vjust = 0.5,
    size = 6
  )


When calculating graduation rate, whether migrator-graduates are included in the count of graduates depends on how a program is defined in terms of CIP codes.

Who is a starter?

In the US, the predominant definition of graduation rate is that established by the US Department of Education, Integrated Postsecondary Education Data System (IPEDS). The IPEDS definition underlies the finding cited earlier that a graduation rate metric may exclude up to 60% of students.

Many of the IPEDS exclusions relate to how starters are defined. By expanding the starters definition, MIDFIELD proposes a graduation rate definition that includes all conventionally excluded students except migrators.

graduation rate (IPEDS)

: The fraction of a cohort of full-time, first-time, degree-seeking undergraduates who complete their program within a percentage (100%, 150%, or 200%) of the "normal" time (typically 4 years) as defined by the institution. IPEDS excludes students who attend college part-time, who transfer between institutions, and who start in Winter or Spring terms [@IPEDS:2020].

graduation rate (MIDFIELD)

: The fraction of a cohort of degree-seeking undergraduates who complete their program in a timely manner (typically 6 years). MIDFIELD includes students who attend college part-time, who transfer between institutions, and who start in any term.

Table 1 summarizes the comparison between the IPEDS and MIDFIELD graduation rate definitions.

#| echo: false
library("data.table")
library(gt)
wrapr::build_frame(
  "Item", "IPEDS", "MIDFIELD", "MIDFIELD notes" |
    "completion span:", "4, 6, or 8 years", "4, 6, or 8 years", "Typical usage is 6 years" |
    "students admitted in:", "Summer/Fall only", "any term", "" |
    "part-time students are:", "excluded", "included", "Timely completion same as full-time students" |
    "transfer students are:", "excluded", "included", "Timely completion span adjusted for level at entry"
) |>
  gt() |>
  tab_caption("Table 1. Comparing graduation rate definitions") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
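
To make the difference in cohort membership concrete, the following sketch applies both sets of inclusion rules to a handful of made-up student records. The column names (`full_time`, `transfer`, `start_term`) are hypothetical and are not MIDFIELD variables. All five students are assumed to be degree-seeking, and "first-time" is approximated as "not a transfer student".

```r
library(data.table)

# Hypothetical student records; column names are illustrative only
toy <- data.table(
  id         = c("A", "B", "C", "D", "E"),
  full_time  = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  transfer   = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  start_term = c("Fall", "Fall", "Fall", "Spring", "Summer")
)

# IPEDS-style cohort: full-time, first-time, Summer/Fall starters only
ipeds_cohort <- toy[full_time & !transfer & start_term %in% c("Summer", "Fall")]

# MIDFIELD-style cohort: part-time, transfer, and any-term starters are kept
midfield_cohort <- copy(toy)

ipeds_cohort$id
midfield_cohort$id
```

In this toy example the IPEDS-style rules retain two of the five students, while the MIDFIELD-style cohort retains all five.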


First-Year Engineering (FYE) starters

: We estimate the degree-granting engineering program in which an FYE student would have enrolled had they not been required to enroll in FYE. The FYE proxy, a 6-digit CIP code, denotes the program of which the FYE student can be considered a starter. For additional details, see the vignette FYE proxies.

Method

This vignette demonstrates the following elements of a MIDFIELD workflow.

  1. Planning.   The metric is graduation rate. Required blocs are starters and the subset of starters who graduate in their starting major. Grouping variables are program, race/ethnicity, and sex. Programs are the four Engineering programs used throughout.

  2. Initial processing.   Filter the student-level records for data sufficiency and degree-seeking.

  3. Blocs.   Gather starters, filter by program. Gather graduates, filter by program, filter by starters' IDs and programs.

  4. Groupings.   Add grouping variables.

  5. Metrics.   Summarize by grouping variables and compute graduation rate.

  6. Displays.   Create multiway chart and results table.


Load data

Start.   If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)
library(ggplot2)

Load.   Practice datasets. View data dictionaries via ?student, ?term, ?degree.

# Load practice data
data(student, term, degree)

Loads with midfieldr.   Prepared data. View data dictionaries via ?study_programs, ?baseline_mcid, ?fye_proxy.

Initial processing

Select (optional).   Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

# Working data frame
DT <- copy(baseline_mcid)
DT

Starters

Starters.   The summary code chunk from Starters.

# Isolate starting term
DT <- term[DT, .(mcid, term, cip6), on = c("mcid")]
DT <- DT[!cip6 %like% "999999"]
setorderv(DT, cols = c("mcid", "term"))
DT <- DT[, .SD[which.min(term)], by = "mcid"]
DT <- DT[, .(mcid, cip6)]
DT <- unique(DT)

# Continue for starters with FYE
DT <- fye_proxy[DT, .(mcid, cip6, proxy), on = c("mcid")]
DT[, start := fcase(
  cip6 == "140102", proxy,
  cip6 != "140102", cip6
)]
DT <- DT[, .(mcid, start)]

# Filter by program on start
join_labels <- copy(study_programs)
join_labels <- join_labels[, .(program, start = cip6)]
DT <- join_labels[DT, on = c("start"), nomatch = NULL]
DT[, start := NULL]
DT <- unique(DT)
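
The two key moves in the chunk above, keeping each student's earliest term and replacing the FYE code with a proxy, can be sketched on a few made-up records. The IDs, terms, and proxy assignments below are hypothetical. The sketch takes the first row per student after sorting, which is a simpler stand-in for the `which.min(term)` step used above.

```r
library(data.table)

# Hypothetical term records (IDs, terms, and program histories are made up)
toy_term <- data.table(
  mcid = c("S1", "S1", "S2", "S2", "S3"),
  term = c("20001", "20003", "20001", "20003", "20011"),
  cip6 = c("140102", "140801", "141001", "141901", "140102")
)

# Hypothetical FYE proxies for the two students who started in FYE (140102)
toy_proxy <- data.table(mcid = c("S1", "S3"), proxy = c("140801", "141901"))

# After sorting, the first row per student is the earliest term
setorderv(toy_term, c("mcid", "term"))
first_term <- toy_term[, .SD[1L], by = "mcid"]
first_term <- first_term[, .(mcid, cip6)]

# Left join the proxies onto the first-term rows, then substitute for FYE
toy_start <- toy_proxy[first_term, .(mcid, cip6, proxy), on = "mcid"]
toy_start[, start := fcase(
  cip6 == "140102", proxy,
  cip6 != "140102", cip6
)]
toy_start[, .(mcid, start)]
```

Student S2 never enrolled in FYE, so their proxy is missing and their starting program is simply their first-term CIP code.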

Copy.   To prepare for joining with graduates.

# Prepare for joining
setcolorder(DT, c("mcid"))
starters <- copy(DT)
starters

Graduates

Initialize.   The data frame of baseline IDs is the intake for this section.

# Working data frame
DT <- copy(baseline_mcid)

Graduates.   The summary code chunk from Graduates.

# Gather graduates, degree CIPs and terms
DT <- add_timely_term(DT, term)
DT <- add_completion_status(DT, degree)
DT <- DT[completion_status == "timely"]
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]

# Filter by programs and first degree terms
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT
#| echo: false
#| eval: false
# finding the closer look IDs
# example 1
DT[starters, on = "mcid", nomatch = NULL][program == i.program]

# example 2
DT[starters, on = "mcid", nomatch = NULL][program != i.program]

# example 3
DT[starters, on = "mcid"][is.na(program)]

Starter-graduates

This section introduces new material---not adapted from the reusable code sections of other vignettes.

For a graduation rate metric, a timely completer is counted among the graduates only if they start and complete the same program.

Filter.   Use an inner join to filter the graduates by ID and program to match the IDs and programs of starters.

# Starter-graduates
DT <- starters[DT, on = c("mcid", "program"), nomatch = NULL]
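
To see why the join uses both `mcid` and `program` as keys, consider this small sketch with hypothetical blocs. The IDs are made up, and the object names `toy_starters` and `toy_completers` are illustrative only.

```r
library(data.table)

# Hypothetical starters bloc
toy_starters <- data.table(
  mcid    = c("S1", "S2", "S3"),
  program = c("ISE", "EE", "CE")
)

# Hypothetical timely completers: S2 started in EE but completed ISE,
# and S3 has no timely degree in a study program
toy_completers <- data.table(
  mcid    = c("S1", "S2"),
  program = c("ISE", "ISE")
)

# Inner join on both keys keeps only students who start and
# complete the same program
toy_sg <- toy_starters[toy_completers, on = c("mcid", "program"), nomatch = NULL]
toy_sg
```

Joining on `mcid` alone would have retained S2 as well, counting a migrator among the starter-graduates.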

Copy.   To prepare for joining with starters.

# Prepare for joining
setcolorder(DT, c("mcid"))
graduates <- copy(DT)
graduates

Closer look

Examining the records of selected students in detail.

Example 1.   The student is a starter and a timely completer in Industrial/Systems Engineering (ISE). They appear in both blocs.

# Same ID in different blocs
mcid_we_want <- "MCID3111150194"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

Example 2.   The student is a starter in Electrical Engineering (EE). They are excluded from the graduation rate starter-graduate bloc because they did not complete EE. From degree we find that they completed CIP 143501 (ISE), one of the study programs. They are also excluded from a count of ISE graduates because they weren't an ISE starter.

# Same ID in different blocs
mcid_we_want <- "MCID3111235261"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

degree[mcid == mcid_we_want, .(mcid, cip6)]

Example 3.   The student is a starter in Civil Engineering (CE). They are excluded from the graduation rate starter-graduate bloc because they did not complete CE. From degree we find that they completed CIP 521401 (Marketing). They would also be excluded from a count of Marketing graduates because they weren't a Marketing starter.

#| collapse: true

# Same ID in different blocs
mcid_we_want <- "MCID3111158691"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

degree[mcid == mcid_we_want, .(mcid, cip6)]

Groupings

One of our grouping variables (program) is already included in the data frames. The next grouping variable is bloc to distinguish starters from graduates when the two data frames are combined.

Add a variable.   Label starters and graduates.

# For grouping by bloc
starters[, bloc := "starters"]
graduates[, bloc := "graduates"]

Combine.   Stack the two blocs by row to prepare for summarizing. A student starting and graduating in the same program now has two observations in these data: one as a starter and one as a graduate.

# Prepare for summarizing
DT <- rbindlist(list(starters, graduates))
DT

Add variables.   Demographics from Groupings.

# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT

Graduation rate

Summarize.   Count the numbers of observations for each combination of the grouping variables.

# Count observations by group
grouping_variables <- c("bloc", "program", "race", "sex")
DT <- DT[, .N, by = grouping_variables]
setorderv(DT, grouping_variables)
DT

Reshape.   Transform to row-record form to set up the graduation rate calculation. Transform the N column into two columns, one for starters and one for graduates.

# Prepare to compute metric
DT <- dcast(DT, program + race + sex ~ bloc, value.var = "N", fill = 0)
DT

Create a variable.   Compute the metric.

# Compute metric
DT[, rate := round(100 * graduates / starters, 1)]
DT
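
The count, reshape, and compute steps above can be traced on a small made-up data set. This is a sketch with hypothetical rows, using only program and sex as grouping variables to keep the output short.

```r
library(data.table)

# Hypothetical combined blocs: one row per student per bloc
toy <- data.table(
  bloc    = c(rep("starters", 5), rep("graduates", 2)),
  program = c("CE", "CE", "CE", "EE", "EE", "CE", "CE"),
  sex     = c("F", "F", "M", "F", "M", "F", "M")
)

# Count observations by group
counts <- toy[, .N, by = c("bloc", "program", "sex")]

# One column per bloc; groups with no graduates are filled with 0
wide <- dcast(counts, program + sex ~ bloc, value.var = "N", fill = 0)

# Graduation rate in percent
wide[, rate := round(100 * graduates / starters, 1)]
wide
```

Without `fill = 0`, the two EE groups would have missing graduate counts and therefore missing rates instead of rates of zero.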

Prepare for dissemination

Filter.   To preserve the anonymity of the people involved, we remove observations with fewer than N_threshold graduates. With the research data, we typically set this threshold to 10; with the practice data, we demonstrate the procedure using a threshold of 5.

# Preserve anonymity
N_threshold <- 5 # 10 for research data
DT <- DT[graduates >= N_threshold]
DT

Recode.   Readers can more readily interpret our charts and tables if the programs are unabbreviated.

# Recode values for chart and table readability
DT[, program := fcase(
  program %like% "CE", "Civil",
  program %like% "EE", "Electrical",
  program %like% "ME", "Mechanical",
  program %like% "ISE", "Industrial/Systems"
)]
DT

Add a variable.   We combine race/ethnicity and sex to create a combined grouping variable.

# Create a combined category
DT[, people := paste(race, sex)]
DT[, `:=`(race = NULL, sex = NULL)]
setcolorder(DT, c("program", "people"))
DT

Chart

Order factors.   Order the levels of the categories. Code adapted from Multiway data and charts.

# Order the categories
DT <- order_multiway(DT,
  quantity   = "rate",
  categories = c("program", "people"),
  method     = "percent",
  ratio_of   = c("graduates", "starters")
)
DT

Multiway chart.   Code adapted from Multiway data and charts.

The vertical reference line is the aggregate graduation rate of the program, independent of race/ethnicity and sex. A missing data marker or missing group indicates the number of graduates was below the threshold set to preserve anonymity---largely an artifact of applying these groupings to practice data.

#| label: fig02
#| fig-asp: 1.1
#| fig-cap: "Figure 2: Graduation rates of four Engineering majors."

ggplot(DT, aes(x = rate, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_rate), linetype = 2, color = "gray60") +
  geom_point() +
  labs(x = "Graduation rate (%)", y = "") +
  scale_x_continuous(limits = c(20, 90), breaks = seq(0, 100, 10))
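
The reference line described above is an aggregate rate for the program as a whole. The sketch below shows one way such a value can be computed by hand, assuming it is the ratio of summed graduates to summed starters. It uses hypothetical counts and is not a reproduction of what `order_multiway()` does internally.

```r
library(data.table)

# Hypothetical group-level counts for one program
toy <- data.table(
  program   = c("Civil", "Civil", "Civil"),
  people    = c("Group A", "Group B", "Group C"),
  graduates = c(30, 10, 5),
  starters  = c(50, 25, 25)
)
toy[, rate := round(100 * graduates / starters, 1)]

# Aggregate rate: total graduates over total starters within the program
toy[, program_rate := round(100 * sum(graduates) / sum(starters), 1), by = "program"]
toy

# For comparison, the unweighted mean of the group rates
mean_of_rates <- mean(toy$rate)
mean_of_rates
```

Here the aggregate rate (45%) differs from the unweighted mean of the group rates (40%) because the groups have different numbers of starters.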

Table

Results table.   Code adapted from Multiway data and charts.

# Select variables and remove factors
display_table <- copy(DT)
display_table <- display_table[, .(program, people, rate)]
display_table[, people := as.character(people)]
display_table[, program := as.character(program)]

# Construct table
display_table <- dcast(display_table, people ~ program, value.var = "rate")
setnames(display_table,
  old = c("people"),
  new = c("People"),
  skip_absent = TRUE
)
display_table

(Optional) Format the table nearer to publication quality. Here we use the gt package.

library(gt)
display_table |>
  gt() |>
  tab_caption("Table 2: Graduation rates (%) of four Engineering majors") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

A value of NA indicates a group removed because the number of graduates was below the threshold set to preserve anonymity. As noted earlier, these are largely an artifact of applying these groupings to practice data.

References




MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.