In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-100-")
library("ggplot2")

Graduation rate is a widely used, though flawed, measure of academic achievement.

The American Council on Education estimates that the conventional definition of graduation rate may exclude up to 60% of students at 4-year institutions [@Cook+Hartle:2011]. Nevertheless, as Cook and Hartle explain,

... in the eyes of the public, policy makers, and the media, graduation rate is a clear, simple, and logical---if often misleading---number.

Recognizing that graduation rate is a popular metric, we propose a definition of graduation rate that includes all conventionally excluded students except migrators. You can skip the FYE content in this vignette if your study includes no FYE-style Engineering programs.

This vignette in the MIDFIELD workflow.

Planning
Initial processing
Blocs
Groupings
[Metrics]{.accent}
- [Graduation rate]{.accent}
- Stickiness
Displays

Definitions

[ G=\frac{N_{sg}}{N_s} ]

starter-graduates

: Subset of the starters bloc who are graduates (timely completers) from their starting programs.

Starters and migrators

As they pertain to the graduation rate metric, relationships among starters, migrators, and graduates (timely completers) of a given program P are illustrated in Figure 1.

The overall rectangle represents the set of students ever enrolled in program P.
The interior rectangle represents the set of graduates (timely completers) of program P.
Region 1 (shaded) represents the graduation rate denominator $(N_s)$, the set of starters in program P.
Region 2 (shaded) represents the graduation rate numerator $(N_{sg})$, the subset of starters who are also graduates of program P.
Region 3 (unshaded) represents the set of students excluded from the graduation rate metric, depending on how "program" is defined as discussed below.

#| echo: false
#| label: fig01
#| fig-width: 12
#| fig-asp: 0.7
#| fig-cap: "Figure 1. Graduation rate metric. Starters, migrators, and timely completers."

df_tile <- data.frame(
  x = rep(c(2, 4), 2), # centerline of rectangle
  y = rep(c(1), each = 2), # centerline
  z = factor(rep(1:2))
)

delta <- 0.02

# x-position, center of circled numbers
c1 <- 2.57
c2 <- 1.75 # 1.55
c3 <- 4 # 3.43
c4 <- 4.25

df_box1 <- data.frame(
  x = c(1, 5, 5) + delta * c(-1, 1, 1),
  y = c(0.5, 0.5, 1.5) + delta / 2 * c(-1, -1, 1)
)
df_box2 <- data.frame(
  x = c(1, 1, 5) + delta * c(-1, -1, 1),
  y = c(0.5, 1.5, 1.5) + delta / 2 * c(-1, 1, 1)
)
df_circ <- data.frame(x = c(c1, c2, c3, c4), y = c(1, 1, 1, 1.35))
df_hash <- data.frame(
  x = c(3, 3), xend = c(5, 5),
  y = c(0.5, 1.5), yend = c(1.5, 0.5)
)
df_circ2 <- data.frame(x = 4, y = 0.99)

ggplot(df_tile, aes(x, y)) +
  geom_tile(aes(fill = z)) +
  scale_x_continuous(breaks = seq(0, 16, 2)) +
  scale_fill_manual(
    values = c("#80cdc1", "white"), # "#dfc27d"  "#80cdc1"
    aesthetics = c("colour", "fill")
  ) +
  geom_segment(
    data = df_hash,
    aes(x = x, xend = xend, y = y, yend = yend),
    linewidth = 0.5,
    color = "gray70"
  ) +
  # interior rectangle
  geom_point(
    data = data.frame(x = 3, y = 0.95),
    aes(x = x, y = y),
    shape = 22,
    size = 190,
    color = "gray30",
    fill = "white",
    alpha = 0.4
  ) +
  theme_void() +
  theme(legend.position = "none") +
  geom_line(data = df_box1, aes(x = x, y = y), linewidth = 1, linetype = 2) +
  geom_line(data = df_box2, aes(x = x, y = y), linewidth = 1, linetype = 2) +
  scale_y_continuous(limits = c(0.4, 1.6)) +
  annotate("text",
    x = c(c1, c2, c3, c4, 3, 3),
    y = c(0.925, 1.45, 0.925, 1.45, 1.27, 1.55), # 1.37
    label = c(
      "", # starter-completers
      "starters in program P",
      "", #  migrator-completers
      "migrators into program P",
      "timely completers of program P",
      "ever enrolled in program P"
    ),
    hjust = 0.5,
    vjust = 0.5,
    size = 6
  ) +
  geom_point(
    data = df_circ[1:3, ],
    aes(x = x, y = y),
    shape = 21,
    size = 18,
    fill = c("transparent", "transparent", "white")
  ) +
  annotate("text",
    x = df_circ$x[1:3],
    y = df_circ$y[1:3],
    label = c("2", "1", "3"),
    hjust = 0.5,
    vjust = 0.5,
    size = 6
  )

When calculating graduation rate, whether migrator-graduates are included in the count of graduates depends how a program is defined in terms of CIP codes.

Institution level. Graduation rate computed at the institution level includes all migrators within the institution. For example, starters in Engineering (CIP 14) who graduate in Business (CIP 52) are both starters and timely completers at the institution level. IPEDS defines this rate as the institution completion rate.
2-digit CIP. Graduation rate includes migrator graduates within the same 2-digit CIP. For example, starters in Engineering (CIP 14) graduating in Business (CIP 52) are excluded from the count of Business graduates, but migrators within Engineering (all 6-digit CIP codes starting with 14) are both starters and timely completers in Engineering.
4-digit CIP. Similar to the 2-digit case. For example, starters in Electrical Engineering (CIP 1410) graduating in Mechanical Engineering (CIP 1419) are excluded from the count of Mechanical Engineering graduates, but migrators within Electrical Engineering (all 6-digit CIP codes starting with 1410) are both starters and timely completers in Electrical Engineering.
6-digit CIP. Rarely used. Graduation rate at this CIP level excludes all migrators from the count of graduates.
Multiple CIPs. In some cases, a single program or major includes different 4-digit CIPs. For example, migrators between Systems Engineering (CIP 1427), Industrial Engineering (CIP 1435), Manufacturing Engineering (CIP 1436), and Operations Research (CIP 1437) might be considered both starters and timely completers in a general program of Industrial & Systems Engineering.

Who is a starter?

In the US, the predominant definition of graduation rate is that established by the US Department of Education, Integrated Postsecondary Education Data System (IPEDS). The IPEDS definition underlies the finding cited earlier that a graduation rate metric may exclude up to 60% of students.

Many of the IPEDS exclusions relate to how starters are defined. By expanding the starters definition, MIDFIELD proposes a graduation rate definition that includes all conventionally excluded students except migrators.

graduation rate (IPEDS)

: The fraction of a cohort of full-time, first-time, degree-seeking undergraduates who complete their program within a percentage (100%, 150%, or 200%) of the "normal" time (typically 4 years) as defined by the institution. IPEDS excludes students who attend college part-time, who transfer between institutions, and who start in Winter or Spring terms [@IPEDS:2020].

graduation rate (MIDFIELD)

: The fraction of a cohort of degree-seeking undergraduates who complete their program in a timely manner (typically 6 years). MIDFIELD includes students who attend college part-time, who transfer between institutions, and who start in any term. Table 1 summarizes the comparison between the IPEDS and MIDFIELD graduation rate definitions.

#| echo: false
library("data.table")
library(gt)
wrapr::build_frame(
  "Item", "IPEDS", "MIDFIELD", "MIDFIELD notes" |
    "completion span:", "4, 6, or 8 years", "4, 6, or 8 years", "Typical usage is 6 years" |
    "students admitted in:", "Summer/Fall only", "any term", "" |
    "part-time students are:", "excluded", "included", "Timely completion same as full-time students" |
    "transfer students are:", "excluded", "included", "Timely completion span adjusted for level at entry"
) |>
  gt() |>
  tab_caption("Table 1. Comparing graduation rate definitions") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

First-Year Engineering (FYE) starters

: We estimate the degree-granting engineering program in which an FYE student would have enrolled had they not been required to enroll in FYE. The FYE proxy, a 6-digit CIP code, denotes the program of which the FYE student can be considered a starter. For additional details, see the vignette FYE proxies.

Method

Demonstrating the following elements of a MIDFIELD workflow.

Planning. The metric is graduation rate. Required blocs are starters and the subset of starters who graduate in their starting major. Grouping variables are program, race/ethnicity, and sex. Programs are the four Engineering programs used throughout.
Initial processing. Filter the student-level records for data sufficiency and degree-seeking.
Blocs. Gather starters, filter by program. Gather graduates, filter by program, filter by starters' IDs and programs.
Groupings. Add grouping variables.
Metrics Summarize by grouping variables and compute graduation rate.
Displays Create multiway chart and results table.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)
library(ggplot2)

Load. Practice datasets. View data dictionaries via ?student, ?term, ?degree.

# Load practice data
data(student, term, degree)

Loads with midfieldr. Prepared data. View data dictionaries via ?study_programs, ?baseline_mcid, ?fye_proxy.

study_programs (derived in Programs).
baseline_mcid (derived in Blocs).
fye_proxy (derived in FYE proxies).

Initial processing

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

# Working data frame
DT <- copy(baseline_mcid)
DT

Starters

Starters. The summary code chunk from Starters

# Isolate starting term
DT <- term[DT, .(mcid, term, cip6), on = c("mcid")]
DT <- DT[!cip6 %like% "999999"]
setorderv(DT, cols = c("mcid", "term"))
DT <- DT[, .SD[which.min(term)], by = "mcid"]
DT <- DT[, .(mcid, cip6)]
DT <- unique(DT)

# Continue for starters with FYE
DT <- fye_proxy[DT, .(mcid, cip6, proxy), on = c("mcid")]
DT[, start := fcase(
  cip6 == "140102", proxy,
  cip6 != "140102", cip6
)]
DT <- DT[, .(mcid, start)]

# Filter by program on start
join_labels <- copy(study_programs)
join_labels <- join_labels[, .(program, start = cip6)]
DT <- join_labels[DT, on = c("start"), nomatch = NULL]
DT[, start := NULL]
DT <- unique(DT)

Copy. To prepare for joining with graduates.

# Prepare for joining
setcolorder(DT, c("mcid"))
starters <- copy(DT)
starters

Graduates

Initialize. The data frame of baseline IDs is the intake for this section.

# Working data frame
DT <- copy(baseline_mcid)

Graduates The summary code chunk from Graduates

# Gather graduates, degree CIPs and terms
DT <- add_timely_term(DT, term)
DT <- add_completion_status(DT, degree)
DT <- DT[completion_status == "timely"]
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]

# Filter by programs and first degree terms
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT

#| echo: false
#| eval: false
# finding the closer look IDs
# example 1
DT[starters, on = "mcid", nomatch = NULL][program == i.program]

# example 2
DT[starters, on = "mcid", nomatch = NULL][program != i.program]

# example 3
DT[starters, on = "mcid"][is.na(program)]

Starter-graduates

This section introduces new material---not adapted from the reusable code sections of other vignettes.

For a graduation rate metric, a timely completer is counted among the graduates only if they start and complete the same program.

Filter. Use an inner join to filter the graduates by ID and program to match the IDs and programs of starters.

# Starter-graduates
DT <- starters[DT, on = c("mcid", "program"), nomatch = NULL]

Copy. To prepare for joining with starters.

# Prepare for joining
setcolorder(DT, c("mcid"))
graduates <- copy(DT)
graduates

Closer look

Examining the records of selected students in detail.

Example 1. The student is a starter and a timely completer in Industrial/Systems Engineering (ISE). They appear in both blocs.

# Same ID in different blocs
mcid_we_want <- "MCID3111150194"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

Example 2. The student is a starter in Electrical Engineering (EE). They are excluded from the graduation rate starter-graduate bloc because they did not complete EE. From degree we find that they completed CIP 143501 (ISE), one of the study programs. They are also excluded from a count of ISE graduates because they weren't a ISE starter.

# Same ID in different blocs
mcid_we_want <- "MCID3111235261"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

degree[mcid == mcid_we_want, .(mcid, cip6)]

Example 3. The student is a starter in Civil Engineering (CE). They are excluded from the graduation rate starter-graduate bloc because they did not complete CE. From degree we find that they completed CIP 521401 (Marketing). They would also be excluded from a count of Marketing graduates because they weren't a Marketing starter.

#| collapse: true

# Same ID in different blocs
mcid_we_want <- "MCID3111158691"
starters[mcid == mcid_we_want]

graduates[mcid == mcid_we_want]

degree[mcid == mcid_we_want, .(mcid, cip6)]

Groupings

One of our grouping variables (program) is already included in the data frames. The next grouping variable is bloc to distinguish starters from graduates when the two data frames are combined.

Add a variable. Label starters and graduates.

# For grouping by bloc
starters[, bloc := "starters"]
graduates[, bloc := "graduates"]

Join. Combine the two blocs to prepare for summarizing. A student starting and graduating in the same program now has two observations in these data: one as a starter and one as a graduate.

# Prepare for summarizing
DT <- rbindlist(list(starters, graduates))
DT

Add variables. Demographics from Groupings

# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT

Graduation rate

Summarize. Count the numbers of observations for each combination of the grouping variables.

# Count observations by group
grouping_variables <- c("bloc", "program", "race", "sex")
DT <- DT[, .N, by = grouping_variables]
setorderv(DT, grouping_variables)
DT

Reshape. Transform to row-record form to set up the graduation rate calculation. Transform the N column into two columns, one for starters and one for graduates.

# Prepare to compute metric
DT <- dcast(DT, program + race + sex ~ bloc, value.var = "N", fill = 0)
DT

Create a variable. Compute the metric.

# Compute metric
DT[, rate := round(100 * graduates / starters, 1)]
DT

Prepare for dissemination

Filter. To preserve the anonymity of the people involved, we remove observations with fewer than N_threshold graduates. With the research data, we typically set this threshold to 10; with the practice data, we demonstrate the procedure using a threshold of 5.

# Preserve anonymity
N_threshold <- 5 # 10 for research data
DT <- DT[graduates >= N_threshold]
DT

Recode. Readers can more readily interpret our charts and tables if the programs are unabbreviated.

# Recode values for chart and table readability
DT[, program := fcase(
  program %like% "CE", "Civil",
  program %like% "EE", "Electrical",
  program %like% "ME", "Mechanical",
  program %like% "ISE", "Industrial/Systems"
)]
DT

Add a variable. We combine race/ethnicity and sex to create a combined grouping variable.

# Create a combined category
DT[, people := paste(race, sex)]
DT[, `:=`(race = NULL, sex = NULL)]
setcolorder(DT, c("program", "people"))
DT

Chart

Order factors. Order the levels of the categories. Code adapted from Multiway data and charts.

# Order the categories
DT <- order_multiway(DT,
  quantity   = "rate",
  categories = c("program", "people"),
  method     = "percent",
  ratio_of   = c("graduates", "starters")
)
DT

Multiway chart. Code adapted from Multiway data and charts.

The vertical reference line is the aggregate graduation rate of the program, independent of race/ethnicity and sex. A missing data marker or missing group indicates the number of graduates was below the threshold set to preserve anonymity---largely an artifact of applying these groupings to practice data.

#| label: fig02
#| fig-asp: 1.1
#| fig-cap: "Figure 2: Graduation rates of four Engineering majors."

ggplot(DT, aes(x = rate, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_rate), linetype = 2, color = "gray60") +
  geom_point() +
  labs(x = "Graduation rate (%)", y = "") +
  scale_x_continuous(limits = c(20, 90), breaks = seq(0, 100, 10))

Table

Results table. Code adapted from Multiway data and charts.

# Select variables and remove factors
display_table <- copy(DT)
display_table <- display_table[, .(program, people, rate)]
display_table[, people := as.character(people)]
display_table[, program := as.character(program)]

# Construct table
display_table <- dcast(display_table, people ~ program, value.var = "rate")
setnames(display_table,
  old = c("people"),
  new = c("People"),
  skip_absent = TRUE
)
display_table

(Optional) Format the table nearer to publication quality. Here I use the 'gt' package.

library(gt)
display_table |>
  gt() |>
  tab_caption("Table 2: Graduation rates (%) of four Engineering majors") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

A value of NA indicates a group removed because the number of graduates was below the threshold set to preserve anonymity. As noted earlier, these are largely an artifact of applying these groupings to practice data.

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Starters and migrators

Who is a starter?

Method

Load data

Initial processing

Starters

Graduates

Starter-graduates

Closer look

Groupings

Graduation rate

Prepare for dissemination

Chart

Table

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Starters and migrators

Who is a starter?

Method

Load data

Initial processing

Starters

Graduates

Starter-graduates

Closer look

Groupings

Graduation rate

Prepare for dissemination

Chart

Table

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'