The time span (or range) of MIDFIELD data varies by institution. At the upper and lower limits of a data range, a potential for false counts exists when a metric (such as graduation rate) requires knowledge of timely degree completion. For such metrics, student records that produce problematic results due to insufficient data are nearly always excluded from study.
This article addresses the data sufficiency step of the MIDFIELD workflow.
For students admitted too near the upper limit of their institution's data range, the available data cover an insufficient number of years to know if completion is timely. To illustrate, in Figure 1 we compare two students admitted in different terms with representative time spans shown for timely completion. In this scenario, we assume institution data are available from 1986 to 1996.
Figure 1: Upper limit data sufficiency. Timelines for the institution data range (1986 to 1996), student A (entry 1988, TC term 1994), and student B (entry 1993, TC term 1998, shown dashed beyond 1996), with the upper data limit marked at 1996.
Student A
: Student A enters in 1988 with a timely completion (TC) term in 1994. In both of the following cases, the data sufficiency criterion is satisfied and the records are included in a study.
A-1: First time in college (FTIC), so we know their first term is their entry term (i.e., they are not a continuing student) and we can determine their TC term.
A-2: Transfer student, and we know their first term in a MIDFIELD institution. We have no knowledge of how much time was spent accumulating their pre-MIDFIELD credit hours, but we can estimate a TC term with respect to their "level" at entry, that is, entering as a first-year student, second-year student, etc.
Student B
: Student B enters in 1993 with a TC term in 1998, two years beyond the range of the data. There are four possible cases:
B-1: Before the data limit, the student completes their program (timely completion, known record)
B-2: Before the data limit, the student leaves the database (non-completion, known record)
B-3: After the data limit, the student completes before their TC term (timely completion, no record)
B-4: After the data limit, the student completes after their TC term or fails to complete (late completion or non-completion, no record)
Because the outcomes in cases B-3 and B-4 are not in the record, including cases B-1 and B-2 would invariably produce a miscount of timely completers, late completers, and non-completers. Thus all student B records are excluded from the study.
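The upper-limit reasoning can be sketched with the numbers read from Figure 1. This is a minimal illustration in base R, not the midfieldr implementation; the data frame and its column names (entry_year, tc_year, sufficient) are hypothetical.

```r
# Values taken from the Figure 1 scenario; column names are hypothetical
upper_limit <- 1996
students <- data.frame(
  id = c("A", "B"),
  entry_year = c(1988, 1993),
  tc_year = c(1994, 1998)
)

# Data are sufficient only if the TC term does not exceed the upper limit
students$sufficient <- students$tc_year <= upper_limit
students
```

Student B fails the check regardless of which of the cases B-1 through B-4 actually occurred, which is why every student B record is dropped.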
To determine data sufficiency record exclusions at the lower limit of the data range, we compare a student's first term (non-summer) to the first term of the data range (also non-summer). When these two terms are identical, the complete unit record is excluded. We illustrate with the three scenarios described below.
Figure 2: Lower limit data sufficiency. Timelines for the institution data range (1986 to 1996), student A (entering shortly after the lower limit, with a TC term six years later), student C (in the data from 1986 with a TC term in 1992, possibly enrolled since 1984), and student D (enrolled from 1984 and gone before 1986), with the lower data limit marked at 1986.
Student A
: Like Student A in Figure 1, they enter the dataset in a term following the data lower limit and are included in a study.
Student C
: Student C enters the institution before the lower limit of the data range (a "continuing" student) or they enter the institution at the lower limit precisely.
C-1: If student C is continuing, regardless of status (FTIC or transfer), making an estimate of their TC term invariably leads to false counts because we have no knowledge of how much time was spent accumulating credit hours at their MIDFIELD institution before the lower data limit. Including C-1 would also produce false counts because of student D (discussed below).
C-2: If student C is not continuing, that is, their first time entry to a MIDFIELD institution is at the lower data limit (here, 1986), we would include them in a study if we could. Unfortunately, we cannot distinguish them from continuing students. Having to exclude C-1 inherently excludes C-2 as well.
Student D
: Student D enters the institution at the same time as continuing student C but leaves the database before the data lower limit term.
D-1: Student D did not timely-complete their program. In this case, if we include student C our count of non-completers is low (D-1 cases are missing), resulting in an inflated ratio of completers to non-completers.
D-2: Student D did timely-complete their program. Here, if we include student C our count of completers is low (D-2 cases are missing), resulting in a diminished ratio of completers to non-completers.
The balance of these two effects is unknowable. Since student D cannot possibly be included, Student C must also be excluded.
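A small worked example with made-up counts shows the direction of the D-1 effect. All numbers below are hypothetical and chosen only for illustration.

```r
# Hypothetical cohort that entered before the lower data limit
n_left_before_limit <- 20  # student D-1: left without completing, no record
n_c_completers <- 60       # student C: in the data, timely completers
n_c_noncompleters <- 20    # student C: in the data, non-completers

# Ratio of completers to non-completers using only the visible records
ratio_observed <- n_c_completers / n_c_noncompleters

# Ratio if the missing D-1 students could also be counted
ratio_true <- n_c_completers / (n_c_noncompleters + n_left_before_limit)

c(observed = ratio_observed, true = ratio_true)
```

The D-2 case works in the opposite direction: unseen completers would raise the true ratio above the observed one. Because the sizes of the two missing groups are unknown, no correction is possible.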
Specific student unit records at the upper and lower limits of an institution's data range must be excluded to prevent false counts due to insufficient data. Based on the discussion above, two specific filters are implemented:
Lower limit. All IDs extant in the non-summer lower limit of an institution’s data range are labeled for possible exclusion.
Upper limit. All IDs for which the timely completion term exceeds the upper limit of the institution's data range are labeled for possible exclusion.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr)
library(midfielddata)
library(data.table)
Load. Practice datasets. View data dictionary via ?term.
# Load data
data(term)
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Copy of source files with all variables
source_term <- copy(term)
# Select variables required by midfieldr functions
term <- select_required(source_term)
Initialize. Assign a working data frame.
# Working data frame
DT <- copy(term)
DT
Select. The ID column is required. The institution column is not, but is convenient when taking a closer look at the results.
# Retain the minimum number of columns
DT <- DT[, .(mcid, institution)]
Filter. Retain unique IDs.
# Filter for unique IDs
DT <- unique(DT)
DT
add_timely_term()
Add a column to a data frame of student-level data that indicates the latest term by which degree completion would be considered timely for every student.
Arguments.
dframe
Data frame of student-level records keyed by student ID. Required variable (column) is mcid.
midfield_term
Data frame of student-level term observations keyed by student ID. Default is term. Required variables (columns) are mcid, term, and level.
span
Optional integer scalar, number of years to define timely completion. Commonly used values are 100%, 150%, and 200% of sched_span. Default 6 years. Argument to be used by name.
sched_span
Optional integer scalar, the number of years an institution officially schedules for completing a program. Default 4 years. Argument to be used by name.
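As a quick check on the arithmetic, the commonly used span values follow directly from sched_span. The variable span_options below is for illustration only and is not part of midfieldr.

```r
# Default scheduled span, in years
sched_span <- 4

# 100%, 150%, and 200% of the scheduled span
span_options <- sched_span * c(1, 1.5, 2)
span_options
```

The default span of 6 years corresponds to 150% of the default sched_span. A non-default value would be passed by name, for example span = 8.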
Equivalent usage. The following implementations yield identical results:
# Required arguments in order and explicitly named
x <- add_timely_term(dframe = DT, midfield_term = term)
# Required arguments in order, but not named
y <- add_timely_term(DT, term)
# Using the implicit default for the midfield_term argument
z <- add_timely_term(DT)
# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)
Output. Adds the following columns to the data frame.
term_i
Student initial term, encoded YYYYT.
level_i
Student level (01 Freshman, 02 Sophomore, etc.) in their initial term.
adj_span
Integer span of years for timely completion, adjusted for a student's initial level.
timely_term
Latest term by which degree completion would be considered timely. Encoded YYYYT.
# Add timely term column and supporting variables
DT <- add_timely_term(DT, term)
DT
Examining the records of selected students in detail.
Example 1. The student's initial term is Fall 2007 (encoded 20071) and their initial level is 01 First-year. The number of years to timely completion is 6 years, that is, academic years 2007--08, 08--09, 09--10, 10--11, 11--12, 12--13. Thus their timely completion term is Spring 2013 (encoded 20123).
# Display one student by ID
DT[mcid == "MCID3112785480"]
Example 2. The student's initial term is Spring 2002 (encoded 20013) and their initial level is 03 Third-year, from which we infer they have completed two years of their program, yielding an adjusted span of 4 years. Those four years would encompass terms 20013--20021, 20023--20031, 20033--20041, and 20043--20051, yielding a timely completion term of Fall 2005.
# Display one student by ID
DT[mcid == "MCID3111860641"]
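The arithmetic in Examples 1 and 2 can be sketched as a small function. This is not the midfieldr implementation: the function name timely_term_sketch() is hypothetical, it handles only Fall (1) and Spring (3) term codes, and it assumes one year is deducted from the span for each level above first-year, which is consistent with Example 2 but is an assumption here.

```r
# Hypothetical helper, for illustration only
timely_term_sketch <- function(term_i, level_i, span = 6) {
  year <- term_i %/% 10  # YYYY part of the YYYYT encoding
  t <- term_i %% 10      # T part: 1 = Fall, 3 = Spring
  stopifnot(all(t %in% c(1, 3)))

  # Assumed adjustment: one year per level above first-year
  adj_span <- span - (level_i - 1)

  # A Fall start ends in the Spring term of the final academic year;
  # a Spring start ends in a Fall term, adj_span academic years later
  ifelse(
    t == 1,
    (year + adj_span - 1) * 10 + 3,
    (year + adj_span) * 10 + 1
  )
}

# Example 1: Fall 2007, first-year
timely_term_sketch(20071, level_i = 1)

# Example 2: Spring 2002, third-year
timely_term_sketch(20013, level_i = 3)
```

The two calls reproduce the timely completion terms stated in the examples, 20123 and 20051. Terms with other codes, such as summer terms, are outside the scope of this sketch.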
The data arguments of midfieldr functions accept alternate names, in case the source data tables in your workspace are named something other than student, term, etc. For example, if we were working with the "toy" (exercise) data sets included with midfieldr, we might write something like this:
# A toy set of IDs
toy_mcid <- toy_student[, .(mcid)]
# Source data table names that differ from the defaults
toy_DT <- add_timely_term(dframe = toy_mcid, midfield_term = toy_term)
# Equivalently
toy_DT <- add_timely_term(toy_mcid, toy_term)
toy_DT
An existing column with the same name as one of the added columns is deleted and replaced. Using the toy data to illustrate, we drop the columns added by add_timely_term() except adj_span.
# Drop three columns
toy_DT <- toy_DT[, c("term_i", "level_i", "timely_term") := NULL]
toy_DT
Reapplying the function, the adj_span column is silently deleted and replaced.
# Demonstrate overwriting
toy_DT <- add_timely_term(toy_DT, toy_term)
toy_DT
add_data_sufficiency()
Add a column to a data frame of Student Unit Record (SUR) observations that labels each row for inclusion or exclusion based on data sufficiency near the upper and lower bounds of an institution's data range.
Arguments.
dframe
Data frame of student-level records keyed by student ID. Required variables are mcid and timely_term.
midfield_term
Data frame of student-level term observations keyed by student ID. Default is term. Required variables are mcid, institution, and term.
Equivalent usage. The following implementations yield identical results:
# Required arguments in order and explicitly named
x <- add_data_sufficiency(dframe = DT, midfield_term = term)
# Required arguments in order, but not named
y <- add_data_sufficiency(DT, term)
# Using the implicit default for the midfield_term argument
z <- add_data_sufficiency(DT)
# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)
Output. Adds the following columns to the data frame.
term_i
Student initial term, encoded YYYYT.
lower_limit
Initial term of an institution's data range, encoded YYYYT.
upper_limit
Final term of an institution's data range, encoded YYYYT.
data_sufficiency
Label each observation for inclusion or exclusion based on data sufficiency: "include", indicating that available data are sufficient for estimating timely degree completion; "exclude-upper", indicating that data are insufficient at the upper limit of a data range; or "exclude-lower", indicating that data are insufficient at the lower limit.
# Un-clutter the printout
DT <- DT[, .(mcid, institution, timely_term)]
# Add data sufficiency column and supporting variables
DT <- add_data_sufficiency(DT, term)
DT
As described in the previous section for add_timely_term(), add_data_sufficiency() accepts alternate source names and silently overwrites existing columns that have the same name as one of the added columns.
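The labeling logic can be sketched in base R as follows. This is an illustration under stated assumptions, not the package source: the function label_sufficiency_sketch() and the sample records are hypothetical, summer terms are ignored, and the lower-limit test is applied first when both conditions hold, which is an arbitrary choice made for this sketch.

```r
# Hypothetical helper, for illustration only
label_sufficiency_sketch <- function(term_i, timely_term,
                                     lower_limit, upper_limit) {
  ifelse(
    term_i == lower_limit,
    "exclude-lower",
    ifelse(timely_term > upper_limit, "exclude-upper", "include")
  )
}

# Made-up records for a single institution, terms encoded YYYYT
rec <- data.frame(
  term_i      = c(19901, 20071, 20121),
  timely_term = c(19953, 20123, 20173),
  lower_limit = 19901,
  upper_limit = 20153
)

rec$data_sufficiency <- with(
  rec,
  label_sufficiency_sketch(term_i, timely_term, lower_limit, upper_limit)
)
rec
```

Because terms are encoded as YYYYT integers, ordinary numeric comparison is enough to test whether a timely completion term exceeds the upper limit.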
The data ranges of the institutions are:
# Data range by institution
term[order(institution), .(min_term = min(term), max_term = max(term)), by = "institution"]
Example 3. Exemplifies "Student A" in Figure 1 or Figure 2. The student attends Institution C which has a data range of 1990--2015. The student's initial term is Fall 2007 so the 1990 lower-limit exclusion does not apply; the student's timely completion term is Spring 2013, so the 2015 upper-limit exclusion does not apply.
# Display one student by ID
DT[mcid == "MCID3112785480"]
Example 4. Exemplifies "Student B" in Figure 1. The student attends Institution B which has a data range of 1988--2018. The student's initial term is Spring 2013 so the 1988 lower-limit exclusion does not apply; the student's timely completion term is Fall 2019, so the 2018 upper-limit exclusion does apply.
# Display one student by ID
DT[mcid == "MCID3111170322"]
Example 5. Exemplifies "Student C" in Figure 2. The student attends Institution B which has a data range of 1988--2009. The student's initial term is Fall 1988 so the 1988 lower-limit exclusion applies.
# Display one student by ID
DT[mcid == "MCID3112056754"]
Preparation. The term data table is the intake for this section.
DT <- copy(term)
Data sufficiency. A summary code chunk for ready reference.
# Filter for data sufficiency, output unique IDs
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include", .(mcid)]
DT <- unique(DT)