#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-020-")
library(ggplot2)

The time span (or range) of MIDFIELD data varies by institution. At the upper and lower limits of a data range, a potential for false counts exists when a metric (such as graduation rate) requires knowledge of timely degree completion. For such metrics, student records that produce problematic results due to insufficient data are nearly always excluded from study.

This article in the MIDFIELD workflow.

  1. Planning
  2. [Initial processing]{.accent}
    • [Data sufficiency]{.accent}
    • Degree seeking
    • Identify programs
  3. Blocs
  4. Groupings
  5. Metrics
  6. Displays

Definitions




Upper-limit data sufficiency

For students admitted too near the upper limit of their institution's data range, the available data cover an insufficient number of years to know if completion is timely. To illustrate, in the figure we compare two students admitted in different terms with representative time spans shown for timely completion. In this scenario, we assume institution data is available from 1986 to 1996.


#| echo: false
#| label: fig01
#| fig-width: 8.4
#| fig-cap: "Figure 1: Upper limit data sufficiency."

# upper limit chart

# parameters
callout_color <- "gray60"
callout_line_size <- 0.3
anno_size <- 3.5 # 3.5 approx 10 point
vert_baseline <- 1.07
del <- 0.15 # delta y for dimension lines

# vertical dim lines
vert_dim_lines <- function(dt, cols) {
  data <- copy(dt)
  data[, .SD, .SDcols = cols]
  geom_segment(
    data = data,
    na.rm = TRUE,
    aes(
      x = get(cols[1]),
      xend = get(cols[1]),
      y = get(cols[2]) - del,
      yend = get(cols[2]) + del
    ),
    color = callout_color,
    linewidth = callout_line_size,
    linetype = 1
  )
}

# construct coordinates
coord <- wrapr::build_frame(
  "id", "row", "dash1", "dash2", "arrow1", "arrow2", "arrowlabel", "labelx" |
    "range", 3, NA, NA, 1986, 1996, "institution data range", 1989 |
    "A", 2, NA, NA, 1988, 1994, "TC term", 1994 |
    "B", 1, 1996, 1998, 1993, 1996, "TC term", 1998 |
    "limit", 0.2, NA, NA, NA, NA, "upper data limit", 1996
) |> data.table()

# x scale ticks
breaks_x <- sort(unique(coord[, round(c(dash1, dash2, arrow1, arrow2), 0)], na.rm = TRUE))

ggplot() +
  scale_x_continuous(breaks = breaks_x, limits = c(1983.5, 1999)) +
  scale_y_continuous(limits = c(-0.5, 3.5)) +
  labs(x = "Year", y = "") +
  theme_light() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  geom_segment( # arrows
    data = coord[, .(row, arrow1, arrow2)],
    na.rm = TRUE,
    aes(x = arrow1, xend = arrow2, y = row, yend = row),
    color = callout_color,
    linewidth = callout_line_size,
    arrow = arrow(
      type = "closed",
      length = unit(c(2, 2, 0), "mm"),
      ends = c("both", rep("last", nrow(coord) - 1))
    ),
    arrow.fill = callout_color
  ) +
  geom_segment( # dashed lines
    data = coord[, .(row, dash1, dash2)],
    na.rm = TRUE,
    aes(x = dash1, xend = dash2, y = row, yend = row),
    color = callout_color,
    linewidth = 1.5 * callout_line_size,
    linetype = 2
  ) +
  geom_point( # entry dots
    data = coord[id %chin% LETTERS, .(id, row, arrow1)],
    na.rm = TRUE,
    aes(x = arrow1, y = row),
    size = 2,
    color = callout_color
  ) +
  geom_segment(
    aes(
      x = 1996,
      xend = 1996,
      y = min(coord$row) - 2 * del,
      yend = max(coord$row) + 2 * del
    ),
    color = callout_color,
    linewidth = callout_line_size
  ) +
  vert_dim_lines(dt = coord, cols = c("dash2", "row")) +
  vert_dim_lines(dt = coord[id %chin% LETTERS], cols = c("arrow2", "row")) +
  vert_dim_lines(dt = coord[!id %chin% LETTERS], cols = c("arrow1", "row")) +
  geom_text(
    data = coord[, .(row, arrowlabel, labelx)],
    na.rm = TRUE,
    aes(x = labelx, y = row, label = arrowlabel),
    vjust = 1.5,
    hjust = -0.1
  ) +
  geom_text(
    data = coord[id %chin% LETTERS, .(id, row)],
    na.rm = TRUE,
    aes(x = 1986, y = row, label = paste0("student ", id, ":")),
    vjust = 0.5,
    hjust = 1.25
  )

Student A

: Student A enters in 1988 with a timely completion (TC) term in 1994. In both of the following cases, the data sufficiency criterion is satisfied and the records are included in a study.

Student B

: Student B enters in 1993 with a TC term in 1998, two years beyond the range of the data. We have several possible cases,

Because the outcomes in cases B-3 and B-4 are not in the record, to include case B-1 and B-2 invariably produces a miscount of timely completers, late completers, and non-completers. Thus all student B records are excluded from the study.

Lower-limit data sufficiency

To determine data sufficiency record exclusions at the lower limit of the data range, we compare a student's first term (non-summer) to the first term of the data range (also non-summer). When these two terms are identical, the complete unit record is excluded. We illustrate with the three scenarios described below.


#| echo: false
#| label: fig02
#| fig.width: 8.4
#| fig-cap: "Figure 2: Lower limit data sufficiency."

# lower limit chart

# construct coordinates
coord <- wrapr::build_frame(
  "id", "row", "dash1", "dash2", "arrow1", "arrow2", "arrowlabel", "labelx" |
    "range", 4, NA, NA, 1986, 1996, "institution data range", 1989 |
    "A", 3, NA, NA, 1986.5, 1992.5, "TC term", 1992.5 |
    "C", 2, 1984, 1986, 1986, 1992, "TC term", 1992 |
    "D", 1, 1984, 1985.8, NA, NA, "lower data limit", 1986
) |> data.table()

# x scale ticks
breaks_x <- sort(unique(coord[, round(c(dash1, dash2, arrow1, arrow2), 0)], na.rm = TRUE))

# lower limit graph
ggplot() +
  scale_x_continuous(breaks = breaks_x, limits = c(1981.5, 1997)) +
  scale_y_continuous(limits = c(0.5, 4.5)) +
  labs(x = "Year", y = "") +
  theme_light() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  geom_segment( # arrows
    data = coord[, .(row, arrow1, arrow2)],
    na.rm = TRUE,
    aes(x = arrow1, xend = arrow2, y = row, yend = row),
    color = callout_color,
    linewidth = callout_line_size,
    arrow = arrow(
      type = "closed",
      length = unit(2, "mm"),
      ends = c("both", rep("last", nrow(coord) - 1))
    ),
    arrow.fill = callout_color
  ) +
  geom_segment( # dashed lines
    data = coord[, .(row, dash1, dash2)],
    na.rm = TRUE,
    aes(x = dash1, xend = dash2, y = row, yend = row),
    color = callout_color,
    linewidth = 1.5 * callout_line_size,
    linetype = 2
  ) +
  geom_point( # entry dots
    data = coord[id %chin% LETTERS, .(id, row, arrow1)],
    na.rm = TRUE,
    aes(x = arrow1, y = row),
    size = 2,
    color = callout_color
  ) +
  geom_segment(
    aes(
      x = 1986,
      xend = 1986,
      y = min(coord$row) - 2 * del,
      yend = max(coord$row) + 2 * del
    ),
    color = callout_color,
    linewidth = callout_line_size
  ) +
  vert_dim_lines(dt = coord, cols = c("dash1", "row")) +
  vert_dim_lines(dt = coord, cols = c("dash2", "row")) +
  vert_dim_lines(dt = coord, cols = c("arrow2", "row")) +
  geom_text(
    data = coord[, .(row, arrowlabel, labelx)],
    na.rm = TRUE,
    aes(x = labelx, y = row, label = arrowlabel),
    vjust = 1.5,
    hjust = -0.1
  ) +
  geom_text(
    data = coord[id %chin% LETTERS, .(id, row)],
    na.rm = TRUE,
    aes(x = 1984, y = row, label = paste0("student ", id, ":")),
    vjust = 0.5,
    hjust = 1.25
  )

Student A

: Like Student A in Figure 1, they enter the dataset in a term following the data lower limit and are included in a study.

Student C

: Student C enters the institution before the lower limit of the data range (a "continuing" student) or they enter the institution at the lower limit precisely.

Student D

: Student D enters the institution at the same time as continuing student C but leaves the database before the data lower limit term.

The balance of these two effects is unknowable. Since student D cannot possibly be included, Student C must also be excluded.

Method

Specific student unit records at the upper and lower limits of an institution's data range must be excluded to prevent false counts due to insufficient data. Based on the discussion above, two specific filters are implemented:


Load data

Start.   If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load.   Practice datasets. View data dictionary via ?term.

# Load data
data(term)

Initial processing

Select (optional).   Reduce the number of columns. Code reproduced from Getting started.

# Copy of source files with all variables
source_term <- copy(term)

# Select variables required by midfieldr functions
term <- select_required(source_term)

Initialize.   Assign a working data frame.

# Working data frame
DT <- copy(term)
DT

Select.   The ID column is required. The institution column is not, but is convenient when taking a closer look at the results.

# Retain the minimum number of columns
DT <- DT[, .(mcid, institution)]

Filter.   Retain unique IDs.

# Filter for unique IDs
DT <- unique(DT)
DT

add_timely_term()

Add a column to a data frame of student-level data that indicates the latest term by which degree completion would be considered timely for every student.

Arguments.

Equivalent usage.   The following implementations yield identical results,

# Required arguments in order and explicitly named
x <- add_timely_term(dframe = DT, midfield_term = term)

# Required arguments in order, but not named
y <- add_timely_term(DT, term)

# Using the implicit default for the midfield_term argument
z <- add_timely_term(DT)

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output.   Adds the following columns to the data frame.

# Add timely term column and supporting variables
DT <- add_timely_term(DT, term)
DT

Closer look

Examining the records of selected students in detail.

Example 1.   The student's initial term is Fall 2007 (encoded 20071) and their initial level is 01 First-year. The number of years to timely completion is 6 years, that is, academic years 2007--08, 08--09, 09--10, 10--11, 11--12, 12--13. Thus their timely completion term is Spring 2013 (encoded 20123).

# Display one student by ID
DT[mcid == "MCID3112785480"]

Example 2.   The student's initial term is Spring 2002 (encoded 20013) and their initial level is 03 Third-year from which we infer they have completed two years of their program, yielding an adjusted span of 4 years. Those four years would encompass terms 20013--20021, 20023--20031, 20033--20041, and 20043--20051, yielding a timely completion term of Fall 2005.

# Display one student by ID
DT[mcid == "MCID3111860641"]

Alternate source names

Arguments of midfieldr functions accept alternate names, should the source-data file names in your workspace be named something other than student, term, etc. For example, if we were working with the "toy" (exercise) data sets included with midfieldr, we might write something like this,

# A toy set of IDs
toy_mcid <- toy_student[, .(mcid)]

# Source data table names that differ from the defaults
toy_DT <- add_timely_term(dframe = toy_mcid, midfield_term = toy_term)

# Equivalently
toy_DT <- add_timely_term(toy_mcid, toy_term)
toy_DT

Silent overwriting

Existing columns with the same names as one of the added columns are deleted and replaced. Using the toy data to illustrate, we drop the columns added by timely term except adj_span.

# Drop three columns
toy_DT <- toy_DT[, c("term_i", "level_i", "timely_term") := NULL]
toy_DT

Reapplying the function, the adj_span column is silently deleted and replaced.

# Demonstrate overwriting
toy_DT <- add_timely_term(toy_DT, toy_term)
toy_DT

add_data_sufficiency()

Add a column to a data frame of Student Unit Record (SUR) observations that labels each row for inclusion or exclusion based on data sufficiency near the upper and lower bounds of an institution's data range.

Arguments.

Equivalent usage.   The following implementations yield identical results,

# Required arguments in order and explicitly named
x <- add_data_sufficiency(dframe = DT, midfield_term = term)

# Required arguments in order, but not named
y <- add_data_sufficiency(DT, term)

# Using the implicit default for the midfield_term argument
z <- add_data_sufficiency(DT)

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output.   Adds the following columns to the data frame.

# Un-clutter the printout
DT <- DT[, .(mcid, institution, timely_term)]

# Add data sufficiency column and supporting variables
DT <- add_data_sufficiency(DT, term)
DT

Similar to the details described in the previous section, add_data_sufficiency() accepts [Alternate source names] and uses [Silent overwriting] when existing columns have the same name as one of the added columns.

#| include: false
# Find the closer look IDs

x <- copy(DT)
x[data_sufficiency == "exclude-lower"]



DT[mcid == "MCID3111142689"]


x <- add_timely_term(x)
x[adj_span == 4]

Closer look

The data range for the institutions are:

# Data range by institution
term[order(institution), .(min_term = min(term), max_term = max(term)), by = "institution"]

Example 3.   Exemplifies "Student A" in Figure 1 or Figure 2. The student attends Institution C which has a data range of 1990--2015. The student's initial term is Fall 2007 so the 1990 lower-limit exclusion does not apply; the student's timely completion term is Spring 2013, so the 2015 upper-limit exclusion does not apply.

# Display one student by ID
DT[mcid == "MCID3112785480"]

Example 4.   Exemplifies "Student B" in Figure 1. The student attends Institution B which has a data range of 1988--2018. The student's initial term is Spring 2013 so the 1988 lower-limit exclusion does not apply; the student's timely completion term is Fall 2019, so the 2018 upper-limit exclusion does apply.

# Display one student by ID
DT[mcid == "MCID3111170322"]

Example 5.   Exemplifies "Student C" in Figure 2. The student attends Institution B which has a data range of 1988--2009. The student's initial term is Fall 1988 so the 1988 lower-limit exclusion applies.

# Display one student by ID
DT[mcid == "MCID3112056754"]

Reusable code

Preparation.   The term data table is the intake for this section.

DT <- copy(term)

Data sufficiency.   A summary code chunk for ready reference.

# Filter for data sufficiency, output unique IDs
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include", .(mcid)]
DT <- unique(DT)

References




MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.