In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-020-")
library(ggplot2)

The time span (or range) of MIDFIELD data varies by institution. At the upper and lower limits of a data range, a potential for false counts exists when a metric (such as graduation rate) requires knowledge of timely degree completion. For such metrics, student records that produce problematic results due to insufficient data are nearly always excluded from study.

This article in the MIDFIELD workflow.

Planning
[Initial processing]{.accent}
- [Data sufficiency]{.accent}
- Degree seeking
- Identify programs
Blocs
Groupings
Metrics
Displays

Definitions

Upper-limit data sufficiency

For students admitted too near the upper limit of their institution's data range, the available data cover an insufficient number of years to know if completion is timely. To illustrate, in the figure we compare two students admitted in different terms with representative time spans shown for timely completion. In this scenario, we assume institution data is available from 1986 to 1996.

#| echo: false
#| label: fig01
#| fig-width: 8.4
#| fig-cap: "Figure 1: Upper limit data sufficiency."

# upper limit chart

# parameters
callout_color <- "gray60"
callout_line_size <- 0.3
anno_size <- 3.5 # 3.5 approx 10 point
vert_baseline <- 1.07
del <- 0.15 # delta y for dimension lines

# vertical dim lines
vert_dim_lines <- function(dt, cols) {
  data <- copy(dt)
  data[, .SD, .SDcols = cols]
  geom_segment(
    data = data,
    na.rm = TRUE,
    aes(
      x = get(cols[1]),
      xend = get(cols[1]),
      y = get(cols[2]) - del,
      yend = get(cols[2]) + del
    ),
    color = callout_color,
    linewidth = callout_line_size,
    linetype = 1
  )
}

# construct coordinates
coord <- wrapr::build_frame(
  "id", "row", "dash1", "dash2", "arrow1", "arrow2", "arrowlabel", "labelx" |
    "range", 3, NA, NA, 1986, 1996, "institution data range", 1989 |
    "A", 2, NA, NA, 1988, 1994, "TC term", 1994 |
    "B", 1, 1996, 1998, 1993, 1996, "TC term", 1998 |
    "limit", 0.2, NA, NA, NA, NA, "upper data limit", 1996
) |> data.table()

# x scale ticks
breaks_x <- sort(unique(coord[, round(c(dash1, dash2, arrow1, arrow2), 0)], na.rm = TRUE))

ggplot() +
  scale_x_continuous(breaks = breaks_x, limits = c(1983.5, 1999)) +
  scale_y_continuous(limits = c(-0.5, 3.5)) +
  labs(x = "Year", y = "") +
  theme_light() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  geom_segment( # arrows
    data = coord[, .(row, arrow1, arrow2)],
    na.rm = TRUE,
    aes(x = arrow1, xend = arrow2, y = row, yend = row),
    color = callout_color,
    linewidth = callout_line_size,
    arrow = arrow(
      type = "closed",
      length = unit(c(2, 2, 0), "mm"),
      ends = c("both", rep("last", nrow(coord) - 1))
    ),
    arrow.fill = callout_color
  ) +
  geom_segment( # dashed lines
    data = coord[, .(row, dash1, dash2)],
    na.rm = TRUE,
    aes(x = dash1, xend = dash2, y = row, yend = row),
    color = callout_color,
    linewidth = 1.5 * callout_line_size,
    linetype = 2
  ) +
  geom_point( # entry dots
    data = coord[id %chin% LETTERS, .(id, row, arrow1)],
    na.rm = TRUE,
    aes(x = arrow1, y = row),
    size = 2,
    color = callout_color
  ) +
  geom_segment(
    aes(
      x = 1996,
      xend = 1996,
      y = min(coord$row) - 2 * del,
      yend = max(coord$row) + 2 * del
    ),
    color = callout_color,
    linewidth = callout_line_size
  ) +
  vert_dim_lines(dt = coord, cols = c("dash2", "row")) +
  vert_dim_lines(dt = coord[id %chin% LETTERS], cols = c("arrow2", "row")) +
  vert_dim_lines(dt = coord[!id %chin% LETTERS], cols = c("arrow1", "row")) +
  geom_text(
    data = coord[, .(row, arrowlabel, labelx)],
    na.rm = TRUE,
    aes(x = labelx, y = row, label = arrowlabel),
    vjust = 1.5,
    hjust = -0.1
  ) +
  geom_text(
    data = coord[id %chin% LETTERS, .(id, row)],
    na.rm = TRUE,
    aes(x = 1986, y = row, label = paste0("student ", id, ":")),
    vjust = 0.5,
    hjust = 1.25
  )

Student A

: Student A enters in 1988 with a timely completion (TC) term in 1994. In both of the following cases, the data sufficiency criterion is satisfied and the records are included in a study.

A-1: First time in college (FTIC), so we know their first term is their entry term (i.e., they are not a continuing student) and we can determine their TC term.
A-2: Transfer student, and we know their first term in a MIDFIELD institution. We have no knowledge of how much time was spent accumulating their pre-MIDFIELD credit hours, but we can estimate a TC term with respect to their "level" at entry, that is, entering as a first-year student, second-year student, etc.

Student B

: Student B enters in 1993 with a TC term in 1998, two years beyond the range of the data. We have several possible cases,

B-1: Before the data limit, the student completes their program (timely completion, known record)
B-2: Before the data limit, the student leaves the data base (non-completion, known record)
B-3: After the data limit, the student completes before their TC term (timely completion, no record)
B-4: After the data limit, the student completes after their TC term or fails to complete (late completion or non-completion, no record)

Because the outcomes in cases B-3 and B-4 are not in the record, to include case B-1 and B-2 invariably produces a miscount of timely completers, late completers, and non-completers. Thus all student B records are excluded from the study.

Lower-limit data sufficiency

To determine data sufficiency record exclusions at the lower limit of the data range, we compare a student's first term (non-summer) to the first term of the data range (also non-summer). When these two terms are identical, the complete unit record is excluded. We illustrate with the three scenarios described below.

#| echo: false
#| label: fig02
#| fig.width: 8.4
#| fig-cap: "Figure 2: Lower limit data sufficiency."

# lower limit chart

# construct coordinates
coord <- wrapr::build_frame(
  "id", "row", "dash1", "dash2", "arrow1", "arrow2", "arrowlabel", "labelx" |
    "range", 4, NA, NA, 1986, 1996, "institution data range", 1989 |
    "A", 3, NA, NA, 1986.5, 1992.5, "TC term", 1992.5 |
    "C", 2, 1984, 1986, 1986, 1992, "TC term", 1992 |
    "D", 1, 1984, 1985.8, NA, NA, "lower data limit", 1986
) |> data.table()

# x scale ticks
breaks_x <- sort(unique(coord[, round(c(dash1, dash2, arrow1, arrow2), 0)], na.rm = TRUE))

# lower limit graph
ggplot() +
  scale_x_continuous(breaks = breaks_x, limits = c(1981.5, 1997)) +
  scale_y_continuous(limits = c(0.5, 4.5)) +
  labs(x = "Year", y = "") +
  theme_light() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  geom_segment( # arrows
    data = coord[, .(row, arrow1, arrow2)],
    na.rm = TRUE,
    aes(x = arrow1, xend = arrow2, y = row, yend = row),
    color = callout_color,
    linewidth = callout_line_size,
    arrow = arrow(
      type = "closed",
      length = unit(2, "mm"),
      ends = c("both", rep("last", nrow(coord) - 1))
    ),
    arrow.fill = callout_color
  ) +
  geom_segment( # dashed lines
    data = coord[, .(row, dash1, dash2)],
    na.rm = TRUE,
    aes(x = dash1, xend = dash2, y = row, yend = row),
    color = callout_color,
    linewidth = 1.5 * callout_line_size,
    linetype = 2
  ) +
  geom_point( # entry dots
    data = coord[id %chin% LETTERS, .(id, row, arrow1)],
    na.rm = TRUE,
    aes(x = arrow1, y = row),
    size = 2,
    color = callout_color
  ) +
  geom_segment(
    aes(
      x = 1986,
      xend = 1986,
      y = min(coord$row) - 2 * del,
      yend = max(coord$row) + 2 * del
    ),
    color = callout_color,
    linewidth = callout_line_size
  ) +
  vert_dim_lines(dt = coord, cols = c("dash1", "row")) +
  vert_dim_lines(dt = coord, cols = c("dash2", "row")) +
  vert_dim_lines(dt = coord, cols = c("arrow2", "row")) +
  geom_text(
    data = coord[, .(row, arrowlabel, labelx)],
    na.rm = TRUE,
    aes(x = labelx, y = row, label = arrowlabel),
    vjust = 1.5,
    hjust = -0.1
  ) +
  geom_text(
    data = coord[id %chin% LETTERS, .(id, row)],
    na.rm = TRUE,
    aes(x = 1984, y = row, label = paste0("student ", id, ":")),
    vjust = 0.5,
    hjust = 1.25
  )

Student A

: Like Student A in Figure 1, they enter the dataset in a term following the data lower limit and are included in a study.

Student C

: Student C enters the institution before the lower limit of the data range (a "continuing" student) or they enter the institution at the lower limit precisely.

C-1: If student C is continuing, regardless of status (FTIC or transfer), making an estimate of their TC term invariably leads to false counts because we have no knowledge of how much time was spent accumulating credit hours at their MIDFIELD institution before the lower data limit. Including C-1 would also produce false counts because of student D (discussed below).
C-2: If student C is not continuing, that is, their first time entry to a MIDFIELD institution is at the lower data limit (here, 1986), we would include them in a study if we could. Unfortunately, we cannot distinguish them from continuing students. Having to exclude C-1 inherently excludes C-2 as well.

Student D

: Student D enters the institution at the same time as continuing student C but leaves the database before the data lower limit term.

D-1: Student D did not timely-complete their program. In this case, if we include student C our count of non-completers is low (D-1 cases are missing), resulting in an inflated ratio of completers to non-completers.
D-2: Student D did timely-complete their program. Here, if we include student C our count of completers is low (D-2 cases are missing), resulting in a diminished ratio of completers to non-completers.

The balance of these two effects is unknowable. Since student D cannot possibly be included, Student C must also be excluded.

Method

Specific student unit records at the upper and lower limits of an institution's data range must be excluded to prevent false counts due to insufficient data. Based on the discussion above, two specific filters are implemented:

Lower limit. All IDs extant in the non-summer lower limit of an institution’s data range are labeled for possible exclusion.
Upper limit. All IDs for which the timely completion term exceeds the upper limit of the institution's data range are labeled for possible exclusion.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load. Practice datasets. View data dictionary via ?term.

# Load data
data(term)

Initial processing

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Copy of source files with all variables
source_term <- copy(term)

# Select variables required by midfieldr functions
term <- select_required(source_term)

Initialize. Assign a working data frame.

# Working data frame
DT <- copy(term)
DT

Select. The ID column is required. The institution column is not, but is convenient when taking a closer look at the results.

# Retain the minimum number of columns
DT <- DT[, .(mcid, institution)]

Filter. Retain unique IDs.

# Filter for unique IDs
DT <- unique(DT)
DT

`add_timely_term()`

Add a column to a data frame of student-level data that indicates the latest term by which degree completion would be considered timely for every student.

Arguments.

dframe Data frame of student-level records keyed by student ID. Required variable (column) is mcid.
midfield_term Data frame of student-level term observations keyed by student ID. Default is term. Required variables (columns) are mcid, term, and level.
span Optional integer scalar, number of years to define timely completion. Commonly used values are are 100%, 150%, and 200% of sched_span. Default 6 years. Argument to be used by name.
sched_span Optional integer scalar, the number of years an institution officially schedules for completing a program. Default 4 years. Argument to be used by name.

Equivalent usage. The following implementations yield identical results,

# Required arguments in order and explicitly named
x <- add_timely_term(dframe = DT, midfield_term = term)

# Required arguments in order, but not named
y <- add_timely_term(DT, term)

# Using the implicit default for the midfield_term argument
z <- add_timely_term(DT)

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output. Adds the following columns to the data frame.

term_i Student initial term, encoded YYYYT.
level_i Student level (01 Freshman, 02 Sophomore, etc.) in their initial term.
adj_span Integer span of years for timely completion, adjusted for a student's initial level
timely_term Latest term by which degree completion would be considered timely. Encoded YYYYT.

# Add timely term column and supporting variables
DT <- add_timely_term(DT, term)
DT

Closer look

Examining the records of selected students in detail.

Example 1. The student's initial term is Fall 2007 (encoded 20071) and their initial level is 01 First-year. The number of years to timely completion is 6 years, that is, academic years 2007--08, 08--09, 09--10, 10--11, 11--12, 12--13. Thus their timely completion term is Spring 2013 (encoded 20123).

# Display one student by ID
DT[mcid == "MCID3112785480"]

Example 2. The student's initial term is Spring 2002 (encoded 20013) and their initial level is 03 Third-year from which we infer they have completed two years of their program, yielding an adjusted span of 4 years. Those four years would encompass terms 20013--20021, 20023--20031, 20033--20041, and 20043--20051, yielding a timely completion term of Fall 2005.

# Display one student by ID
DT[mcid == "MCID3111860641"]

Alternate source names

Arguments of midfieldr functions accept alternate names, should the source-data file names in your workspace be named something other than student, term, etc. For example, if we were working with the "toy" (exercise) data sets included with midfieldr, we might write something like this,

# A toy set of IDs
toy_mcid <- toy_student[, .(mcid)]

# Source data table names that differ from the defaults
toy_DT <- add_timely_term(dframe = toy_mcid, midfield_term = toy_term)

# Equivalently
toy_DT <- add_timely_term(toy_mcid, toy_term)
toy_DT

Silent overwriting

Existing columns with the same names as one of the added columns are deleted and replaced. Using the toy data to illustrate, we drop the columns added by timely term except adj_span.

# Drop three columns
toy_DT <- toy_DT[, c("term_i", "level_i", "timely_term") := NULL]
toy_DT

Reapplying the function, the adj_span column is silently deleted and replaced.

# Demonstrate overwriting
toy_DT <- add_timely_term(toy_DT, toy_term)
toy_DT

`add_data_sufficiency()`

Add a column to a data frame of Student Unit Record (SUR) observations that labels each row for inclusion or exclusion based on data sufficiency near the upper and lower bounds of an institution's data range.

Arguments.

dframe Data frame of student-level records keyed by student ID. Required variables are mcid and timely_term.
midfield_term Data frame of student-level term observations keyed by student ID. Default is term. Required variables are mcid, institution, and term.

Equivalent usage. The following implementations yield identical results,

# Required arguments in order and explicitly named
x <- add_data_sufficiency(dframe = DT, midfield_term = term)

# Required arguments in order, but not named
y <- add_data_sufficiency(DT, term)

# Using the implicit default for the midfield_term argument
z <- add_data_sufficiency(DT)

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output. Adds the following columns to the data frame.

term_i Student initial term, encoded YYYYT.
lower_limit Initial term of an institution's data range, encoded YYYYT.
upper_limit Final term of an institution's data range, encoded YYYYT.
data_sufficiency Label each observation for inclusion or exclusion based on data sufficiency: "include", indicating that available data are sufficient for estimating timely degree completion; "exclude-upper", indicating that data are insufficient at the upper limit of a data range; or "exclude-lower", indicating that data are insufficient at the lower limit.

# Un-clutter the printout
DT <- DT[, .(mcid, institution, timely_term)]

# Add data sufficiency column and supporting variables
DT <- add_data_sufficiency(DT, term)
DT

Similar to the details described in the previous section, add_data_sufficiency() accepts [Alternate source names] and uses [Silent overwriting] when existing columns have the same name as one of the added columns.

#| include: false
# Find the closer look IDs

x <- copy(DT)
x[data_sufficiency == "exclude-lower"]



DT[mcid == "MCID3111142689"]


x <- add_timely_term(x)
x[adj_span == 4]

Closer look

The data range for the institutions are:

# Data range by institution
term[order(institution), .(min_term = min(term), max_term = max(term)), by = "institution"]

Example 3. Exemplifies "Student A" in Figure 1 or Figure 2. The student attends Institution C which has a data range of 1990--2015. The student's initial term is Fall 2007 so the 1990 lower-limit exclusion does not apply; the student's timely completion term is Spring 2013, so the 2015 upper-limit exclusion does not apply.

# Display one student by ID
DT[mcid == "MCID3112785480"]

Example 4. Exemplifies "Student B" in Figure 1. The student attends Institution B which has a data range of 1988--2018. The student's initial term is Spring 2013 so the 1988 lower-limit exclusion does not apply; the student's timely completion term is Fall 2019, so the 2018 upper-limit exclusion does apply.

# Display one student by ID
DT[mcid == "MCID3111170322"]

Example 5. Exemplifies "Student C" in Figure 2. The student attends Institution B which has a data range of 1988--2009. The student's initial term is Fall 1988 so the 1988 lower-limit exclusion applies.

# Display one student by ID
DT[mcid == "MCID3112056754"]

Reusable code

Preparation. The term data table is the intake for this section.

DT <- copy(term)

Data sufficiency. A summary code chunk for ready reference.

# Filter for data sufficiency, output unique IDs
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include", .(mcid)]
DT <- unique(DT)

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Upper-limit data sufficiency

Lower-limit data sufficiency

Method

Load data

Initial processing

`add_timely_term()`

Closer look

Alternate source names

Silent overwriting

`add_data_sufficiency()`

Closer look

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Upper-limit data sufficiency

Lower-limit data sufficiency

Method

Load data

Initial processing

add_timely_term()

Closer look

Alternate source names

Silent overwriting

add_data_sufficiency()

Closer look

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

`add_timely_term()`

`add_data_sufficiency()`