
Part 2 of a case study in three parts, illustrating how we work with longitudinal student-level records.

  1. Goals.   Introducing the study.

  2. Data.   Transforming the data to yield the observations of interest.

  3. Results.   Summary statistics, metric, chart, and table.

Method

Our data processing goal is to reduce the source data tables to the specific observations needed to compute our metrics. The data processing tasks include filtering observations; creating, renaming, and recoding variables; and joining data frames.

The analysis is organized to produce two data frames---students ever enrolled in the programs and students graduating from the programs---that are joined and written to file as a starting point for developing the results.


Load data

Start.   If you are writing your own script to follow along, load the packages used in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load.   Load the practice datasets. View the data dictionaries via ?student, ?term, and ?degree.

# Load practice data
data(student, term, degree)

Initial processing

Select (optional).   Reduce the number of columns to the minimum needed by the midfieldr functions.

# Work with required midfieldr variables only
student <- select_required(student)
term <- select_required(term)
degree <- select_required(degree)

Initialize.   Assign a working data frame. We often start with the term dataset.

# Working data frame
DT <- copy(term)
DT

The result has r nrow(DT) observations. In the case study, we typically note the number of observations as it changes.
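
The reason for copy() rather than plain assignment can be sketched with a toy table. This is a minimal illustration, assuming hypothetical data (src, alias, and work are made-up names); it shows that data.table's := operator modifies a table by reference, so a plain assignment would leave the source data exposed to later edits.

```r
library(data.table)

# Hypothetical stand-in for a source table
src <- data.table(mcid = c("A", "B"), cip6 = c("140801", "141001"))

alias <- src        # second name for the same object, no copy made
work  <- copy(src)  # independent working copy

# := adds the column by reference, so src changes too
alias[, flag := TRUE]

"flag" %in% names(src)   # TRUE: src was modified through alias
"flag" %in% names(work)  # FALSE: the copy is unaffected
```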

Filter for data sufficiency

Some student records near the lower and upper terms that bound the available data must be excluded to prevent false summaries involving timely degree completion. To apply this filter, we first determine the timely completion term.


Add variables.   Using information in term, we add the timely_term variable as well as supporting variables used in its construction.

# Determine a timely completion term for every student
DT <- add_timely_term(DT, term)
DT

Add variables.   Using information in term, we add the data_sufficiency variable as well as supporting variables used in its construction.

# Determine data sufficiency for every student
DT <- add_data_sufficiency(DT, term)
DT

Filter.   We retain observations for which the data are sufficient, then drop all but the ID variable.

# Retain observations having sufficient data
DT <- DT[data_sufficiency == "include"]
DT <- DT[, .(mcid)]
DT <- unique(DT)
DT

The result has r nrow(DT) observations.
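
The filter-then-deduplicate pattern can be sketched on a toy table. The rows and the "exclude" label below are hypothetical placeholders, not values taken from the practice data; only the "include" label comes from the code above.

```r
library(data.table)

# Hypothetical term-level rows: several rows per student
toy <- data.table(
  mcid = c("A", "A", "B", "B", "C"),
  data_sufficiency = c("include", "include", "exclude", "exclude", "include")
)

# Keep sufficient rows, keep the ID column only, then collapse to one row per ID
ids <- toy[data_sufficiency == "include", .(mcid)]
ids <- unique(ids)

ids$mcid   # "A" "C"
```

Because the source table has one row per student per term, unique() is what turns term-level observations into a set of student IDs.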

Filter for degree seeking


Filter.   Use an inner join with student to retain degree-seeking students only. Select the ID column.

# Filter for degree seeking, output unique IDs
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
DT <- unique(DT)
DT

The result has r nrow(DT) observations. (No change is expected in this example because all students in the midfielddata practice data are degree-seeking.) We preserve this data frame as a baseline set of IDs to be used again.

baseline <- copy(DT)
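
The join syntax used above may be unfamiliar, so here is a minimal sketch on hypothetical data (student_toy and ids are made-up names). In X[Y, on = ...], each row of Y is looked up in X; nomatch = NULL drops the rows of Y that have no match, which makes the join an inner join.

```r
library(data.table)

# Hypothetical tables
student_toy <- data.table(mcid = c("A", "B", "C"), race = c("x", "y", "z"))
ids <- data.table(mcid = c("B", "C", "D"))

# Inner join: "D" has no match in student_toy and is dropped
inner <- student_toy[ids, .(mcid), on = "mcid", nomatch = NULL]
inner$mcid   # "B" "C"

# Without nomatch = NULL the unmatched row is kept and filled with NA
left <- student_toy[ids, on = "mcid"]
left
```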

Identify programs

In MIDFIELD datasets, the cip6 variable identifies the 6-digit code for the program in which a student is enrolled in a given term.


We have already searched cip to obtain the codes for the four programs in our case study. The first four digits of the 6-digit CIP codes are 1408 (Civil), 1410 (Electrical), 1419 (Mechanical), and 1427, 1435, 1436, and 1437 (Industrial/Systems).

From cip, we obtain all codes that start with any of the selected 4-digit codes.

# Four engineering programs using 4-digit CIP codes
selected_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437"))
selected_programs

Add a variable.   User-defined program names are nearly always required. Add a variable to label each of these r nrow(selected_programs) programs with one of the four conventional program abbreviations we will use in comparing metrics, i.e., Civil (CE), Electrical (EE), Mechanical (ME), and Industrial/Systems Engineering (ISE).

# Recode program labels. Edit as required.
selected_programs[, program := fcase(
  cip6 %like% "^1408", "CE",
  cip6 %like% "^1410", "EE",
  cip6 %like% "^1419", "ME",
  cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"
)]
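
To see how the recoding behaves, the same fcase() call can be run on a small hypothetical table (codes is a made-up name, and the fifth code is included only as an example of a non-matching row). fcase() evaluates the conditions in order, and a row that matches none of them receives NA.

```r
library(data.table)

# Hypothetical 6-digit codes; the last one matches no condition
codes <- data.table(cip6 = c("140801", "141001", "141901", "143501", "110701"))

codes[, program := fcase(
  cip6 %like% "^1408", "CE",                                     # regex on the leading digits
  cip6 %like% "^1410", "EE",
  cip6 %like% "^1419", "ME",
  cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"   # exact character match
)]

codes$program   # "CE" "EE" "ME" "ISE" NA
```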

Confirm that the abbreviations match the original 4-digit CIP names. We also illustrate using options() to change the number of data.table rows to print.

# Preserve settings
op <- options()
# Edit number of rows to print
options(datatable.print.nrows = 15)

# Confirm that abbreviations match the longer program names
selected_programs[, .(cip4name, program)]

Having checked that the new abbreviations correctly represent the programs, we can finalize the data frame of program CIPs and names.

selected_programs <- selected_programs[, .(cip6, program)]
selected_programs

# Restore original settings
options(op)
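
As an alternative sketch of the same preserve-and-restore idea: when options() is called to set a value, it returns the previous values invisibly, so only the options that were changed need to be saved. The names before, during, and after are introduced here for illustration.

```r
library(data.table)

before <- getOption("datatable.print.nrows")

# Setting an option returns the old value(s) in a list
old <- options(datatable.print.nrows = 15)
during <- getOption("datatable.print.nrows")

# Restore only what was changed
options(old)
after <- getOption("datatable.print.nrows")

identical(before, after)   # TRUE
```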

Gather ever-enrolled

Reset.   The data frame of baseline IDs is the intake for this section.

# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT

The result has r nrow(DT) observations.


Left join (add a variable).   Returns all rows from DT and rows from term that match on mcid---in effect, adding the cip6 variable to DT. Additionally, because term contains multiple rows per ID, the merged data frame also has the possibility of multiple rows per ID.

# Left-outer join from term to DT
DT <- term[DT, .(mcid, cip6), on = c("mcid")]
DT <- unique(DT)
DT

The result has r nrow(DT) observations.
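
The row multiplication described above can be sketched on hypothetical data (ids and term_toy are made-up names). One ID matched against several term rows yields several rows, and unique() then collapses repeated ID-code pairs.

```r
library(data.table)

# Hypothetical IDs and term records
ids <- data.table(mcid = c("A", "B"))
term_toy <- data.table(
  mcid = c("A", "A", "A", "B", "C"),
  term = c("19911", "19913", "19921", "19911", "19911"),
  cip6 = c("140801", "140801", "141001", "141901", "110701")
)

# Left join: every row of ids is kept, matched to all of its term rows
joined <- term_toy[ids, .(mcid, cip6), on = "mcid"]
nrow(joined)   # 4: three rows for "A", one for "B"

# Collapse repeated ID-code pairs
joined <- unique(joined)
nrow(joined)   # 3
```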

Inner join (add a variable, filter observations).   Returns rows in DT and selected_programs that match on cip6. In effect, we add a column of program labels to DT and simultaneously filter DT to retain rows that match the four case study programs only.

# Join program names and retain desired programs only
DT <- selected_programs[DT, on = c("cip6"), nomatch = NULL]
DT

The result has r nrow(DT) observations.

Filter.   Because students can change CIP codes but remain within the same labeled group (e.g., ISE), we drop the cip6 code and filter for unique combinations of ID and program label.

# Filter for unique ID-program combinations
DT[, cip6 := NULL]
DT <- unique(DT)
DT

The result has r nrow(DT) observations.
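
A small sketch, using hypothetical rows, shows why dropping the code matters: a student with two different ISE codes appears twice until cip6 is removed, after which the two rows are identical and unique() keeps one.

```r
library(data.table)

# Hypothetical student "A" enrolled under two codes that share a label
toy <- data.table(
  mcid    = c("A", "A", "B"),
  cip6    = c("142701", "143501", "140801"),
  program = c("ISE", "ISE", "CE")
)

toy[, cip6 := NULL]   # remove the column by reference
toy <- unique(toy)    # one row per ID-program combination

nrow(toy)   # 2
```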

Copy.   Set aside the ever enrolled information under a new name to use later for joining with graduates.

# Prepare for joining
setcolorder(DT, c("mcid", "program"))
ever_enrolled <- copy(DT)
ever_enrolled

Gather graduates

Reset.   The data frame of baseline IDs is the intake for this section. As before, the result has r nrow(baseline) observations.

# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT

Add variables.   We use term to again add the timely_term variable and its supporting variables.

# Add timely completion term
DT <- add_timely_term(DT, term)
DT

Add variables.   We use degree to add the completion_status variable and its supporting variables.

# Add completion status
DT <- add_completion_status(DT, degree)
DT

Filter.   Retain observations of timely completers only. Drop unnecessary variables.

# Retain timely completers
DT <- DT[completion_status == "timely"]
DT <- DT[, .(mcid)]
DT

The result has r nrow(DT) observations.

Left join (add variables).   We use a left join with degree to add the CIP codes and terms of the degrees earned.

# Join degree CIP codes and degree terms
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]
DT

The result has r nrow(DT) observations.

Inner join (add a variable, filter observations).   Again, add a column of program labels and filter by program.

# Join programs
DT <- selected_programs[DT, on = c("cip6"), nomatch = NULL]
DT

The result has r nrow(DT) observations.

Filter.   Students may have earned multiple degrees in different terms. For each student, we retain the row with the earliest degree term only.

# Retain each student's earliest degree term
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT

The result has r nrow(DT) observations.
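
The grouped subset can be sketched on hypothetical degree rows (deg and first are made-up names). Within each mcid group, which.min() returns the position of the smallest term_degree, and .SD[...] keeps that single row. Two points in this sketch are worth noting as assumptions: the term codes are stored as character strings that which.min() coerces to numbers, and when two rows share the earliest term only the first is kept.

```r
library(data.table)

# Hypothetical degree records
deg <- data.table(
  mcid        = c("A", "A", "B", "C", "C"),
  term_degree = c("19963", "19951", "19971", "19981", "19981"),
  program     = c("EE", "ME", "CE", "CE", "EE")
)

# One row per student: the row with the earliest degree term
first <- deg[, .SD[which.min(term_degree)], by = "mcid"]
first
```

Here student "A" keeps the ME degree from the earlier term, and student "C", with two degrees in the same term, keeps the first row listed.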

Filter.   Drop unnecessary variables and filter for unique observations of ID and program label.

# Filter for unique ID-program combinations
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT

Copy.   Set aside the graduates information under a new name to use for joining with ever enrolled.

# Prepare for joining
setcolorder(DT, c("mcid", "program"))
graduates <- copy(DT)
graduates

Add groupings

We plan to group the data by program, bloc, race/ethnicity, and sex. Program is already present. Bloc labels are added next.


Add a variable.   We add a bloc variable to the ever enrolled and graduates data frames before joining.

ever_enrolled[, bloc := "ever_enrolled"]
graduates[, bloc := "graduates"]

Join.   Combine the two data frames by rows, binding by matching column names.

# Combine two data frames
DT <- rbindlist(list(ever_enrolled, graduates), use.names = TRUE)
DT

The result has r nrow(DT) observations.
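
The role of use.names = TRUE can be sketched with two hypothetical one-row tables whose columns are in different orders. Binding by name places each value in the correct column regardless of position.

```r
library(data.table)

# Hypothetical rows with the same columns in different orders
a <- data.table(mcid = "A", program = "CE", bloc = "ever_enrolled")
b <- data.table(program = "CE", bloc = "graduates", mcid = "A")

# Columns are matched by name; the result follows the order of the first table
stacked <- rbindlist(list(a, b), use.names = TRUE)

names(stacked)   # "mcid" "program" "bloc"
stacked$bloc     # "ever_enrolled" "graduates"
```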


Add variables.   Use a left join, matching on mcid, to add race/ethnicity and sex to the data frame.

# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT

Verify prepared data.   study_observations, included with midfieldr, contains the case study information developed above. Here we verify that the two data frames have the same content.

# Demonstrate equivalence
check_equiv_frames(DT, study_observations)
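
For readers who want a comparable check using data.table alone, the sketch below compares two hypothetical tables while ignoring row and column order. This is an illustration of the idea of equivalent content, not a description of how check_equiv_frames() is implemented.

```r
library(data.table)

# Hypothetical tables with the same content in different row and column order
x <- data.table(mcid = c("A", "B"), program = c("CE", "EE"))
y <- data.table(program = c("EE", "CE"), mcid = c("B", "A"))

# Content comparison that ignores ordering
same_content <- isTRUE(all.equal(x, y, ignore.row.order = TRUE, ignore.col.order = TRUE))

# Default comparison is order-sensitive
same_layout <- isTRUE(all.equal(x, y))

same_content   # TRUE
same_layout    # FALSE
```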

In this form, the observations are the starting point for part 3 of the case study.

Closer look

We examine the study observations for a few specific students to better illustrate the structure of these data.

#| echo: false
# Preserve settings
op <- options()

# Edit number of rows to print
options(datatable.print.nrows = 15)

Example 1.   This ID yields one observation only. The student was enrolled in Electrical Engineering but did not complete one of the four case study programs.

# Display one student by ID
mcid_we_want <- "MCID3111171519"
DT[mcid == mcid_we_want]

A closer look at the student's term record confirms the result: the student was enrolled in CIP 141001 (Electrical Engineering) but switched to CIP 110701 (Computer Science). The degree record indicates that the student graduated in Computer Science.

# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]

Example 2.   This ID yields two observations, indicating that the student was enrolled in Industrial/Systems Engineering and was a timely graduate of that program.

# Display one student by ID
mcid_we_want <- "MCID3111150194"
DT[mcid == mcid_we_want]

The term and degree excerpts confirm those observations.

# Closer look at terms
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]

Example 3.   This ID yields two observations, indicating that the student was enrolled in Electrical Engineering and in Civil Engineering but was a timely graduate of neither program.

# Display one student by ID
mcid_we_want <- "MCID3111264877"
DT[mcid == mcid_we_want]

The term excerpt agrees; the degree record shows they graduated in CIP 261399 (Biological and Biomedical Sciences).

# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]

Example 4.   This ID yields four observations, indicating that the student was enrolled in Civil, Electrical, and Mechanical Engineering and was a timely graduate of Mechanical Engineering.

# Display one student by ID
mcid_we_want <- "MCID3112470255"
DT[mcid == mcid_we_want]

The term and degree excerpts confirm those observations.

# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]
#| echo: false
# Restore original settings
options(op)

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.