#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-002-")
Part 2 of a case study in three parts, illustrating how we work with longitudinal student-level records.
Goals. Introducing the study.
Data. Transforming the data to yield the observations of interest.
Results. Summary statistics, metric, chart, and table.
Our data processing goal is to reduce the source data tables to the specific observations needed to compute our metrics. The data processing tasks include filtering observations, creating, renaming, and recoding variables, and joining data frames.
The analysis is organized to produce two data frames---students ever enrolled in the programs and students graduating from the programs---that are joined and written to file as a starting point for developing the results.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(midfielddata) library(data.table)
Load. Practice datasets. View data dictionaries via ?student
, ?term
, ?degree
.
# Load practice data data(student, term, degree)
Select (optional). Reduce the number of columns to the minimum needed by the midfieldr functions.
# Work with required midfieldr variables only student <- select_required(student) term <- select_required(term) degree <- select_required(degree)
Initialize. Assign a working data frame. We often start with the term
dataset.
# Working data frame DT <- copy(term) DT
The result has r nrow(DT)
observations. In the case study, we will typically note the number of observations as they change.
Some student records near the lower and upper terms that bound the available data must be excluded to prevent false summaries involving timely degree completion. To apply this filter, we first determine the timely completion term.
Add variables. Using information in term
, we add the timely_term
variable as well as supporting variables used in its construction.
# Determine a timely completion term for every student DT <- add_timely_term(DT, term) DT
Add variables. Using information in term
, we add the data_sufficiency
variable as well as supporting variables used in its construction.
# Determine data sufficiency for every student DT <- add_data_sufficiency(DT, term) DT
Filter. We filter to retain observations for which the data are sufficient then drop all but the ID variable.
# Retain observations having sufficient data DT <- DT[data_sufficiency == "include"] DT <- DT[, .(mcid)] DT <- unique(DT) DT
The result has r nrow(DT)
observations.
Filter. Use an inner join with student
to retain degree-seeking students only. Select the ID column.
# Filter for degree seeking, output unique IDs DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL] DT <- unique(DT) DT
The result has r nrow(DT)
observations. (No change is expected in this example because all students in the midfielddata practice data are degree-seeking.) We preserve this data frame as a baseline set of IDs to be used again.
baseline <- copy(DT)
In MIDFIELD datasets, the cip6
variable identifies the 6-digit code for the program in which a student is enrolled in a given term.
We have already searched cip
to obtain the codes for the four programs in our case study. The first four digits of the 6-digit CIP codes are:
From cip
, we obtain all codes that start with any of the selected 4-digit codes.
# Four engineering programs using 4-digit CIP codes selected_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437")) selected_programs
Add a variable. User-defined program names are nearly always required. Add a variable to label each of these r nrow(selected_programs)
programs with one of the four conventional program abbreviations we will use in comparing metrics, i.e., Civil (CE), Electrical (EE), Mechanical (ME), and Industrial/Systems Engineering (ISE).
# Recode program labels. Edit as required. selected_programs[, program := fcase( cip6 %like% "^1408", "CE", cip6 %like% "^1410", "EE", cip6 %like% "^1419", "ME", cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE" )]
Confirm that the abbreviations match the original 4-digit CIP names. We also illustrate using options()
to change the number of data.table rows to print.
# Preserve settings op <- options() # Edit number of rows to print options(datatable.print.nrows = 15) # Confirm that abbreviations match the longer program names selected_programs[, .(cip4name, program)]
Having checked that the new abbreviations correctly represent the programs, we can finalize the data frame of program CIPs and names.
selected_programs <- selected_programs[, .(cip6, program)] selected_programs # Restore original settings options(op)
Reset The data frame of baseline IDs is the intake for this section.
# IDs of data-sufficient, degree-seeking students DT <- copy(baseline) DT
The result has r nrow(DT)
observations.
Left join (add a variable). Returns all rows from DT
and rows from term
that match on mcid
---in effect, adding the cip6
variable to DT
. Additionally, because term
contains multiple rows per ID, the merged data frame also has the possibility of multiple rows per ID.
# Left-outer join from term to DT DT <- term[DT, .(mcid, cip6), on = c("mcid")] DT <- unique(DT) DT
The result has r nrow(DT)
observations.
Inner join (add a variable, filter observations). Returns rows in DT
and study_programs
that match on cip6
. In effect, we add a column of program labels to DT
and simultaneously filter DT
to retain rows that match the four case study programs only.
# Join program names and retain desired programs only DT <- study_programs[DT, on = c("cip6"), nomatch = NULL] DT
The result has r nrow(DT)
observations.
Filter. Because students can change CIP codes but remain within the same labeled group (e.g., ISE), we drop the cip6
code and filter for unique combinations of ID and program label.
# Filter for unique ID-program combinations DT[, cip6 := NULL] DT <- unique(DT) DT
The result has r nrow(DT)
observations.
Copy. Set aside the ever enrolled information under a new name to use later for joining with graduates.
# Prepare for joining setcolorder(DT, c("mcid", "program")) ever_enrolled <- copy(DT) ever_enrolled
Reset The data frame of baseline IDs is the intake for this section. As before, the result has r nrow(baseline)
observations.
# IDs of data-sufficient, degree-seeking students DT <- copy(baseline) DT
Add variables. We use term
to again add the timely_term
variable and its supporting variables.
# Add timely completion term DT <- add_timely_term(DT, term) DT
Add variables. We use degree
to add the completion_status
variable and its supporting variables.
# Add completion status DT <- add_completion_status(DT, degree) DT
Filter. Retain observations of timely completers only. Drop unnecessary variables.
# Retain timely completers DT <- DT[completion_status == "timely"] DT <- DT[, .(mcid)] DT
The result has r nrow(DT)
observations.
Left join (add variables). We use a left-join with degree
to add the CIP codes and terms of the degrees earned.
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")] DT
The result has r nrow(DT)
observations.
Inner join (add a variable, filter observations) Again, add a column of program labels and filter by program.
# Join programs DT <- study_programs[DT, on = c("cip6"), nomatch = NULL] DT
The result has r nrow(DT)
observations.
Filter. Students may have earned multiple degrees in different terms. We retain degrees earned in their first degree term only.
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"] DT
The result has r nrow(DT)
observations.
Filter. Drop unnecessary variables and filter for unique observations of ID and program label.
# Filter for unique ID-program combinations DT[, c("cip6", "term_degree") := NULL] DT <- unique(DT) DT
Copy. Set aside the graduates information under a new name to use for joining with ever enrolled.
# Prepare for joining setcolorder(DT, c("mcid", "program")) graduates <- copy(DT) graduates
We plan to group the data by program, bloc, race/ethnicity, and sex. Program is already present. Bloc labels are added next.
Add a variable. We add a bloc
variable to the ever enrolled and graduates data frames before joining.
ever_enrolled[, bloc := "ever_enrolled"] graduates[, bloc := "graduates"]
Join. Combine the two data frames by rows, binding by matching column names.
# Combine two data frames DT <- rbindlist(list(ever_enrolled, graduates), use.names = TRUE) DT
The result has r nrow(DT)
observations.
Add variables. Use a left join, matching on mcid
, to add race/ethnicity and sex to the data frame.
# Join race/ethnicity and sex cols_we_want <- student[, .(mcid, race, sex)] DT <- cols_we_want[DT, on = c("mcid")] DT
Verify prepared data. study_observations
, included with midfieldr, contains the case study information developed above. Here we verify that the two data frames have the same content.
# Demonstrate equivalence check_equiv_frames(DT, study_observations)
In this form, the observations are the starting point for part 3 of the case study.
We examine the study observations for a few specific students to better illustrate the structure of these data.
#| echo: false #| eval: false # Find example IDs x <- graduates[ever_enrolled, on = "mcid"] y <- x[is.na(program)] y$mcid[duplicated(y$mcid)] y <- x[!is.na(program)] y$mcid[duplicated(y$mcid)] x[program == i.program] mcid_we_want <- "MCID3112470255" DT[mcid == mcid_we_want] term[mcid == mcid_we_want] degree[mcid == mcid_we_want]
#| echo: false # Preserve settings op <- options() # Edit number of rows to print options(datatable.print.nrows = 15)
Example 1. This ID yields one observation only. The student was enrolled in Electrical Engineering but did not complete one of the four case study programs.
# Display one student by ID mcid_we_want <- "MCID3111171519" DT[mcid == mcid_we_want]
A closer look at the student's term
record confirms the result: the student was enrolled in CIP 141001 (Electrical Engineering) but switched to CIP 110701 (Computer Science). The degree
record indicates that the student graduated in Computer Science.
# Closer look at term term[mcid == mcid_we_want] # Closer look at degree degree[mcid == mcid_we_want]
Example 2. This ID yields two observations indicating that the student was enrolled in Industrial/Systems Engineering and a timely graduate of that program.
# Display one student by ID mcid_we_want <- "MCID3111150194" DT[mcid == mcid_we_want]
The term
and degree
excerpts confirm those observations.
# Closer look at terms term[mcid == mcid_we_want] # Closer look at degree degree[mcid == mcid_we_want]
Example 3. This ID yields two observations indicating that the student was enrolled in Electrical Engineering and in Civil Engineering but a timely graduate of neither program.
# Display one student by ID mcid_we_want <- "MCID3111264877" DT[mcid == mcid_we_want]
The term
excerpt agrees; the degree
record shows they graduated in CIP 261399 (Biological and Biomedical Sciences).
# Closer look at term term[mcid == mcid_we_want] # Closer look at degree degree[mcid == mcid_we_want]
Example 4. This ID yields four observations indicating that the student was enrolled in Civil, Electrical, and Mechanical Engineering and a timely graduate of Mechanical.
# Display one student by ID mcid_we_want <- "MCID3112470255" DT[mcid == mcid_we_want]
The term
and degree
excerpts confirm those observations.
# Closer look at term term[mcid == mcid_we_want] # Closer look at degree degree[mcid == mcid_we_want]
#| echo: false # Restore original settings options(op)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.