#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-050-")
A bloc is a grouping of student-level records dealt with as a unit, for example, a grouping of starters in a program, graduates of a program, or ever enrolled in a program. We often use a left join merging operation to add one or more variables to a working data frame and filter on those variables to construct the desired bloc.
Different metrics require different blocs. Graduation rate, for example, requires starters and their graduating subset; stickiness requires ever-enrolled and their graduating subset. Subsequent articles describe FYE proxies (special case of starters), Starters, and Graduates. This article treats the ever-enrolled bloc and left joins generally.
Because a bloc is usually defined for specific programs, the final filter applied in gathering a bloc is often an inner join to filter by program labels, as derived in Programs.
This article in the MIDFIELD workflow.
We use left joins to add variables to a working data frame and filter for students ever-enrolled in the case study programs. Migrators (if any) yield more than one observation (program) for the same ID.
Filter source student-level data for data sufficiency and degree-seeking.
Gather ever-enrolled
Filter by program.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(midfielddata) library(data.table)
Load. Practice datasets. View data dictionaries via ?student
, ?term
, ?degree
.
# Load practice data data(student, term, degree)
Loads with midfieldr. Prepared data. View data dictionary via ?study_programs
.
study_programs
(derived in Programs). Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables source_student <- copy(student) source_term <- copy(term) source_degree <- copy(degree) # Optional. Select variables required by midfieldr functions student <- select_required(source_student) term <- select_required(source_term) degree <- select_required(source_degree)
Initialize. Assign a working data frame.
# Working data frame DT <- copy(term)
Data sufficiency. Filter to satisfy the data sufficiency criterion. Code reproduced from Data sufficiency.
# Filter for data sufficiency, output unique IDs DT <- add_timely_term(DT, term) DT <- add_data_sufficiency(DT, term) DT <- DT[data_sufficiency == "include", .(mcid)] DT <- unique(DT)
Degree seeking. Filter to retain degree seeking students via an inner join with student
. Code reproduced from Degree seeking.
# Filter for degree seeking, output unique IDs DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL] DT <- unique(DT)
Verify prepared data. Many analyses begin, as we do here, by filtering for data sufficiency and degree-seeking. For our convenience in subsequent articles, this set of IDs is included with midfieldr in the data set baseline_mcid
. Here we verify that the two data frames have the same content.
# Demonstrate equivalence check_equiv_frames(DT, baseline_mcid)
#| eval: false #| echo: false # Run manually # Writing external files setkey(DT, mcid) setkey(DT, NULL) baseline_mcid <- copy(DT) usethis::use_data(baseline_mcid, overwrite = TRUE)
We preserve this data frame as a baseline for examples in the article.
baseline <- copy(DT)
An left join is a merge operation between two data frames which returns all observations (rows) of the "left" data frame X
and all the matching rows in the "right" data frame Y
. Using data.table syntax, we have two approaches: merge(X, Y, by)
(similar to base R) and Y[X, j, on]
(native to data.table).
merge(X, Y, by)
The general form for a left join is
merge(X, Y, by, all.x = TRUE)
where
X
is the "left" data frame, all rows returned Y
is the "right" data frame, matching rows returned by
is the vector of shared column names to merge by all.x = TRUE
ensures the left joinIn this example, the Y
data frame is term
, from which we extract the ID and CIP columns before the join.
# Subset of term data frame to join cols_we_want <- term[, .(mcid, cip6)] # merge(X, Y, by) left join merge(DT, cols_we_want, by = c("mcid"), all.x = TRUE)
Alternatively, one can select Y
columns within the merge operation.
# merge(X, Y, by) left join merge(DT, term[, .(mcid, cip6)], by = c("mcid"), all.x = TRUE)
Y[X, j, on]
The second approach---native to data.table and computationally more efficient---has the form
Y[X, j, on]
where
X
is the "left" data frame, all rows returned Y
is the "right" data frame, matching rows returned j
selects columns from the joined data frame to retainon
is the vector of shared column names to merge on # Y[X, j, on] left join (data.table native syntax) term[DT, .(mcid, cip6), on = c("mcid")]
Demonstrate equivalence. Showing that the two approaches produce the same result.
# merge(X, Y, by) left join x <- merge(DT, term[, .(mcid, cip6)], by = c("mcid"), all.x = TRUE) setkey(x, NULL) # Y[X, j, on] left join y <- term[DT, .(mcid, cip6), on = c("mcid")] # Demonstrate equivalence check_equiv_frames(x, y)
Rows in X with no match in Y will have NA values in the columns normally filled with Y
values.
For example, not all students in DT
will earn a degree. After a left join (degree
into DT
), all rows of DT
are returned. IDs in DT
with no match in degree
have an NA in the term_degree
column (a variable from the degree
source table).
#| echo: false options(datatable.print.topn = 5)
x <- degree[DT, .(mcid, term_degree), on = c("mcid")] setkeyv(x, c("mcid")) x[]
The result has r length(unique(x$mcid))
unique IDs with r sum(!is.na(x$term_degree))
degrees.
Rows in X with multiple matches in Y yields a new row in X
for every matching row in Y
.
For example, most students in DT
will be enrolled in multiple terms. After a left join (term
into DT
), all rows in DT
are returned. IDs in DT
with multiple matches in term
have multiple rows in the result, differentiated by the values in the term
column (a variable from the term
source table).
x <- term[DT, .(mcid, term), on = c("mcid")] setkeyv(x, c("mcid", "term")) x[]
The result has r length(unique(x$mcid))
unique IDs distributed over r nrow(x)
observations.
#| echo: false options(datatable.print.topn = 3)
"Left" and "right" matter. In left joins (unlike inner joins), X[Y]
and Y[X]
return different results:
Y[X, j, on]
returns all rows of X
X[Y, j, on]
returns all rows of Y
# What we want x <- degree[DT, .(mcid, term_degree), on = c("mcid")] # Not what we want y <- DT[degree, .(mcid, term_degree), on = c("mcid")] # Same content? check_equiv_frames(x, y) # Compare N rows nrow(x) nrow(y)
Work. The baseline data frame we preserved earlier is the intake for this section.
# Reusable starting state DT <- copy(baseline) DT[]
Add a variable. Use a left join from term
to DT
to add the CIP variable.
# Left-outer join from term to DT DT <- term[DT, .(mcid, cip6), on = c("mcid")]
Filter. Filter to retain unique combinations if ID and CIP code.
# One observation per ID-CIP combination DT <- unique(DT) DT
Filter. Because "ever-enrolled" usually means "ever-enrolled in specific programs," this bloc concludes with a filter by program. Code reproduced from Groupings.
# Filter by program DT <- study_programs[DT, on = c("cip6"), nomatch = NULL] DT[, cip6 := NULL] DT <- unique(DT) DT
Preparation. The baseline data frame we preserved earlier is the intake for this section.
DT <- copy(baseline)
Ever-enrolled. A summary code chunk for ready reference. Requires editing of study_programs
before reuse with different programs.
# Ever-enrolled bloc DT <- term[DT, .(mcid, cip6), on = c("mcid")] DT <- unique(DT) # Filter by program DT <- study_programs[DT, on = c("cip6"), nomatch = NULL] DT[, cip6 := NULL] DT <- unique(DT)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.