In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-050-")

A bloc is a grouping of student-level records dealt with as a unit, for example, a grouping of starters in a program, graduates of a program, or ever enrolled in a program. We often use a left join merging operation to add one or more variables to a working data frame and filter on those variables to construct the desired bloc.

Different metrics require different blocs. Graduation rate, for example, requires starters and their graduating subset; stickiness requires ever-enrolled and their graduating subset. Subsequent articles describe FYE proxies (special case of starters), Starters, and Graduates. This article treats the ever-enrolled bloc and left joins generally.

Because a bloc is usually defined for specific programs, the final filter applied in gathering a bloc is often an inner join to filter by program labels, as derived in Programs.

This article in the MIDFIELD workflow.

Planning
Initial processing
[Blocs]{.accent}
- [Ever-enrolled]{.accent}
- FYE proxies
- Starters
- Graduates
Groupings
Metrics
Displays

Definitions

Method

We use left joins to add variables to a working data frame and filter for students ever-enrolled in the case study programs. Migrators (if any) yield more than one observation (program) for the same ID.

Filter source student-level data for data sufficiency and degree-seeking.
Gather ever-enrolled
Filter by program.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load. Practice datasets. View data dictionaries via ?student, ?term, ?degree.

# Load practice data
data(student, term, degree)

Loads with midfieldr. Prepared data. View data dictionary via ?study_programs.

study_programs (derived in Programs).

Initial processing

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

Initialize. Assign a working data frame.

# Working data frame
DT <- copy(term)

Data sufficiency. Filter to satisfy the data sufficiency criterion. Code reproduced from Data sufficiency.

# Filter for data sufficiency, output unique IDs
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include", .(mcid)]
DT <- unique(DT)

Degree seeking. Filter to retain degree seeking students via an inner join with student. Code reproduced from Degree seeking.

# Filter for degree seeking, output unique IDs
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
DT <- unique(DT)

Verify prepared data. Many analyses begin, as we do here, by filtering for data sufficiency and degree-seeking. For our convenience in subsequent articles, this set of IDs is included with midfieldr in the data set baseline_mcid. Here we verify that the two data frames have the same content.

# Demonstrate equivalence
check_equiv_frames(DT, baseline_mcid)

#| eval: false
#| echo: false

# Run manually
# Writing external files
setkey(DT, mcid)
setkey(DT, NULL)
baseline_mcid <- copy(DT)
usethis::use_data(baseline_mcid, overwrite = TRUE)

We preserve this data frame as a baseline for examples in the article.

baseline <- copy(DT)

Left joins

An left join is a merge operation between two data frames which returns all observations (rows) of the "left" data frame X and all the matching rows in the "right" data frame Y. Using data.table syntax, we have two approaches: merge(X, Y, by) (similar to base R) and Y[X, j, on] (native to data.table).

Using `merge(X, Y, by)`

The general form for a left join is

    merge(X, Y, by, all.x = TRUE)

where

X is the "left" data frame, all rows returned
Y is the "right" data frame, matching rows returned
by is the vector of shared column names to merge by
all.x = TRUE ensures the left join

In this example, the Y data frame is term, from which we extract the ID and CIP columns before the join.

# Subset of term data frame to join
cols_we_want <- term[, .(mcid, cip6)]

# merge(X, Y, by) left join
merge(DT, cols_we_want, by = c("mcid"), all.x = TRUE)

Alternatively, one can select Y columns within the merge operation.

# merge(X, Y, by) left join
merge(DT, term[, .(mcid, cip6)], by = c("mcid"), all.x = TRUE)

Using `Y[X, j, on]`

The second approach---native to data.table and computationally more efficient---has the form

    Y[X, j, on]

where

X is the "left" data frame, all rows returned
Y is the "right" data frame, matching rows returned
j selects columns from the joined data frame to retain
on is the vector of shared column names to merge on

# Y[X, j, on] left join (data.table native syntax)
term[DT, .(mcid, cip6), on = c("mcid")]

Demonstrate equivalence. Showing that the two approaches produce the same result.

# merge(X, Y, by) left join
x <- merge(DT, term[, .(mcid, cip6)], by = c("mcid"), all.x = TRUE)
setkey(x, NULL)

# Y[X, j, on] left join
y <- term[DT, .(mcid, cip6), on = c("mcid")]

# Demonstrate equivalence
check_equiv_frames(x, y)

Left join matching rules

Rows in X with no match in Y will have NA values in the columns normally filled with Y values.

For example, not all students in DT will earn a degree. After a left join (degree into DT), all rows of DT are returned. IDs in DT with no match in degree have an NA in the term_degree column (a variable from the degree source table).

#| echo: false
options(datatable.print.topn = 5)

x <- degree[DT, .(mcid, term_degree), on = c("mcid")]
setkeyv(x, c("mcid"))
x[]

The result has r length(unique(x$mcid)) unique IDs with r sum(!is.na(x$term_degree)) degrees.

Rows in X with multiple matches in Y yields a new row in X for every matching row in Y.

For example, most students in DT will be enrolled in multiple terms. After a left join (term into DT), all rows in DT are returned. IDs in DT with multiple matches in term have multiple rows in the result, differentiated by the values in the term column (a variable from the term source table).

x <- term[DT, .(mcid, term), on = c("mcid")]
setkeyv(x, c("mcid", "term"))
x[]

The result has r length(unique(x$mcid)) unique IDs distributed over r nrow(x) observations.

#| echo: false
options(datatable.print.topn = 3)

"Left" and "right" matter. In left joins (unlike inner joins), X[Y] and Y[X] return different results:

Y[X, j, on] returns all rows of X
X[Y, j, on] returns all rows of Y

# What we want
x <- degree[DT, .(mcid, term_degree), on = c("mcid")]

# Not what we want
y <- DT[degree, .(mcid, term_degree), on = c("mcid")]

# Same content?
check_equiv_frames(x, y)

# Compare N rows
nrow(x)
nrow(y)

Ever-enrolled

Work. The baseline data frame we preserved earlier is the intake for this section.

# Reusable starting state
DT <- copy(baseline)
DT[]

Add a variable. Use a left join from term to DT to add the CIP variable.

# Left-outer join from term to DT
DT <- term[DT, .(mcid, cip6), on = c("mcid")]

Filter. Filter to retain unique combinations if ID and CIP code.

# One observation per ID-CIP combination
DT <- unique(DT)
DT

Filter by program

Filter. Because "ever-enrolled" usually means "ever-enrolled in specific programs," this bloc concludes with a filter by program. Code reproduced from Groupings.

# Filter by program
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT[, cip6 := NULL]
DT <- unique(DT)
DT

Reusable code

Preparation. The baseline data frame we preserved earlier is the intake for this section.

DT <- copy(baseline)

Ever-enrolled. A summary code chunk for ready reference. Requires editing of study_programs before reuse with different programs.

# Ever-enrolled bloc
DT <- term[DT, .(mcid, cip6), on = c("mcid")]
DT <- unique(DT)

# Filter by program
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT[, cip6 := NULL]
DT <- unique(DT)

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Left joins

Using `merge(X, Y, by)`

Using `Y[X, j, on]`

Left join matching rules

Ever-enrolled

Filter by program

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Left joins

Using merge(X, Y, by)

Using Y[X, j, on]

Left join matching rules

Ever-enrolled

Filter by program

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

Using `merge(X, Y, by)`

Using `Y[X, j, on]`