In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-030-")

Most analyses of student-level records omit records of students not seeking degrees. We use an inner join merging operation as a degree-seeking filter. This article treats degree seeking and inner joins generally.

This article in the MIDFIELD workflow.

Planning
[Initial processing]{.accent}
- Data sufficiency
- [Degree seeking]{.accent}
- Identify programs
Blocs
Groupings
Metrics
Displays

Definitions

Method

By design, the student data table contains records of degree-seeking students only. We use an inner join with student to filter our working data frame for degree-seeking students.

Filter source SURs for data sufficiency.
Filter for degree seeking.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load. Practice datasets. View data dictionaries via ?student, ?term.

# Load practice data
data(student, term)

Initial processing

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)

Initialize. Assign a working data frame.

# Working data frame
DT <- copy(term)

Data sufficiency. Filter to satisfy the data sufficiency criterion. Code reproduced from Data sufficiency.

# Filter for data sufficiency, output unique IDs
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include", .(mcid)]
DT <- unique(DT)

We preserve this data frame as a baseline for examples in the article.

baseline <- copy(DT)

Inner joins

An inner join is a merge operation that returns all observations (rows) from two data frames that match specified conditions in both. Using data.table syntax, we have two approaches: merge(X, Y, by) (similar to base R) and Y[X, j, on] (native to data.table).

Using `merge(X, Y, by)`

The general form for an inner join is

    merge(X, Y, by, all = FALSE)

where

X is a data frame, matching rows returned
Y is a data frame, matching rows returned
by is the vector of shared column names to merge by
all = FALSE ensures the inner join

In this example, the Y data frame is student, from which we extract the ID column before joining. Otherwise, all columns from both data frames would be returned.

# Select columns in Y
cols_we_want <- student[, .(mcid)]

# merge(X, Y) inner join
merge(DT, cols_we_want, by = c("mcid"), all = FALSE)

Using `Y[X, j, on]`

The second approach---native to data.table and computationally more efficient---has the form

    Y[X, j, on, nomatch = NULL]

where

X is a data frame, matching rows returned
Y is a data frame, matching rows returned
j selects columns from the joined data frame to retain (default all columns)
on is the vector of shared column names to merge by
nomatch = NULL ensures the inner join

# Y[X] inner join
DT[student, .(mcid), on = c("mcid"), nomatch = NULL]

Demonstrate equivalence. Showing that the two approaches produce the same result and that, for inner joins, X and Y are interchangeable.

# merge(X, Y)
w <- merge(DT, cols_we_want, by = c("mcid"), all = FALSE)
# merge(Y, X)
x <- merge(cols_we_want, DT, by = c("mcid"), all = FALSE)
# X[Y]
y <- DT[student, .(mcid), on = c("mcid"), nomatch = NULL]
# Y[X]
z <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]

# Demonstrate equivalence
check_equiv_frames(w, x)
check_equiv_frames(w, y)
check_equiv_frames(w, z)

Selecting columns

In either method, we can select columns from both data frames. Using merge() we select the columns by explicitly subsetting the two data frames.

# Selecting columns from both data frames, merge() inner join
x <- merge(DT[, .(mcid)], student[, .(mcid, institution)], by = c("mcid"), all = FALSE)
setkey(x, NULL)
x

In the X[Y, j] syntax, however, we can list the columns to be returned from both data frames in the j list, that is, .(var1, var2, etc.), without subsetting the original two data frames.

# Selecting columns from both data frames, X[Y] inner join
y <- DT[student, .(mcid, institution), on = c("mcid"), nomatch = NULL]
y

Demonstrate equivalence. Showing that the two approaches produce the same result.

# Demonstrate equivalence
check_equiv_frames(x, y)

Degree seeking

Continue. The baseline data frame we preserved earlier is the intake for this section.

# Reusable starting state
DT <- copy(baseline)
DT

Filter. Use an inner join with student to filter DT to retain degree-seeking students.

# Inner join for degree seeking
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]

Filter. Filter to ensure IDs are unique.

# One observation per ID
DT <- unique(DT)
DT

Reusable code

Preparation. The baseline data frame we preserved earlier is the intake for this section.

DT <- copy(baseline)

Degree seeking. A summary code chunk for ready reference.

# Filter for degree seeking, output unique IDs
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
DT <- unique(DT)

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Inner joins

Using `merge(X, Y, by)`

Using `Y[X, j, on]`

Selecting columns

Degree seeking

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Inner joins

Using merge(X, Y, by)

Using Y[X, j, on]

Selecting columns

Degree seeking

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

Using `merge(X, Y, by)`

Using `Y[X, j, on]`