#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-030-")
Most analyses of student-level records omit records of students not seeking degrees. We use an inner join merging operation as a degree-seeking filter. This article treats degree seeking and inner joins generally.
This article in the MIDFIELD workflow.
By design, the student
data table contains records of degree-seeking students only. We use an inner join with student
to filter our working data frame for degree-seeking students.
Filter source SURs for data sufficiency.
Filter for degree seeking.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(midfielddata) library(data.table)
Load. Practice datasets. View data dictionaries via ?student
, ?term
.
# Load practice data data(student, term)
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables source_student <- copy(student) source_term <- copy(term) # Optional. Select variables required by midfieldr functions student <- select_required(source_student) term <- select_required(source_term)
Initialize. Assign a working data frame.
# Working data frame DT <- copy(term)
Data sufficiency. Filter to satisfy the data sufficiency criterion. Code reproduced from Data sufficiency.
# Filter for data sufficiency, output unique IDs DT <- add_timely_term(DT, term) DT <- add_data_sufficiency(DT, term) DT <- DT[data_sufficiency == "include", .(mcid)] DT <- unique(DT)
We preserve this data frame as a baseline for examples in the article.
baseline <- copy(DT)
An inner join is a merge operation that returns all observations (rows) from two data frames that match specified conditions in both. Using data.table syntax, we have two approaches: merge(X, Y, by)
(similar to base R) and Y[X, j, on]
(native to data.table).
merge(X, Y, by)
The general form for an inner join is
merge(X, Y, by, all = FALSE)
where
X
is a data frame, matching rows returnedY
is a data frame, matching rows returnedall = FALSE
ensures the inner join In this example, the Y
data frame is student
, from which we extract the ID column before joining. Otherwise, all columns from both data frames would be returned.
# Select columns in Y cols_we_want <- student[, .(mcid)] # merge(X, Y) inner join merge(DT, cols_we_want, by = c("mcid"), all = FALSE)
Y[X, j, on]
The second approach---native to data.table and computationally more efficient---has the form
Y[X, j, on, nomatch = NULL]
where
X
is a data frame, matching rows returnedY
is a data frame, matching rows returnedj
selects columns from the joined data frame to retain (default all columns)on
is the vector of shared column names to merge by nomatch = NULL
ensures the inner join # Y[X] inner join DT[student, .(mcid), on = c("mcid"), nomatch = NULL]
Demonstrate equivalence. Showing that the two approaches produce the same result and that, for inner joins, X
and Y
are interchangeable.
# merge(X, Y) w <- merge(DT, cols_we_want, by = c("mcid"), all = FALSE) # merge(Y, X) x <- merge(cols_we_want, DT, by = c("mcid"), all = FALSE) # X[Y] y <- DT[student, .(mcid), on = c("mcid"), nomatch = NULL] # Y[X] z <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL] # Demonstrate equivalence check_equiv_frames(w, x) check_equiv_frames(w, y) check_equiv_frames(w, z)
In either method, we can select columns from both data frames. Using merge()
we select the columns by explicitly subsetting the two data frames.
# Selecting columns from both data frames, merge() inner join x <- merge(DT[, .(mcid)], student[, .(mcid, institution)], by = c("mcid"), all = FALSE) setkey(x, NULL) x
In the X[Y, j]
syntax, however, we can list the columns to be returned from both data frames in the j
list, that is, .(var1, var2, etc.)
, without subsetting the original two data frames.
# Selecting columns from both data frames, X[Y] inner join y <- DT[student, .(mcid, institution), on = c("mcid"), nomatch = NULL] y
Demonstrate equivalence. Showing that the two approaches produce the same result.
# Demonstrate equivalence check_equiv_frames(x, y)
Continue. The baseline data frame we preserved earlier is the intake for this section.
# Reusable starting state DT <- copy(baseline) DT
Filter. Use an inner join with student
to filter DT
to retain degree-seeking students.
# Inner join for degree seeking DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
Filter. Filter to ensure IDs are unique.
# One observation per ID DT <- unique(DT) DT
Preparation. The baseline data frame we preserved earlier is the intake for this section.
DT <- copy(baseline)
Degree seeking. A summary code chunk for ready reference.
# Filter for degree seeking, output unique IDs DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL] DT <- unique(DT)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.