#| include: false knitr::opts_chunk$set(fig.path = ".../man/figures/art-000-")
Familiarity with the MIDFIELD data structure is a prerequisite for working with midfieldr functions so we start by inspecting the practice data in midfielddata, a companion R package that provides anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 through 2018. (If you haven't yet installed midfielddata, see the Installation instructions.)
The practice data are organized in four datasets keyed by student ID.
#| echo: false library(gt) wrapr::build_frame( "Dataset", "Each row is", "N students", "N rows", "N columns", "Memory" | "course", "one student per course & term", "97,555", "3,289,532", 12L, "324.3 MB" | "term", "one student per term", "97,555", "639,915", 13L, "72.8 MB" | "student", "one student", "97,555", "97,555", 13L, "17.3 MB" | "degree", "one student per degree", "49,543", "49,665", 5L, "5.2 MB" ) |> gt() |> tab_caption("Table 1. Practice datasets in midfielddata") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
observation
: Row of a data frame (student
, course
, term
, degree
) keyed by student ID.
variable
: Column of a data frame
In this article:
select_required()
and wrapr check_equiv_frames()
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(midfielddata) library(data.table)
Load data tables. Data tables can be loaded individually or collectively as needed. View data dictionaries via ?student
, ?course
, ?term
, or ?degree
.
# Load one table as needed data(student) # Or load multiple tables data(course, term, degree)
student
Contains one observation per student. Data are assumed to be current at the time the student was admitted to their institution.
student
Student IDs and institution names have been anonymized to remove identifiable information.
# Anonymized IDs sample(student$mcid, 8) # Anonymized institutions sort(unique(student$institution))
Race/ethnicity and sex are often used as grouping variables.
# Possible values sort(unique(student$race)) # Possible values sort(unique(student$sex))
Counts in each category.
# N by institution student[order(institution), .N, by = "institution"] # N by race student[order(race), .N, by = "race"] # N by sex student[order(sex), .N, by = "sex"]
course
Contains one observation per student per course.
course
The abbrev
, number
, and discipline_midfield
columns have no NA values, so they might be useful if one is filtering for specific course types. The course
column, on the other hand, has a high number of NA values.
# Many NA values in this column sum(is.na(course$course)) # No NA values in these columns. sum(is.na(course$abbrev)) sum(is.na(course$number)) sum(is.na(course$discipline_midfield))
term
Contains one observation per student per term.
term
Terms are encoded YYYYT
, where
YYYY
is the year at the start of the academic year, and T
encodes the semester or quarter---Fall (1
), Winter (2
), Spring (3
), and Summer (4
, 5
, and 6
)---within an academic yearFor example, for academic year 1995--96,
19951
encodes Fall 95--9619953
encodes Spring 95--9619954
encodes Summer 95--96 (first session)Different institutions supply data over different time spans.
# Range of data by institution term[, .(min_term = min(term), max_term = max(term)), by = "institution"]
Programs are encoded in the cip6
variable, a 6-digit character based on the 2010 Classification of Instructional Programs (CIP) [@NCES:2010].
# A sample of cip6 values sort(unique(sample(term$cip6, 8)))
Student level is used when determining timely completion terms of transfer students.
# Possible values sort(unique(term$level))
degree
Contains one observation per student per degree.
This dataset contains records for graduates only, thus the number of observations in degree
(r nrow(degree)
) is less than the number of observations in student
(r nrow(student)
). The term_degree
and cip6
variables indicate when and from which program a student graduates.
degree
Number of degrees earned per student.
# Count students by number of degrees by_id <- degree[, .(degree_count = .N), by = "mcid"] by_id[, .(N_students = .N), by = "degree_count"]
We display the records for one specific student, using their ID to subset each dataset.
# One student ID id_we_want <- "MCID3112192438"
Student. As expected, student
yields one row per student.
# Observations for a selected ID student[mcid == id_we_want]
Course. For this student, the records span r nrow(course[mcid == id_we_want])
rows, one row per course.
# Observations for a selected ID course[mcid == id_we_want]
Term. Here, the records span r nrow(term[mcid == id_we_want])
rows, one row per term.
# Observations for a selected ID term[mcid == id_we_want]
Degree. In this example, the records span r nrow(degree[mcid == id_we_want])
rows, one row per degree. The degrees were earned in the same term, Spring 2009.
# Observations for a selected ID degree[mcid == id_we_want]
Not all students with more than one degree earn them in the same term. For example, the next student earned a degree in 1996 and a second degree in 1999. In most analyses, only the first baccalaureate degree would be used.
# Observations for a different ID degree[mcid == "MCID3111315508"]
select_required()
A midfieldr convenience function to reduce the number of columns of a MIDFIELD data table after loading. Using this function is optional.
select_required()
selects only those columns typically required by other midfieldr functions. Operates on a data frame to retain columns having names that match or partially match search terms. Rows are unaffected.
The primary benefit is reducing screen clutter when viewing data frames during an interactive session. The disadvantage is that the deleted columns are unavailable unless you first set aside a copy of the source file or reload it using data()
when you need it.
Arguments.
midfield_x
MIDFIELD data frame, typically student
, term
, or degree
.
select_add
Optional character vector of search terms to add to the default vector given by c("mcid", "institution", "race", "sex", "^term", "cip6", "level")
. Argument, if used, must be used by name.
For example, term records are significantly more compact if we select this minimum set of columns.
# Select variables required by midfieldr functions select_required(term)
We can add columns if we need them.
# Select additional columns select_required(term, select_add = c("gpa_term"))
check_equiv_frames()
A function imported from the wrapr package that confirms two data frames are equivalent after reordering columns and rows. Accessible by loading midfieldr.
Example. Demonstrate that the following implementations of select_required()
yield identical results.
# Required argument explicitly named x <- select_required(midfield_x = term) # Required argument not named y <- select_required(term) # Optional argument, if used, must be named. NULL yields the default columns. z <- select_required(term, select_add = NULL) # Demonstrate equivalence check_equiv_frames(x, y) check_equiv_frames(x, z)
Demonstrate that row and column order are ignored.
# Two columns from student, use key to order rows x <- student[, .(mcid, institution)] setkey(x, mcid) x # Same information with different row order, column order, and key y <- student[, .(institution, mcid)] setkey(y, institution) y # Demonstrate equivalence check_equiv_frames(x, y)
If the two data tables do not have the same content, the return is FALSE.
# Demonstrate non-equivalence check_equiv_frames(student, degree)
To explore the differences between non-equivalent data frames, janitor::compare_df_cols()
returns a comparison of column names and class.
library(janitor) compare_df_cols(student, degree)
Preparation. The immediate prerequisites or “intake” required by the reusable code chunk are the source data tables.
#| eval: false # Load source data data(student, term, degree)
Initial data processing. A summary code chunk for ready reference.
# Optional. Copy of source files with all variables source_student <- copy(student) source_term <- copy(term) source_degree <- copy(degree) # Optional. Select variables required by midfieldr functions student <- select_required(source_student) term <- select_required(source_term) degree <- select_required(source_degree)
The copy()
function ensures that "by-reference" changes to student
, for example, have no effect on source_student
[@data.table-reference-semantics]. Thus the source_*
objects retain all the original columns, if needed later.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.