#| include: false
knitr::opts_chunk$set(fig.path = ".../man/figures/art-000-")

Familiarity with the MIDFIELD data structure is a prerequisite for working with midfieldr functions so we start by inspecting the practice data in midfielddata, a companion R package that provides anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 through 2018. (If you haven't yet installed midfielddata, see the Installation instructions.)

The practice data are organized in four datasets keyed by student ID.

#| echo: false
library(gt)
wrapr::build_frame(
  "Dataset", "Each row is", "N students", "N rows", "N columns", "Memory" |
    "course", "one student per course & term", "97,555", "3,289,532", 12L, "324.3 MB" |
    "term", "one student per term", "97,555", "639,915", 13L, "72.8 MB" |
    "student", "one student", "97,555", "97,555", 13L, "17.3 MB" |
    "degree", "one student per degree", "49,543", "49,665", 5L, "5.2 MB"
) |>
  gt() |>
  tab_caption("Table 1. Practice datasets in midfielddata") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

Definitions


observation

: Row of a data frame (student, course, term, degree) keyed by student ID.

variable

: Column of a data frame

Method

In this article:


Load data

Start.   If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load data tables.   Data tables can be loaded individually or collectively as needed. View data dictionaries via ?student, ?course, ?term, or ?degree.

# Load one table as needed
data(student)

# Or load multiple tables
data(course, term, degree)

student

Contains one observation per student. Data are assumed to be current at the time the student was admitted to their institution.

student

Student IDs and institution names have been anonymized to remove identifiable information.

# Anonymized IDs
sample(student$mcid, 8)

# Anonymized institutions
sort(unique(student$institution))

Race/ethnicity and sex are often used as grouping variables.

# Possible values
sort(unique(student$race))

# Possible values
sort(unique(student$sex))

Counts in each category.

# N by institution
student[order(institution), .N, by = "institution"]

# N by race
student[order(race), .N, by = "race"]

# N by sex
student[order(sex), .N, by = "sex"]

course

Contains one observation per student per course.

course

The abbrev, number, and discipline_midfield columns have no NA values, so they might be useful if one is filtering for specific course types. The course column, on the other hand, has a high number of NA values.

# Many NA values in this column
sum(is.na(course$course))

# No NA values in these columns.
sum(is.na(course$abbrev))
sum(is.na(course$number))
sum(is.na(course$discipline_midfield))

term

Contains one observation per student per term.

term

Terms are encoded YYYYT, where

For example, for academic year 1995--96,

Different institutions supply data over different time spans.

# Range of data by institution
term[, .(min_term = min(term), max_term = max(term)), by = "institution"]

Programs are encoded in the cip6 variable, a 6-digit character based on the 2010 Classification of Instructional Programs (CIP) [@NCES:2010].

# A sample of cip6 values
sort(unique(sample(term$cip6, 8)))

Student level is used when determining timely completion terms of transfer students.

# Possible values
sort(unique(term$level))

degree

Contains one observation per student per degree.

This dataset contains records for graduates only, thus the number of observations in degree (r nrow(degree)) is less than the number of observations in student (r nrow(student)). The term_degree and cip6 variables indicate when and from which program a student graduates.

degree

Number of degrees earned per student.

# Count students by number of degrees
by_id <- degree[, .(degree_count = .N), by = "mcid"]
by_id[, .(N_students = .N), by = "degree_count"]

Closer look

We display the records for one specific student, using their ID to subset each dataset.

# One student ID
id_we_want <- "MCID3112192438"

Student.   As expected, student yields one row per student.

# Observations for a selected ID
student[mcid == id_we_want]

Course.   For this student, the records span r nrow(course[mcid == id_we_want]) rows, one row per course.

# Observations for a selected ID
course[mcid == id_we_want]

Term.   Here, the records span r nrow(term[mcid == id_we_want]) rows, one row per term.

# Observations for a selected ID
term[mcid == id_we_want]

Degree.   In this example, the records span r nrow(degree[mcid == id_we_want]) rows, one row per degree. The degrees were earned in the same term, Spring 2009.

# Observations for a selected ID
degree[mcid == id_we_want]

Not all students with more than one degree earn them in the same term. For example, the next student earned a degree in 1996 and a second degree in 1999. In most analyses, only the first baccalaureate degree would be used.

# Observations for a different ID
degree[mcid == "MCID3111315508"]

select_required()

A midfieldr convenience function to reduce the number of columns of a MIDFIELD data table after loading. Using this function is optional.

select_required() selects only those columns typically required by other midfieldr functions. Operates on a data frame to retain columns having names that match or partially match search terms. Rows are unaffected.

The primary benefit is reducing screen clutter when viewing data frames during an interactive session. The disadvantage is that the deleted columns are unavailable unless you first set aside a copy of the source file or reload it using data() when you need it.

Arguments.

For example, term records are significantly more compact if we select this minimum set of columns.

# Select variables required by midfieldr functions
select_required(term)

We can add columns if we need them.

# Select additional columns
select_required(term, select_add = c("gpa_term"))

check_equiv_frames()

A function imported from the wrapr package that confirms two data frames are equivalent after reordering columns and rows. Accessible by loading midfieldr.

Example.   Demonstrate that the following implementations of select_required() yield identical results.

# Required argument explicitly named
x <- select_required(midfield_x = term)

# Required argument not named
y <- select_required(term)

# Optional argument, if used, must be named. NULL yields the default columns.
z <- select_required(term, select_add = NULL)

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Demonstrate that row and column order are ignored.

# Two columns from student, use key to order rows
x <- student[, .(mcid, institution)]
setkey(x, mcid)
x

# Same information with different row order, column order, and key
y <- student[, .(institution, mcid)]
setkey(y, institution)
y

# Demonstrate equivalence
check_equiv_frames(x, y)

If the two data tables do not have the same content, the return is FALSE.

# Demonstrate non-equivalence
check_equiv_frames(student, degree)

To explore the differences between non-equivalent data frames, janitor::compare_df_cols() returns a comparison of column names and class.

library(janitor)
compare_df_cols(student, degree)

Reusable code

Preparation.   The immediate prerequisites or “intake” required by the reusable code chunk are the source data tables.

#| eval: false
# Load source data
data(student, term, degree)

Initial data processing.   A summary code chunk for ready reference.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

The copy() function ensures that "by-reference" changes to student, for example, have no effect on source_student [@data.table-reference-semantics]. Thus the source_* objects retain all the original columns, if needed later.

References




MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.