midfielddata is an R data package that supplies anonymized student-level records for 98,000 undergraduates from the MIDFIELD database. It provides practice data for the tools and methods of midfieldr.


Data at the "student level" refers to information collected by undergraduate institutions on individual students.

midfielddata provides anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 through 2018, collected in four data tables keyed by student ID.

Table 1. Practice datasets in `midfielddata`.

| Dataset | Each row is | Students | Rows | Columns | Memory |
|:--------|:------------|---------:|-----:|--------:|-------:|
| `course` | one student per course | 97,555 | 3,289,532 | 12 | 324.3 MB |
| `term` | one student per term | 97,555 | 639,915 | 13 | 72.8 MB |
| `student` | one student | 97,555 | 97,555 | 13 | 17.3 MB |
| `degree` | one student per degree | 49,543 | 49,665 | 5 | 5.2 MB |
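
The Students and Rows counts differ for degree because some students earn more than one degree. A minimal sketch of that distinction, using made-up records rather than the package data (the object toy_degree and all of its values are hypothetical):

```r
library(data.table)

# Hypothetical degree records: three students, one of whom earned two degrees
toy_degree <- data.table(
  mcid   = c("ID001", "ID001", "ID002", "ID003"),
  degree = c("BS Physics", "BS Mathematics", "BA History", "BS Biology")
)

# "Rows" counts observations; "Students" counts distinct IDs
n_rows     <- nrow(toy_degree)
n_students <- uniqueN(toy_degree$mcid)
c(students = n_students, rows = n_rows)
```

The same two calls could be applied to any of the four tables after loading it.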

The data in midfielddata are a proportionate stratified sample of the MIDFIELD database, but are not suitable for drawing inferences about program attributes or student experiences. midfielddata are for practice, not research.
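
To illustrate what proportionate stratified sampling means in general, the sketch below draws the same fraction from each stratum of a made-up population, so stratum shares in the sample match those in the population. The strata, the sampling fraction, and all object names are hypothetical; this is not the procedure used to build midfielddata.

```r
library(data.table)

set.seed(1)

# Hypothetical population: 1,000 records in three strata of unequal size
population <- data.table(
  id      = sprintf("S%04d", 1:1000),
  stratum = rep(c("A", "B", "C"), times = c(600, 300, 100))
)

# Proportionate sampling: draw the same fraction from every stratum
frac <- 0.10
smpl <- population[, .SD[sample(.N, size = round(.N * frac))], by = stratum]

# Stratum shares in the population and in the sample
pop_share  <- population[, .(share = .N / nrow(population)), by = stratum]
smpl_share <- smpl[, .(share = .N / nrow(smpl)), by = stratum]
pop_share
smpl_share
```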

Notes on syntax.   We use data.table for data manipulation. Some users may prefer base R or dplyr. Each system has its strengths, and users are welcome to translate our examples to their preferred syntax.
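
As one possible translation, the sketch below sets the row-subsetting pattern used in this document next to base R equivalents. The data frame toy and its values are made up for illustration.

```r
# Hypothetical records; the IDs and term codes are made up
toy <- data.frame(
  mcid = c("ID001", "ID001", "ID002"),
  term = c("20081", "20083", "20081"),
  stringsAsFactors = FALSE
)
id_we_want <- "ID001"

# data.table syntax, as used in this document (toy would need to be a data.table):
#   toy[mcid == id_we_want]
# dplyr syntax:
#   dplyr::filter(toy, mcid == id_we_want)

# Base R equivalents
by_index  <- toy[toy$mcid == id_we_want, ]
by_subset <- subset(toy, mcid == id_we_want)
by_index
```

Both base R forms return the same two rows for this ID.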

Start.   If you are writing your own script to follow along, these are the packages used in this vignette:

# Packages
library("midfielddata")
library("data.table")

Load data tables.   Data tables can be loaded individually or collectively as needed.

# Load one table as needed
data(student)

# Or load multiple tables
data(course, term, degree)

We display the records for one specific student, using their ID to subset each dataset.

# One student ID
id_we_want <- "MCID3112192438"

Student.   As expected, student yields one row per student.

# Observations for a selected ID
student[mcid == id_we_want]

Course.   For this student, the records span multiple rows, one row per course; nrow(course[mcid == id_we_want]) returns the exact count.

# Observations for a selected ID
course[mcid == id_we_want]

Term.   Here, the records span multiple rows, one row per term; nrow(term[mcid == id_we_want]) returns the exact count.

# Observations for a selected ID
term[mcid == id_we_want]

Degree.   In this example, the records span more than one row, one row per degree; nrow(degree[mcid == id_we_want]) returns the exact count. The degrees were earned in the same term, Spring 2009.

# Observations for a selected ID
degree[mcid == id_we_want]

Not all students with more than one degree earn them in the same term. For example, the next student earned a degree in 1996 and a second degree in 1999. In most analyses, only the first baccalaureate degree would be used.

# Observations for a different ID
degree[mcid == "MCID3111315508"]
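
One way to keep only the first degree per student is to sort by term and take the first row within each ID. The sketch below uses made-up records. The column name term_degree, the column program, and the sortable term codes are assumptions for illustration, so check them against the actual degree table before reusing the pattern.

```r
library(data.table)

# Hypothetical degree records; IDs, term codes, and programs are made up
toy_degree <- data.table(
  mcid        = c("ID001", "ID001", "ID002", "ID003", "ID003"),
  term_degree = c("19991", "19961", "20091", "20091", "20091"),
  program     = c("Physics", "Mathematics", "Biology", "Economics", "History")
)

# Sort by student and term, then keep the first row for each student
setorder(toy_degree, mcid, term_degree)
first_degree <- toy_degree[, .SD[1], by = mcid]
first_degree
```

When a student earns two degrees in the same term, as ID003 does here, this keeps whichever row comes first after sorting, so an analysis may need an explicit tie-breaking rule.
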
Install with:

The installed size of midfielddata is about 24 MB, so installation takes longer than for a conventional CRAN package. Because of its size, the package is not hosted on CRAN (with its 5 MB size limit). Instead, we host it on the MIDFIELDR drat repository as indicated above.

Installation instructions for midfieldr are linked below.

More information


midfieldr
: A companion R package that provides tools and methods for studying undergraduate student-level records from the MIDFIELD database.


MIDFIELD
: A database of anonymized student-level records for approximately 2.4 million undergraduates at 21 US institutions from 1987 through 2022. Access to this database requires a confidentiality agreement and Institutional Review Board (IRB) approval for human subjects research. For a detailed description of the database, see [@aee2016].


This work was supported by the US National Science Foundation through grant numbers 1545667 and 2142087.


