#| include: false
# code chunks
knitr::opts_chunk$set(
  fig.width = 8,
  out.width = "100%",
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  cache = FALSE,
  error = FALSE,
  tidy = FALSE,
  echo = TRUE
)

# inline numbers
knitr::knit_hooks$set(inline = function(x) {
  if (!is.numeric(x)) {
    x
  } else if (x >= 10000) {
    prettyNum(round(x, 2), big.mark = ",")
  } else {
    prettyNum(round(x, 2))
  }
})

# accented text
accent <- function(text_string) {
  kableExtra::text_spec(text_string, color = "#b35806", bold = TRUE)
}

# Backup user options (load packages to capture default options)
suppressPackageStartupMessages(library(data.table))
backup_options <- options()

# Backup user random number seed
oldseed <- NULL
if (exists(".Random.seed")) oldseed <- .Random.seed

# data.table printout
options(
  datatable.print.nrows = 10,
  datatable.print.topn = 3,
  datatable.print.class = FALSE
)


midfielddata is an R data package that supplies anonymized student-level records for 98,000 undergraduates from the MIDFIELD database. Provides practice data for the tools and methods of midfieldr.

Introduction

Data at the "student-level" refers to information collected by undergraduate institutions on individual students, including:

midfielddata provides anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 through 2018, collected in four data tables keyed by student ID.

#| echo: false
wrapr::build_frame(
  "Dataset", "Each row is", "Students", "Rows", "Columns", "Memory" |
    "course", "one student per course", "97,555", "3,289,532", 12L, "324.3 MB" |
    "term", "one student per term", "97,555", "639,915", 13L, "72.8 MB" |
    "student", "one student", "97,555", "97,555", 13L, "17.3 MB" |
    "degree", "one student per degree", "49,543", "49,665", 5L, "5.2 MB"
) |>
  kableExtra::kbl(align = "llrrrr", caption = "Table 1. Practice datasets in `midfielddata`.") |>
  kableExtra::kable_paper(lightable_options = "basic", full_width = FALSE) |>
  kableExtra::row_spec(0, background = "#c7eae5") |>
  kableExtra::column_spec(1, monospace = TRUE) |>
  kableExtra::column_spec(1:6, color = "black", background = "white")

The data in midfielddata are a proportionate stratified sample of the MIDFIELD database, but are not suitable for drawing inferences about program attributes or student experiences---midfielddata are for practice, not research.

Notes on syntax.   We use data.table for data manipulation. Some users may prefer base R or dplyr. Each system has its strengths---users are welcome to translate our examples to their preferred syntax.

format(Sys.Date(), "%Y-%m-%d") # Today's date
packageVersion("midfielddata") # Student-level records practice data
packageVersion("data.table") # For data manipulation

Usage

Start.   If you are writing your own script to follow along, we use these packages in this vignette:

library(midfielddata)
library(data.table)

Load data tables.   Data tables can be loaded individually or collectively as needed.

# Load one table as needed
data(student)

# Or load multiple tables
data(course, term, degree)

We display the records for one specific student, using their ID to subset each dataset.

# One student ID
id_we_want <- "MCID3112192438"

Student.   As expected, student yields one row per student.

# Observations for a selected ID
student[mcid == id_we_want]

Course.   For this student, the records span r nrow(course[mcid == id_we_want]) rows, one row per course.

# Observations for a selected ID
course[mcid == id_we_want]

Term.   Here, the records span r nrow(term[mcid == id_we_want]) rows, one row per term.

# Observations for a selected ID
term[mcid == id_we_want]

Degree.   In this example, the records span r nrow(degree[mcid == id_we_want]) rows, one row per degree. The degrees were earned in the same term, Spring 2009.

# Observations for a selected ID
degree[mcid == id_we_want]

Not all students with more than one degree earn them in the same term. For example, the next student earned a degree in 1996 and a second degree in 1999. In most analyses, only the first baccalaureate degree would be used.

# Observations for a different ID
degree[mcid == "MCID3111315508"]
#| include: false

# Find duplicate IDs in degree
# DT <- copy(degree)
# idx <- which(duplicated(DT[, .(mcid)]))
# dup_ID <- DT[idx, .(mcid)]
# dup_degree <- dup_ID[degree, on = "mcid", nomatch = NULL]
# dup_degree

Installation

Install with:

#| eval: false
install.packages("midfielddata",
  repos = "https://MIDFIELDR.github.io/drat/",
  type = "source"
)

The installed size of midfielddata is about 24 Mb, so installation will take longer than that of a conventional CRAN package. Also because of its size, the package is not hosted on CRAN (with its 5 MB size limit)---instead, we host it on the MIDFIELDR drat repository as indicated above.

Link to installation instructions for midfieldr below.

More information

midfieldr

: A companion R package that provides tools and methods for studying undergraduate student-level records from the MIDFIELD database.

MIDFIELD

: A database of anonymized student-level records for approximately 2.4M undergraduates at 21 US institutions from 1987-2022. Access to this database requires a confidentiality agreement and Institutional Review Board (IRB) approval for human subjects research. For a detailed description of the database, see [@aee2016].

Acknowledgments

This work was supported by the US National Science Foundation through grant numbers 1545667 and 2142087.

References

#| echo: false

# Restore the user options (saved in common-setup.Rmd)
options(backup_options)

# Restore user random number seed if any
if (!is.null(oldseed)) {
  .Random.seed <- oldseed
}

# to change the CSS file
# per https://github.com/rstudio/rmarkdown/issues/732
knitr::opts_chunk$set(echo = FALSE)
blockquote {
    padding:     10px 20px;
    margin:      0 0 20px;
    border-left: 0px
}
caption {
    color:       #525252;
    text-align:  left;
    font-weight: normal;
    font-size:   medium;
    line-height: 1.5;
}


MIDFIELDR/midfielddata documentation built on May 18, 2024, 9:48 p.m.