In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-090-")

We add grouping variables from the MIDFIELD data tables to our blocs in progress. We select these variables to provide the aggregating categories we want for a particular metric. Program labels and student demographics are two of the most common sets of grouping variables we use.

This vignette in the MIDFIELD workflow.

Planning
Initial processing
Blocs
[Groupings]{.accent}
- [Program labels]{.accent}
- [Demographics]{.accent}
- [Other variables]{.accent}
Metrics
Displays

Definitions

Method

We join grouping variables to a bloc after initial processing (data sufficiency and degree seeking) and any other subsetting criteria that define a bloc. The two most common join operations to add grouping variables are:

Program labels using an inner join on CIPs.
Demographics using a left join on IDs.

Other variables too can be usefully joined for grouping operations. We include examples from student, term, and degree.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load. Practice datasets. View data dictionaries via ?student, ?term, ?degree.

# Load practice data
data(student, term, degree)

Loads with midfieldr. Prepared data. View data dictionaries via ?study_programs, ?baseline_mcid.

study_programs (derived in Programs).
baseline_mcid (derived in Blocs).

Initial processing

Select (optional). Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

# Working data frame
DT <- copy(baseline_mcid)

We join a CIP variable for examples that join by CIP.

# Reusable starting state with CIP
baseline_cip <- term[DT, .(mcid, cip6), on = c("mcid")]
baseline_cip <- unique(baseline_cip)
baseline_cip

Program labels

At this point in a typical workflow, we have a bloc of student-level records in progress and a data frame of program labels (similar to study_programs). Both data frames have a 6-digit CIP variable to join by.

Program labels serve two main functions:

Filtering variable to finalize a bloc. For example, "starters" or "graduates" usually mean starters or graduates in specific programs. Thus a bloc procedure typically concludes with a program filter as in Ever-enrolled, Starters, or Graduates.
Grouping variable for summarizing data. Having filtered a bloc to retain records in specific programs, the program label is retained and used with other grouping variables such as race/ethnicity and sex when computing and comparing metrics. Because of its role in groupings, the program label join is developed in detail in this vignette.

Rationale for the inner join. An inner join accomplishes two tasks: adds a column of program labels to the bloc; and filters the bloc to retain only those observations with CIPs matching the desired programs.

Reset The CIP baseline data frame we preserved earlier is the intake for this section.

# Reusable starting state
DT <- copy(baseline_cip)

Filter. An inner join adds the program label and filters for matching CIPs.

# Join program labels via inner join
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT

We can see the filtering effect by noting that the baseline data frame had r nrow(baseline_cip) observations while the inner join on the selected programs returned r nrow(DT) observations. We can also verify the selected programs, e.g.,

# Verify program labels
sort(unique(DT$program))

# Verify program CIP codes
sort(unique(DT$cip6))

Students can migrate between majors having different 6-digit CIP codes, but those codes might be grouped under a single program label. A common example in Engineering is the "Industrial/Systems Engineering" label we assign to following CIP codes:

142701 Systems Engineering
143501 Industrial Engineering
143601 Manufacturing Engineering
143701 Operations Research

A student migrating among these CIPs would appear in multiple rows in the current bloc, yet we would not consider their change of CIP a change of major. The next step addresses this anomaly.

Select. Drop the CIP code.

# Prepare to filter
DT[, cip6 := NULL]
DT

Filter. Filter for unique observations.

# Case study ever enrolled
DT <- unique(DT)
DT

The difference in the number of observations indicates that this example includes one student in ISE with two CIPs. We examine that student's record below.

Closer look

#| echo: false
# find student ID with two CIPs in ISE
# x <- study_programs[term, on = "cip6", nomatch = NULL]
# x <- x[program == "ISE"]
# x <- x[, .(mcid, cip6, program)]
# x <- unique(x)
# x[duplicated(x$mcid)]
# x[mcid == "MCID3111251565"]

Examining the records of selected students in detail.

Example 1. This student's term record includes CIP 142701 (Systems Engineering) and CIP 143501 (Industrial Engineering), both of which are majors in our combined Industrial/Systems Engineering (ISE) major, illustrating our rationale for filtering for unique observations by ID/program pairs and not ID/CIP pairs.

# All terms, one ID
x <- term[mcid == "MCID3111251565", .(mcid, cip6)]

# Join case study program labels
x <- study_programs[x, on = c("cip6"), nomatch = NULL]

# Unique CIPs for this student
unique(x)

Demographics

Demographic variables (race/ethnicity and sex) are regularly left-joined to blocs for grouping and summarizing.

Reset The data frame of baseline IDs is the intake for this section.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

Select. From student, select the join-by variable (student ID) and the variables we want to add. By selecting columns here, we don't have to select columns in the join operation to follow.

# Extract desired columns
cols_we_want <- student[, .(mcid, race, sex)]
cols_we_want

Add variables. Left join two data frames, retaining all variables from both.

# Add demographics
DT <- cols_we_want[DT, on = c("mcid")]
DT

Unknown `race` or `sex`

We often want to remove records for which race/ethnicity or sex are "unknown".

# Display values
unique(DT$race)
unique(DT$sex)

Filter. In data.table syntax, we can use x %ilike% pattern as a case-insensitive wrapper around grepl() to find matches and partial matches.

# Remove records with unknown sex, if any
x <- copy(DT)
x <- x[!sex %ilike% "unknown"]
x

Removing unknown race observations is similar.

# Remove records with unknown sex, if any
x <- x[!race %ilike% "unknown"]
x

Alternatively, these statements can be combined.

# Remove unknowns in either of two columns
DT <- DT[!(sex %ilike% "unknown" | race %ilike% "unknown")]

# Verify equivalence
check_equiv_frames(x, DT)

With "unknowns" removed, the race and sex values are:

sort(unique(DT$race))
sort(unique(DT$sex))

Add `origin`

origin is a demographic variable we use to distinguish "domestic" students from "international" students. The variable is a recoding of the race variable.

Add a variable. Assuming that race/ethnicity "unknown" have been removed, we use a conditional assignment to create the "origin" variable.

# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]
x[]

With "unknowns" removed, the origin values are:

sort(unique(x$origin))

Add `people`

people is a demographic variable we use in many of our summaries. The variable combines the race and sex variables.

Add a variable. We combine race/ethnicity and sex to create a grouping variable.

x <- copy(DT)
x <- x[, people := paste(race, sex)]
x

With "unknowns" removed, the people values are:

sort(unique(x$people))

Add `people` by `origin`

Combining the two ideas above, again assuming that the observations on unknown race/ethnicity and sex have been removed,

# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]

# Combine with sex
x[, people := paste(origin, sex)]

# Omit unnecessary variables
x <- x[, .(mcid, people)]
x

The possible people values are:

sort(unique(x$people))

Other variables

Depending on one's research question, any number of MIDFIELD variables might be used for grouping records. In this section we illustrate joining other variables from student, term, and degree to a working data frame.

We use the original source files copied earlier because some variables we want to use were removed when we applied select_required().

From `student`

Reset Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

The available variables in the source student data are:

# Variables in the practice data set
names(source_student)

Select. Select our variables and the key (ID).

# Extract desired columns
cols_we_want <- source_student[, .(mcid, transfer, hours_transfer)]

Add variables. Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]

From `term`

Reset Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)

The available variables in the source term data are:

# Variables in the practice data set
names(source_term)

Select. Select our variables and the key (ID).

# Extract desired columns
cols_we_want <- source_term[, .(mcid, term, hours_term, gpa_term)]

Add variables. Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]

#| echo: false

x <- cols_we_want[DT, on = c("mcid")]
y <- nrow(x)
z <- length(unique(x$mcid))

Rows in X with multiple matches in Y. Consistent with the left join matching rules, students in enrolled in multiple terms will have multiple rows in the joined data frame. Thus this result has r y observations of r z unique students.

From `degree`

Reset Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

The available variables in the source degree data are:

# Variables in the practice data set
names(source_degree)

Select. Select two variables and the key (ID).

# Extract desired columns
cols_we_want <- source_degree[, .(mcid, cip6, degree)]

Add variables. Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]

#| echo: false
x <- cols_we_want[DT, on = c("mcid")]
y <- length(unique(x$mcid))
z <- x[!is.na(cip6)]
z <- z[, .(mcid)]
z <- length(unique(z$mcid))

Rows in X with no match in Y. Consistent with the left join matching rules, students in DT who do not graduate will have NA values in the term_degree and cip6 columns of the joined data tables. Thus this result has r y unique students of whom r z earned degrees.

Reusable code

Program labels preparation. The CIP baseline data frame we preserved earlier is the intake for this section.

DT <- copy(baseline_cip)

Program labels. A summary code chunk for ready reference. In gathering a bloc of starters, the join-by variable might be start instead of cip6.

# Filter by program
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT[, cip6 := NULL]
DT <- unique(DT)

Demographics preparation. The data frame of baseline IDs is the intake for this section.

DT <- copy(baseline_mcid)

Demographics. A summary code chunk for ready reference.

# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Program labels

Closer look

Demographics

Unknown `race` or `sex`

Add `origin`

Add `people`

Add `people` by `origin`

Other variables

From `student`

From `term`

From `degree`

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr Tools and Methods for Working with MIDFIELD Data in 'R'

In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

Definitions

Method

Load data

Initial processing

Program labels

Closer look

Demographics

Unknown race or sex

Add origin

Add people

Add people by origin

Other variables

From student

From term

From degree

Reusable code

References

R Package Documentation

Browse R Packages

We want your feedback!

MIDFIELDR/midfieldr
Tools and Methods for Working with MIDFIELD Data in 'R'

Unknown `race` or `sex`

Add `origin`

Add `people`

Add `people` by `origin`

From `student`

From `term`

From `degree`