#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-090-")

We add grouping variables from the MIDFIELD data tables to our blocs in progress. We select these variables to provide the aggregating categories we want for a particular metric. Program labels and student demographics are two of the most common sets of grouping variables we use.

This vignette in the MIDFIELD workflow.

  1. Planning
  2. Initial processing
  3. Blocs
  4. [Groupings]{.accent}
    • [Program labels]{.accent}
    • [Demographics]{.accent}
    • [Other variables]{.accent}
  5. Metrics
  6. Displays

Definitions



Method

We join grouping variables to a bloc after initial processing (data sufficiency and degree seeking) and any other subsetting criteria that define a bloc. The two most common join operations to add grouping variables are:

Other variables too can be usefully joined for grouping operations. We include examples from student, term, and degree.


Load data

Start.   If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(midfielddata)
library(data.table)

Load.   Practice datasets. View data dictionaries via ?student, ?term, ?degree.

# Load practice data
data(student, term, degree)

Loads with midfieldr.   Prepared data. View data dictionaries via ?study_programs, ?baseline_mcid.

Initial processing

Select (optional).   Reduce the number of columns. Code reproduced from Getting started.

# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)

# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)

# Working data frame
DT <- copy(baseline_mcid)

We join a CIP variable for examples that join by CIP.

# Reusable starting state with CIP
baseline_cip <- term[DT, .(mcid, cip6), on = c("mcid")]
baseline_cip <- unique(baseline_cip)
baseline_cip

Program labels

At this point in a typical workflow, we have a bloc of student-level records in progress and a data frame of program labels (similar to study_programs). Both data frames have a 6-digit CIP variable to join by.

Program labels serve two main functions:

Rationale for the inner join.   An inner join accomplishes two tasks: adds a column of program labels to the bloc; and filters the bloc to retain only those observations with CIPs matching the desired programs.

Reset   The CIP baseline data frame we preserved earlier is the intake for this section.

# Reusable starting state
DT <- copy(baseline_cip)

Filter.   An inner join adds the program label and filters for matching CIPs.

# Join program labels via inner join
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT

We can see the filtering effect by noting that the baseline data frame had r nrow(baseline_cip) observations while the inner join on the selected programs returned r nrow(DT) observations. We can also verify the selected programs, e.g.,

# Verify program labels
sort(unique(DT$program))

# Verify program CIP codes
sort(unique(DT$cip6))

Students can migrate between majors having different 6-digit CIP codes, but those codes might be grouped under a single program label. A common example in Engineering is the "Industrial/Systems Engineering" label we assign to following CIP codes:

A student migrating among these CIPs would appear in multiple rows in the current bloc, yet we would not consider their change of CIP a change of major. The next step addresses this anomaly.

Select.   Drop the CIP code.

# Prepare to filter
DT[, cip6 := NULL]
DT

Filter.   Filter for unique observations.

# Case study ever enrolled
DT <- unique(DT)
DT

The difference in the number of observations indicates that this example includes one student in ISE with two CIPs. We examine that student's record below.

Closer look

#| echo: false
# find student ID with two CIPs in ISE
# x <- study_programs[term, on = "cip6", nomatch = NULL]
# x <- x[program == "ISE"]
# x <- x[, .(mcid, cip6, program)]
# x <- unique(x)
# x[duplicated(x$mcid)]
# x[mcid == "MCID3111251565"]

Examining the records of selected students in detail.

Example 1.   This student's term record includes CIP 142701 (Systems Engineering) and CIP 143501 (Industrial Engineering), both of which are majors in our combined Industrial/Systems Engineering (ISE) major, illustrating our rationale for filtering for unique observations by ID/program pairs and not ID/CIP pairs.

# All terms, one ID
x <- term[mcid == "MCID3111251565", .(mcid, cip6)]

# Join case study program labels
x <- study_programs[x, on = c("cip6"), nomatch = NULL]

# Unique CIPs for this student
unique(x)

Demographics

Demographic variables (race/ethnicity and sex) are regularly left-joined to blocs for grouping and summarizing.

Reset   The data frame of baseline IDs is the intake for this section.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

Select.   From student, select the join-by variable (student ID) and the variables we want to add. By selecting columns here, we don't have to select columns in the join operation to follow.

# Extract desired columns
cols_we_want <- student[, .(mcid, race, sex)]
cols_we_want

Add variables.   Left join two data frames, retaining all variables from both.

# Add demographics
DT <- cols_we_want[DT, on = c("mcid")]
DT

Unknown race or sex

We often want to remove records for which race/ethnicity or sex are "unknown".

# Display values
unique(DT$race)
unique(DT$sex)

Filter.   In data.table syntax, we can use x %ilike% pattern as a case-insensitive wrapper around grepl() to find matches and partial matches.

# Remove records with unknown sex, if any
x <- copy(DT)
x <- x[!sex %ilike% "unknown"]
x

Removing unknown race observations is similar.

# Remove records with unknown sex, if any
x <- x[!race %ilike% "unknown"]
x

Alternatively, these statements can be combined.

# Remove unknowns in either of two columns
DT <- DT[!(sex %ilike% "unknown" | race %ilike% "unknown")]

# Verify equivalence
check_equiv_frames(x, DT)

With "unknowns" removed, the race and sex values are:

sort(unique(DT$race))
sort(unique(DT$sex))

Add origin

origin is a demographic variable we use to distinguish "domestic" students from "international" students. The variable is a recoding of the race variable.

Add a variable.   Assuming that race/ethnicity "unknown" have been removed, we use a conditional assignment to create the "origin" variable.

# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]
x[]

With "unknowns" removed, the origin values are:

sort(unique(x$origin))

Add people

people is a demographic variable we use in many of our summaries. The variable combines the race and sex variables.

Add a variable.   We combine race/ethnicity and sex to create a grouping variable.

x <- copy(DT)
x <- x[, people := paste(race, sex)]
x

With "unknowns" removed, the people values are:

sort(unique(x$people))

Add people by origin

Combining the two ideas above, again assuming that the observations on unknown race/ethnicity and sex have been removed,

# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]

# Combine with sex
x[, people := paste(origin, sex)]

# Omit unnecessary variables
x <- x[, .(mcid, people)]
x

The possible people values are:

sort(unique(x$people))

Other variables

Depending on one's research question, any number of MIDFIELD variables might be used for grouping records. In this section we illustrate joining other variables from student, term, and degree to a working data frame.

We use the original source files copied earlier because some variables we want to use were removed when we applied select_required().

From student

Reset   Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

The available variables in the source student data are:

# Variables in the practice data set
names(source_student)

Select.   Select our variables and the key (ID).

# Extract desired columns
cols_we_want <- source_student[, .(mcid, transfer, hours_transfer)]

Add variables.   Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]

From term

Reset   Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)

The available variables in the source term data are:

# Variables in the practice data set
names(source_term)

Select.   Select our variables and the key (ID).

# Extract desired columns
cols_we_want <- source_term[, .(mcid, term, hours_term, gpa_term)]

Add variables.   Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]
#| echo: false

x <- cols_we_want[DT, on = c("mcid")]
y <- nrow(x)
z <- length(unique(x$mcid))

Rows in X with multiple matches in Y.   Consistent with the left join matching rules, students in enrolled in multiple terms will have multiple rows in the joined data frame. Thus this result has r y observations of r z unique students.

From degree

Reset   Reset the working data frame.

# Reusable starting state
DT <- copy(baseline_mcid)
DT

The available variables in the source degree data are:

# Variables in the practice data set
names(source_degree)

Select.   Select two variables and the key (ID).

# Extract desired columns
cols_we_want <- source_degree[, .(mcid, cip6, degree)]

Add variables.   Left join to add new columns.

# Add desired columns
cols_we_want[DT, on = c("mcid")]
#| echo: false
x <- cols_we_want[DT, on = c("mcid")]
y <- length(unique(x$mcid))
z <- x[!is.na(cip6)]
z <- z[, .(mcid)]
z <- length(unique(z$mcid))

Rows in X with no match in Y.   Consistent with the left join matching rules, students in DT who do not graduate will have NA values in the term_degree and cip6 columns of the joined data tables. Thus this result has r y unique students of whom r z earned degrees.

Reusable code

Program labels preparation.   The CIP baseline data frame we preserved earlier is the intake for this section.

DT <- copy(baseline_cip)

Program labels.   A summary code chunk for ready reference. In gathering a bloc of starters, the join-by variable might be start instead of cip6.

# Filter by program
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT[, cip6 := NULL]
DT <- unique(DT)

Demographics preparation.   The data frame of baseline IDs is the intake for this section.

DT <- copy(baseline_mcid)

Demographics.   A summary code chunk for ready reference.

# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]

References




MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.