In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'

#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-003-")

Part 3 of a case study in three parts, illustrating how we work with longitudinal student-level records.

Goals. Introducing the study.
Data. Transforming the data to yield the observations of interest.
Results. Summary statistics, metric, chart, and table.

Method

Our goal in this segment is to group and summarize the observations we saved previously, calculate the stickiness metric, and display the results.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library(midfieldr)
library(data.table)
library(ggplot2)

Loads with midfieldr. Prepared data. View data dictionary via ?study_observations.

study_observations (derived in Case study: Data).

Group and summarize

Initialize. Assign a working data frame.

# Working data frame
DT <- copy(study_observations)
DT

The study observations data frame is designed with one column for each grouping variable: race, sex, program, and bloc.

Summarize. Count the numbers of observations for each combination of the grouping variables.

# Group and summarize
DT <- DT[, .N, by = c("bloc", "program", "race", "sex")]
DT

Reshape

Reshape. Transform from block-record form to row-record form to set up the stickiness metric calculation. Transforms the N column into two columns, one for ever-enrolled and one for graduates. This operation is essentially a transformation from block records to row records---a process known by a number of different names, e.g., pivot, crosstab, unstack, spread, or widen [@Mount+Zumel:2019:fluid-data]. This step leaves the graphing variables (program, race/ethnicity, and sex) in place.

# Prepare to compute metric
DT <- dcast(DT, program + sex + race ~ bloc, value.var = "N", fill = 0)
DT

Compute the metric

[ S = \frac{N_g}{N_e} ]

Create a variable. Compute stickiness.

# Compute the metric
DT[, stickiness := round(100 * graduates / ever_enrolled, 1)]
setkey(DT, NULL)
DT

Verify built-in data. To avoid deriving this data frame each time it is needed in other articles, the same information is provided in the study_results data frame included with midfieldr. Here we verify that the two data frames are identical.

# Demonstrate equivalence
check_equiv_frames(DT, study_results)

Prepare for dissemination

We take several additional steps to prepare the data for dissemination in tables or charts.

Filtering. To preserve the anonymity of the people involved, we remove observations with fewer than 10 graduates.

# Preserve anonymity
DT <- DT[graduates >= 10]

# Display the result
DT

Filtering. Let us assume that our study focuses on "domestic" students of known race/ethnicity. In that case, we omit observations labeled "International" and Other/Unknown".

# Filter by study design
DT <- DT[!race %chin% c("International", "Other/Unknown")]

# Display the result
DT

Creating variables. We have found it useful to report such data with a variable that combines race/ethnicity and sex.

# Create a variable
DT[, people := paste(race, sex)]
DT[, c("race", "sex") := NULL]
setcolorder(DT, c("program", "people"))

# Display the result
DT

Recoding values. Readers can more readily interpret our charts and tables if the programs are unabbreviated.

# Recode values for charts and tables
DT[, program := fcase(
  program %like% "CE", "Civil",
  program %like% "EE", "Electrical",
  program %like% "ME", "Mechanical",
  program %like% "ISE", "Industrial/Systems"
)]

# Display the result
DT

With one quantitative variable (stickiness) for every combination of the levels of two categorical variables (program and race/ethnicity/sex), these data are multiway data [@Cleveland:1993]. How one orders the categorical variables is critical for visualizing effects.

Conditioning. Convert the two categorical variables to ordered factors to support the ordering of rows and panels in the chart.

# Convert categorical variables to factors
DT <- order_multiway(DT,
  quantity = "stickiness",
  categories = c("program", "people"),
  method = "percent",
  ratio_of = c("graduates", "ever_enrolled")
)

# Display the result
DT

The column program_stickiness determines the order of the programs in the chart; people_stickiness determines the order of the race/ethnicity/sex groupings; the values in stickiness are the quantitative values to be graphed.

Charts

In the first multiway chart, the rows are programs and panels are people, facilitating comparisons of different program for a single group. Rows and panels are both ordered from bottom to top in order of increasing stickiness.

#| label: fig01
#| fig-asp: 0.8
#| fig-cap: "Figure 1: Stickiness with programs on rows."

ggplot(DT, aes(x = stickiness, y = program)) +
  facet_wrap(vars(people), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = people_stickiness), linetype = 2, color = "gray60") +
  geom_point() +
  labs(x = "Stickiness (%)", y = "")

The vertical reference line is the overall stickiness of the people in a panel.

Alternatively, we can consider the dual chart, swapping the roles of the panels and rows. Here the rows are people and panels are programs, facilitating comparisons of different people within a program. Over many years of publishing research using MIDFIELD data, placing people on the rows of the multiway chart has been perhaps our most frequently used design.

#| label: fig02
#| fig-asp: 0.8
#| fig-cap: "Figure 1: Stickiness with programs in panels."

ggplot(DT, aes(x = stickiness, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_stickiness), linetype = 2, color = "gray60") +
  geom_point() +
  labs(x = "Stickiness (%)", y = "")

The vertical reference line is the overall stickiness of the program in a panel.

The chart illustrates the importance of ordering the rows and panels. We would conclude that Industrial/Systems Engineering is the stickiest program of the four, followed by Civil, Mechanical, and Electrical in descending order.

Because rows are ordered, one expects a generally increasing trend within a panel. A response greater or smaller than expected creates a visual asymmetry. For example, Asian students are asymmetrically lower in Industrial/Systems Engineering.

Tables

Data tables are often needed for publication. In this example, we format the data in a conventional row-record form with the groups of people in the first column labeling the rows and the program names labeling the remaining columns.

# Select the columns I want for the table
tbl <- DT[, .(program, people, stickiness)]

# Change factors to characters so rows/columns can be alphabetized
tbl[, people := as.character(people)]
tbl[, program := as.character(program)]

# Transform from block records to row records
tbl <- dcast(tbl, people ~ program, value.var = "stickiness")

# Edit one column header
setnames(tbl, old = "people", new = "People", skip_absent = TRUE)

Groups with numbers below our reporting threshold are denoted NA or omitted.

#| echo: false
library(gt)
tbl |>
  gt() |>
  tab_caption("Table 1. Progrm stickiness (%)") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )