#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-003-")
Part 3 of a case study in three parts, illustrating how we work with longitudinal student-level records.
Goals. Introducing the study.
Data. Transforming the data to yield the observations of interest.
Results. Summary statistics, metric, chart, and table.
Our goal in this segment is to group and summarize the observations we saved previously, calculate the stickiness metric, and display the results.
Start. If you are writing your own script to follow along, we use these packages in this article:
library(midfieldr) library(data.table) library(ggplot2)
Loads with midfieldr. Prepared data. View data dictionary via ?study_observations
.
study_observations
(derived in Case study: Data). Initialize. Assign a working data frame.
# Working data frame DT <- copy(study_observations) DT
The study observations data frame is designed with one column for each grouping variable: race
, sex
, program
, and bloc
.
Summarize. Count the numbers of observations for each combination of the grouping variables.
# Group and summarize DT <- DT[, .N, by = c("bloc", "program", "race", "sex")] DT
Reshape. Transform from block-record form to row-record form to set up the stickiness metric calculation. Transforms the N column into two columns, one for ever-enrolled and one for graduates. This operation is essentially a transformation from block records to row records---a process known by a number of different names, e.g., pivot, crosstab, unstack, spread, or widen [@Mount+Zumel:2019:fluid-data]. This step leaves the graphing variables (program, race/ethnicity, and sex) in place.
# Prepare to compute metric DT <- dcast(DT, program + sex + race ~ bloc, value.var = "N", fill = 0) DT
[ S = \frac{N_g}{N_e} ]
Create a variable. Compute stickiness.
# Compute the metric DT[, stickiness := round(100 * graduates / ever_enrolled, 1)] setkey(DT, NULL) DT
Verify built-in data. To avoid deriving this data frame each time it is needed in other articles, the same information is provided in the study_results
data frame included with midfieldr. Here we verify that the two data frames are identical.
# Demonstrate equivalence check_equiv_frames(DT, study_results)
We take several additional steps to prepare the data for dissemination in tables or charts.
Filtering. To preserve the anonymity of the people involved, we remove observations with fewer than 10 graduates.
# Preserve anonymity DT <- DT[graduates >= 10] # Display the result DT
Filtering. Let us assume that our study focuses on "domestic" students of known race/ethnicity. In that case, we omit observations labeled "International" and Other/Unknown".
# Filter by study design DT <- DT[!race %chin% c("International", "Other/Unknown")] # Display the result DT
Creating variables. We have found it useful to report such data with a variable that combines race/ethnicity and sex.
# Create a variable DT[, people := paste(race, sex)] DT[, c("race", "sex") := NULL] setcolorder(DT, c("program", "people")) # Display the result DT
Recoding values. Readers can more readily interpret our charts and tables if the programs are unabbreviated.
# Recode values for charts and tables DT[, program := fcase( program %like% "CE", "Civil", program %like% "EE", "Electrical", program %like% "ME", "Mechanical", program %like% "ISE", "Industrial/Systems" )] # Display the result DT
With one quantitative variable (stickiness) for every combination of the levels of two categorical variables (program and race/ethnicity/sex), these data are multiway data [@Cleveland:1993]. How one orders the categorical variables is critical for visualizing effects.
Conditioning. Convert the two categorical variables to ordered factors to support the ordering of rows and panels in the chart.
# Convert categorical variables to factors DT <- order_multiway(DT, quantity = "stickiness", categories = c("program", "people"), method = "percent", ratio_of = c("graduates", "ever_enrolled") ) # Display the result DT
The column program_stickiness
determines the order of the programs in the chart; people_stickiness
determines the order of the race/ethnicity/sex groupings; the values in stickiness
are the quantitative values to be graphed.
In the first multiway chart, the rows are programs and panels are people, facilitating comparisons of different program for a single group. Rows and panels are both ordered from bottom to top in order of increasing stickiness.
#| label: fig01 #| fig-asp: 0.8 #| fig-cap: "Figure 1: Stickiness with programs on rows." ggplot(DT, aes(x = stickiness, y = program)) + facet_wrap(vars(people), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = people_stickiness), linetype = 2, color = "gray60") + geom_point() + labs(x = "Stickiness (%)", y = "")
The vertical reference line is the overall stickiness of the people in a panel.
Alternatively, we can consider the dual chart, swapping the roles of the panels and rows. Here the rows are people and panels are programs, facilitating comparisons of different people within a program. Over many years of publishing research using MIDFIELD data, placing people on the rows of the multiway chart has been perhaps our most frequently used design.
#| label: fig02 #| fig-asp: 0.8 #| fig-cap: "Figure 1: Stickiness with programs in panels." ggplot(DT, aes(x = stickiness, y = people)) + facet_wrap(vars(program), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = program_stickiness), linetype = 2, color = "gray60") + geom_point() + labs(x = "Stickiness (%)", y = "")
The vertical reference line is the overall stickiness of the program in a panel.
The chart illustrates the importance of ordering the rows and panels. We would conclude that Industrial/Systems Engineering is the stickiest program of the four, followed by Civil, Mechanical, and Electrical in descending order.
Because rows are ordered, one expects a generally increasing trend within a panel. A response greater or smaller than expected creates a visual asymmetry. For example, Asian students are asymmetrically lower in Industrial/Systems Engineering.
Data tables are often needed for publication. In this example, we format the data in a conventional row-record form with the groups of people in the first column labeling the rows and the program names labeling the remaining columns.
# Select the columns I want for the table tbl <- DT[, .(program, people, stickiness)] # Change factors to characters so rows/columns can be alphabetized tbl[, people := as.character(people)] tbl[, program := as.character(program)] # Transform from block records to row records tbl <- dcast(tbl, people ~ program, value.var = "stickiness") # Edit one column header setnames(tbl, old = "people", new = "People", skip_absent = TRUE)
Groups with numbers below our reporting threshold are denoted NA or omitted.
#| echo: false library(gt) tbl |> gt() |> tab_caption("Table 1. Progrm stickiness (%)") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.