#| include: false knitr::opts_chunk$set(fig.path = "../man/figures/art-120-")
In working with longitudinal student-level records, we regularly encounter data structured as multiway data. We explore that data visually using multiway dot plots as described by William Cleveland [-@Cleveland:1993, 302--306]. Quotations, unless noted otherwise, are from this source.
Note that "multiway" in our context refers to the data structure and chart design defined by Cleveland, not to the methods of analysis described by Kroonenberg [-@Kroonenberg:2008].
This vignette in the MIDFIELD workflow.
multiway superposition
: Multiway data can be extended to include a third category of p levels; the quantitative response has length mnp, one for each combination of levels of three categories; the rows and panels encode the first two categories as usual; p data markers encode the third category on each row. Clarity usually requires that p = 2 but not more.
We start with the results data frame from the Case study: Results vignette, containing data from four engineering programs (Civil, Electrical, Industrial/Systems, and Mechanical Engineering) grouped by program, race/ethnicity, and sex. These data have been filtered for data sufficiency, degree seeking, and program, and graduates are filtered for timely completion.
We prepare the data for use as input to order_multiway()
and use the results to construct multiway charts ordered by category median values and by category percentage values.
Start. If you are writing your own script to follow along, we use these packages in this article:
library("midfieldr") library("data.table") library("ggplot2")
Loads with midfieldr. Prepared data. View data dictionary via ?study_results
.
study_results
(derived in Stickiness). Initialize. Assign a working data frame.
# Working data frame DT <- copy(study_results)
Filter. Human subject privacy is potentially at risk for small populations even with anonymized observations. Therefore, before tabulating or graphing the data for dissemination, we omit observations with fewer than 10 graduates. The magnitude of the bound (graduates >= 10
) can vary depending on one's data.
# Protecting privacy of small populations DT <- DT[graduates >= 10]
Before we apply the order_multiway()
function, we edit the categorical variables to create the forms we want in the final charts or tables.
Recode. The first multiway categorical variable is program
. To improve the readability of the charts, we recode the program abbreviations.
# Recode for panel and row labels DT[, program := fcase( program %like% "CE", "Civil", program %like% "EE", "Electrical", program %like% "ME", "Mechanical", program %like% "ISE", "Industrial/Systems" )]
Create a variable. We combine race
and sex
into a single categorical variable (denoted people
) as our second, independent categorical variable.
# Create a new category DT[, people := paste(race, sex)] setcolorder(DT, c("program", "people", "race", "sex")) DT
At this point, the multiway categories (programs
and people
) are "character" class.
order_multiway()
Converts the categorical variables to factors ordered by the quantitative variable.
Arguments.
dframe
Data frame with multiway data in columns. Two additional numeric columns required when using the percentage ordering method.
quantity
Name (in quotes) of the single multiway quantitative variable.
categories
Vector of names (in quotes) of the two multiway categorical variables.
method
“median” (default) or “percent”, method of ordering the levels of the categories. Argument to be used by name.
ratio_of
Vector with the names (in quotes) of the numerator and denominator columns that produced the quantitative variable, required when using percentage ordering method. Argument to be used by name.
Equivalent usage. The following implementations yield identical results,
# Required arguments in order and explicitly named x <- order_multiway( dframe = DT, quantity = "stickiness", categories = c("program", "people"), method = "median" ) # Required arguments in order, but not named y <- order_multiway(DT, "stickiness", c("program", "people"), method = "median") # Using the implicit default for method z <- order_multiway(DT, "stickiness", c("program", "people")) # Demonstrate equivalence check_equiv_frames(x, y) check_equiv_frames(x, z)
Output. Adds two columns to the data frame containing the computed values that determine the ordering of factors. The column names and values depend on the ordering method:
method = "median"
Yields medians of the quantitative variable grouped by the categorical variables.
method = "percent"
Yields percentages based on the same ratio that produces the quantitative variable but grouped by the categorical variables.
For this example, we select the count of graduates (graduates
) as our quantitative variable and use order_multiway()
to order the categories by median numbers of graduates.
To minimize the number of columns in the printout, we select the three multiway variables and drop other columns.
# Select multiway variables when quantity is count DT_count <- copy(DT) DT_count <- DT_count[, .(program, people, graduates)] DT_count
Applying order_multiway()
, we specify "graduates"
as the quantitative column, "program"
and "people"
as the two categorical columns, and "median"
as the method of ordering levels.
# Convert categories to factors ordered by median DT_count <- order_multiway(DT_count, quantity = "graduates", categories = c("program", "people"), method = "median" ) DT_count
The function adds two columns (program_median
and people_median
) to display the computed median values used to order the factors. In the median method, the new column names are a combination of the category variable names (from categories
) plus median
.
For example, the results show that the median number of Civil Engineering graduates is r unique(DT_count[program == "Civil", (program_median)])
and that the median number of Asian Female graduates is r unique(DT_count[people == "Asian Female", (people_median)])
. We confirm these results by computing the median values independently.
The following values agree with those in the program_median
variable above,
# Verify order_multiway() output temp <- DT_count[, lapply(.SD, median), .SDcols = c("graduates"), by = c("program")] temp
And the next result agrees with the values in people_median
.
# Verify order_multiway() output temp <- DT_count[, lapply(.SD, median), .SDcols = c("graduates"), by = c("people")] temp
Below we demonstrate that both categories are "factor" class: program
is a factor with r nlevels(DT_count[, program])
levels; people
is a factor with r nlevels(DT_count[, people])
levels; and neither is ordered alphabetically---ordering is by increasing median value as expected.
# Verify first category is a factor class(DT_count$program) levels(DT_count$program) # Verify second category is a factor class(DT_count$people) levels(DT_count$people)
We use conventional ggplot2 functions to create the multiway graphs.
We create a set of axis labels and scale specifications for a series of median-ordered charts. We use a logarithmic scale in this case because the numbers span three orders of magnitude.
# Common x-scale and axis labels for median-ordered charts common_scale_x_log10 <- scale_x_log10( limits = c(3, 1000), breaks = c(3, 10, 30, 100, 300, 1000), minor_breaks = c(seq(3, 10, 1), seq(20, 100, 10), seq(200, 1000, 100)) ) common_labs <- labs( x = "Number of graduates (log base 10 scale)", y = "", title = "Engineering graduates" ) ref_line_color <- "gray60"
The first of two multiway charts encodes programs by rows and people by panels. The as.table = FALSE
argument places rows and panels in "graphical order", that is, increasing from left to right and from bottom to top. The panel median value is drawn as a vertical reference line in each panel.
#| label: fig01 #| fig.asp: 0.8 #| fig-cap: "Figure 1. Rows and columns ordered by median values." # Two columns of panels ggplot(DT_count, aes(x = graduates, y = program)) + facet_wrap(vars(people), ncol = 2, as.table = FALSE) + geom_vline(aes(xintercept = people_median), linetype = 2, color = ref_line_color) + common_scale_x_log10 + common_labs + geom_point()
The programs are assigned to rows such that the program medians increase from bottom to top. Industrial/Systems has the smallest median; Mechanical Engineering the largest.
We drew the chart above in two columns to illustrate the graph order of panels. Asian Female students have the smallest median number of graduates, followed by International Female, Other/Unknown Male, Black Male, etc.
When space permits, however, laying out the panels in a single column can be useful for seeing effects. Here, we redraw the panels in one column.
#| label: fig02 #| fig-asp: 1.3 #| fig-cap: "Figure 2. Redraw the panels in one column." # Programs encoded by rows ggplot(DT_count, aes(x = graduates, y = program)) + facet_wrap(vars(people), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = people_median), linetype = 2, color = ref_line_color) + common_scale_x_log10 + common_labs + geom_point()
Reading a multiway graph
For example, the White Female panel shows a clear separation between two groupings of majors, Mechanical and Civil compared to Electrical and Industrial/Systems.
However, this chart does not permit us to effectively compare the eight values for a given program. For that we create a second multiway in which we switch the aesthetic roles of the categories---in this example by encoding people by rows and programs by panels.
#| label: fig03 #| fig-asp: 1.1 #| fig-cap: "Figure 3. Switching the row and column assignments of categorical variables." # People encoded by rows ggplot(DT_count, aes(x = graduates, y = people)) + facet_wrap(vars(program), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = program_median), linetype = 2, color = ref_line_color) + common_scale_x_log10 + common_labs + geom_point()
In this chart, the visual asymmetry that stands out most is Electrical Engineering, White Female, low given their overall rank.
In the next figure, the same data are plotted in alphabetical order, which reveals none of the effects seen in the previous chart. An ordering scheme based on the values of the quantitative variable is necessary if a multiway chart is to reveal how the response is affected by the categories.
#| label: fig04 #| fig-asp: 1.1 #| fig-cap: "Figure 4. Alphabetical ordering conceals patterns in the data." # Create alphabetical ordering DT_alpha <- copy(DT) DT_alpha[, people := factor(people, levels = sort(unique(people), decreasing = TRUE))] # People encoded by rows, alphabetically ggplot(DT_alpha, aes(x = graduates, y = people)) + facet_wrap(vars(program), ncol = 1, as.table = TRUE) + common_scale_x_log10 + common_labs + geom_point()
To illustrate superposing data, we return to the data set with separate columns for race/ethnicity and sex. Let's use graduates
as our quantitative variable and omit unnecessary variables.
# Select multiway variables with a superposed category DT_count <- copy(DT) DT_count <- DT_count[, .(program, race, sex, graduates)] DT_count
The superposed category is sex
. The multiway data to be conditioned are
graduates
, the quantitative variable, and program
and race
, the two categorical variables.
# Convert categories to factors ordered by median DT_count <- order_multiway(DT_count, quantity = "graduates", categories = c("program", "race") ) DT_count
In this example, program
and race
are factors, ordered by median number of graduates while sex
remains an unordered character variable.
Using conventional ggplot syntax, the aesthetics include x
and y
as before. We superpose data markers for sex in rows by assigning color = sex
inside the aes()
function.
#| label: fig05 #| fig-asp: 0.8 #| fig-cap: "Figure 5. Using superposition to display three categories." # Race/ethnicity encoded by rows, sex superposed ggplot(DT_count, aes(x = graduates, y = race, color = sex)) + facet_wrap(vars(program), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = program_median), linetype = 2, color = ref_line_color) + common_scale_x_log10 + common_labs + geom_point(size = 2) + scale_color_manual(values = c("#004488", "#DDAA33"))
By superposing data by sex, we facilitate a direct comparison of Male and Female students within a program and by race.
Swapping rows and panels yields the next chart, in which we can directly compare Male and Female students within their race/ethnicity category across programs. Because men tend to outnumber women in engineering programs, this chart clearly shows clusters by sex.
#| label: fig06 #| fig-asp: 0.9 #| fig-cap: "Figure 6. Switching the row and column assignments of two categorical variables." # Program encoded by rows, sex superposed ggplot(DT_count, aes(x = graduates, y = program, color = sex)) + facet_wrap(vars(race), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = race_median), linetype = 2, color = ref_line_color) + common_scale_x_log10 + common_labs + geom_point(size = 2) + scale_color_manual(values = c("#004488", "#DDAA33"))
For persistence metrics such as stickiness or graduation rate, the quantitative variable is a ratio or percentage. Here, we return to the original case study results and select stickiness (stickiness
) as the quantitative variable.
# Select multiway variables when quantity is a percentage options(datatable.print.topn = 3) DT_ratio <- copy(DT) DT_ratio[, c("race", "sex") := NULL] DT_ratio
Because stickiness is a ratio, we set method
to "percent" and assign graduates
and ever_enrolled
to the ratio_of
argument. order_multiway()
then sums the ever_enrolled
and graduates
counts by category and produces grouped percentages to order the category levels.
# Convert categories to factors ordered by group percentages DT_ratio <- order_multiway(DT_ratio, quantity = "stickiness", categories = c("program", "people"), method = "percent", ratio_of = c("graduates", "ever_enrolled") ) DT_ratio
The function again converts the categories to factors and adds two columns (program_stickiness
and people_stickiness
) to display the computed percentages used to order the factors. In the percentage method, the new column names are a combination of the category variable names (from categories
) plus the quantitative column name (from x
).
For example, the results show that the stickiness of Civil Engineering (program_stickiness
) is r unique(DT_ratio[program == "Civil", (program_stickiness)])
%, and of Asian Females, r unique(DT_ratio[people == "Asian Female", (people_stickiness)])
% (people_stickiness
). We confirm these results by computing the group stickiness values independently.
The following values agree with those in the program_stickiness
variable above,
# Verify order_multiway() output temp <- DT[, lapply(.SD, sum), .SDcols = c("ever_enrolled", "graduates"), by = c("program")] temp[, stickiness := round(100 * graduates / ever_enrolled, 1)] temp
And the next result agrees with the values in people_stickiness
.
# Verify order_multiway() output temp <- DT[, lapply(.SD, sum), .SDcols = c("ever_enrolled", "graduates"), by = c("people")] temp[, stickiness := round(100 * graduates / ever_enrolled, 1)] temp
Here the quantitative variable is group stickiness. The first chart encodes programs by rows and people by panels. Row-order is determined by program stickiness computed over all students; panel order is determined by people stickiness computed over all programs.
The order of rows and panels has changed from the earlier charts.
#| label: fig07 #| fig.asp: 1.3 #| fig-cap: "Figure 7. Rows and column ordered by percentages." # Programs encoded by rows ggplot(DT_ratio, aes(x = stickiness, y = program)) + facet_wrap(vars(people), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = people_stickiness), linetype = 2, color = ref_line_color) + labs(x = "Stickiness", y = "", title = "Engineering stickiness") + geom_point()
The visual asymmetries in this chart that stand out are
Again, we cannot compare the eight values for a given program as effectively. This is done far better in the second chart that encodes people by rows and programs by panels.
#| label: fig08 #| fig.asp: 1.0 #| fig-cap: "Figure 8. Switching the row and column assignments of categorical variables." # People encoded by rows ggplot(DT_ratio, aes(x = stickiness, y = people)) + facet_wrap(vars(program), ncol = 1, as.table = FALSE) + geom_vline(aes(xintercept = program_stickiness), linetype = 2, color = ref_line_color) + labs(x = "Stickiness", y = "", title = "Engineering stickiness") + geom_point()
This chart shows a lot of variability. The visual asymmetries that stand out are
Readers and reviewers of charts often want to see the exact numbers represented by data markers. To serve that need, we tabulate multiway data after transforming it from block-record form (convenient for use with ggplot2) to row-record form---that is, from "long" to "wide" form.
To illustrate, let's tabulate the number of graduates by people and program. Start by selecting the desired variables only.
# Select the desired variables tbl <- copy(DT) tbl <- tbl[, .(program, people, graduates)] tbl
Use dcast()
to transform the block records to row records.
# Transform shape to row-record form tbl <- dcast(tbl, people ~ program, value.var = "graduates") tbl
Edit one column name and print the table.
# Edit column header setnames(tbl, old = "people", new = "Group", skip_absent = TRUE)
#| echo: false library(gt) tbl |> gt() |> tab_caption("Table 1: Number of engineering graduates") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
Multiway data structure lends itself to tables of this type. The levels of one category are in the first column; the levels of the second category are in the table header; and the quantitative variable fills the cells---a response value for each combination of levels of the two categories.
When tabulating percentages, readers and reviewers are likely to want the percentage values as well as the underlying ratios of integers. In this example, we suggest one way these values can be presented in a single table.
# Select the desired variables tbl <- copy(DT) tbl <- tbl[, .(program, people, graduates, ever_enrolled, stickiness)] tbl
In this step, we concatenate a character string with the number of students ever enrolled in parentheses followed by the percentage stickiness e.g., (16) 56.2
.
# Construct new cell values tbl[, results := paste0("\u0028", ever_enrolled, "\u0029", "\u00A0", round(stickiness, 1), "%")] tbl
Now we can perform the transformation from block records to row records as we did above.
# Transform shape to row-record form tbl <- dcast(tbl, people ~ program, value.var = "results", fill = NA_character_) tbl
Edit one column name and print the table.
# Edit column header setnames(tbl, old = "people", new = "Group", skip_absent = TRUE)
#| echo: false tbl |> gt() |> tab_caption("Table 2: Four programs (N ever enrolled) percent stickiness") |> tab_options(table.font.size = "small") |> opt_stylize(style = 1, color = "gray") |> tab_style( style = list(cell_fill(color = "#c7eae5")), locations = cells_column_labels(columns = everything()) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.