In MIDFIELDR/midfieldr: Tools and Methods for Working with MIDFIELD Data in 'R'


In working with longitudinal student-level records, we regularly encounter data structured as multiway data. We explore that data visually using multiway dot plots as described by William Cleveland [-@Cleveland:1993, 302--306]. Quotations, unless noted otherwise, are from this source.

Note that "multiway" in our context refers to the data structure and chart design defined by Cleveland, not to the methods of analysis described by Kroonenberg [-@Kroonenberg:2008].

This vignette's place in the MIDFIELD workflow:

1. Planning
2. Initial processing
3. Blocs
4. Groupings
5. Metrics
6. [Displays]{.accent}
    - [Multiway charts]{.accent}
    - [Tables]{.accent}

Definitions

multiway superposition

: Multiway data can be extended to include a third category of p levels; the quantitative response has length mnp, one for each combination of levels of three categories; the rows and panels encode the first two categories as usual; p data markers encode the third category on each row. Clarity usually requires that p = 2 but not more.
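
As a rough illustration of the m × n × p structure described above, the following base R sketch builds every combination of levels of three categories. The level names and counts here are hypothetical, chosen only for illustration; they are not the midfieldr data.

```r
# Hypothetical levels: m = 4 programs, n = 3 race/ethnicity groups, p = 2 sexes
programs <- c("Civil", "Electrical", "Industrial/Systems", "Mechanical")
races <- c("Group A", "Group B", "Group C")
sexes <- c("Female", "Male")

# One row per combination of levels of the three categories
grid <- expand.grid(
  program = programs,
  race = races,
  sex = sexes,
  stringsAsFactors = FALSE
)

# The response needs one value per row, so its length is m * n * p
n_rows <- nrow(grid)
n_rows == length(programs) * length(races) * length(sexes)
```

In a superposed chart of such data, program and race would map to rows and panels, and the two levels of sex would appear as two data markers on each row.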

Method

We start with the results data frame from the Case study: Results vignette, containing data from four engineering programs (Civil, Electrical, Industrial/Systems, and Mechanical Engineering) grouped by program, race/ethnicity, and sex. These data have been filtered for data sufficiency, degree seeking, and program, and graduates are filtered for timely completion.

We prepare the data for use as input to order_multiway() and use the results to construct multiway charts ordered by category median values and by category percentage values.

Load data

Start. If you are writing your own script to follow along, we use these packages in this article:

library("midfieldr")
library("data.table")
library("ggplot2")

Loads with midfieldr. Prepared data. View data dictionary via ?study_results.

study_results (derived in Stickiness).

Initial processing

Initialize. Assign a working data frame.

# Working data frame
DT <- copy(study_results)

Filter. Human subject privacy is potentially at risk for small populations even with anonymized observations. Therefore, before tabulating or graphing the data for dissemination, we omit observations with fewer than 10 graduates. The magnitude of the bound (graduates >= 10) can vary depending on one's data.

# Protecting privacy of small populations
DT <- DT[graduates >= 10]

Preparing the categorical variables

Before we apply the order_multiway() function, we edit the categorical variables to create the forms we want in the final charts or tables.

Recode. The first multiway categorical variable is program. To improve the readability of the charts, we recode the program abbreviations.

# Recode for panel and row labels
DT[, program := fcase(
  program %like% "CE", "Civil",
  program %like% "EE", "Electrical",
  program %like% "ME", "Mechanical",
  program %like% "ISE", "Industrial/Systems"
)]
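
One thing worth checking after a recode like this: `fcase()` returns `NA` for any row that matches none of the conditions when no default is supplied. The sketch below uses a hypothetical toy table (the abbreviation "XX" is invented for illustration) to show how unmatched values could be detected.

```r
library("data.table")

# Hypothetical abbreviations, including one ("XX") that no condition matches
toy <- data.table(program = c("CE", "EE", "ISE", "ME", "XX"))

toy[, label := fcase(
  program %like% "CE", "Civil",
  program %like% "EE", "Electrical",
  program %like% "ME", "Mechanical",
  program %like% "ISE", "Industrial/Systems"
)]

# Rows that matched no condition are NA and would need attention
unmatched <- toy[is.na(label), program]
unmatched
```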

Create a variable. We combine race and sex into a single categorical variable (denoted people) as our second, independent categorical variable.

# Create a new category
DT[, people := paste(race, sex)]
setcolorder(DT, c("program", "people", "race", "sex"))
DT

At this point, the multiway categories (program and people) are of class "character".

`order_multiway()`

Converts the categorical variables to factors ordered by the quantitative variable.

Arguments.

- dframe: Data frame with multiway data in columns. Two additional numeric columns are required when using the percentage ordering method.
- quantity: Name (in quotes) of the single multiway quantitative variable.
- categories: Vector of names (in quotes) of the two multiway categorical variables.
- method: "median" (default) or "percent", the method of ordering the levels of the categories. Argument to be used by name.
- ratio_of: Vector with the names (in quotes) of the numerator and denominator columns that produced the quantitative variable; required when using the percentage ordering method. Argument to be used by name.

Equivalent usage. The following implementations yield identical results:

# Required arguments in order and explicitly named
x <- order_multiway(
  dframe = DT,
  quantity = "stickiness",
  categories = c("program", "people"),
  method = "median"
)

# Required arguments in order, but not named
y <- order_multiway(DT, "stickiness", c("program", "people"), method = "median")

# Using the implicit default for method
z <- order_multiway(DT, "stickiness", c("program", "people"))

# Demonstrate equivalence
check_equiv_frames(x, y)
check_equiv_frames(x, z)

Output. Adds two columns to the data frame containing the computed values that determine the ordering of factors. The column names and values depend on the ordering method:

- method = "median": Yields medians of the quantitative variable grouped by the categorical variables.
- method = "percent": Yields percentages based on the same ratio that produces the quantitative variable but grouped by the categorical variables.
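
To make the idea of median ordering concrete, here is a minimal base R sketch of the general technique: compute a median per level, then set the factor levels in increasing order of those medians. It is an illustration under assumed toy data, not the internal implementation of order_multiway().

```r
# Hypothetical multiway data: 2 programs x 3 groups
toy <- data.frame(
  program = rep(c("P1", "P2"), each = 3),
  people = rep(c("G1", "G2", "G3"), times = 2),
  graduates = c(40, 10, 25, 300, 80, 120)
)

# Median of the quantitative variable for each level of each category
program_medians <- tapply(toy$graduates, toy$program, median)
people_medians <- tapply(toy$graduates, toy$people, median)

# Factor levels sorted by increasing median
toy$program <- factor(toy$program, levels = names(sort(program_medians)))
toy$people <- factor(toy$people, levels = names(sort(people_medians)))

levels(toy$program)
levels(toy$people)
```

With these toy values the people medians are 170 (G1), 45 (G2), and 72.5 (G3), so the levels come out as G2, G3, G1 rather than in alphabetical order.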

Median-ordered data

For this example, we select the count of graduates (graduates) as our quantitative variable and use order_multiway() to order the categories by median numbers of graduates.

To minimize the number of columns in the printout, we select the three multiway variables and drop other columns.

# Select multiway variables when quantity is count
DT_count <- copy(DT)
DT_count <- DT_count[, .(program, people, graduates)]
DT_count

Applying order_multiway(), we specify "graduates" as the quantitative column, "program" and "people" as the two categorical columns, and "median" as the method of ordering levels.

# Convert categories to factors ordered by median
DT_count <- order_multiway(DT_count,
  quantity = "graduates",
  categories = c("program", "people"),
  method = "median"
)
DT_count

The function adds two columns (program_median and people_median) to display the computed median values used to order the factors. In the median method, each new column name combines a category variable name (from categories) with the suffix _median.

For example, the results show that the median number of Civil Engineering graduates is `r unique(DT_count[program == "Civil", (program_median)])` and that the median number of Asian Female graduates is `r unique(DT_count[people == "Asian Female", (people_median)])`. We confirm these results by computing the median values independently.

The following values agree with those in the program_median variable above:

# Verify order_multiway() output
temp <- DT_count[, lapply(.SD, median), .SDcols = c("graduates"), by = c("program")]
temp

And the next result agrees with the values in people_median.

# Verify order_multiway() output
temp <- DT_count[, lapply(.SD, median), .SDcols = c("graduates"), by = c("people")]
temp

Below we demonstrate that both categories are of class "factor": program is a factor with `r nlevels(DT_count[, program])` levels and people is a factor with `r nlevels(DT_count[, people])` levels. Neither is ordered alphabetically; ordering is by increasing median value, as expected.

# Verify first category is a factor
class(DT_count$program)
levels(DT_count$program)

# Verify second category is a factor
class(DT_count$people)
levels(DT_count$people)

Median-ordered charts

We use conventional ggplot2 functions to create the multiway graphs.

We create a set of axis labels and scale specifications for a series of median-ordered charts. We use a logarithmic scale in this case because the numbers span three orders of magnitude.

# Common x-scale and axis labels for median-ordered charts
common_scale_x_log10 <- scale_x_log10(
  limits = c(3, 1000),
  breaks = c(3, 10, 30, 100, 300, 1000),
  minor_breaks = c(seq(3, 10, 1), seq(20, 100, 10), seq(200, 1000, 100))
)
common_labs <- labs(
  x = "Number of graduates (log base 10 scale)",
  y = "",
  title = "Engineering graduates"
)
ref_line_color <- "gray60"
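
A quick way to judge whether a logarithmic scale is warranted is to measure how many orders of magnitude the values span. The sketch below uses hypothetical counts, not the study_results values.

```r
# Hypothetical graduate counts (not the study_results values)
counts <- c(12, 35, 140, 610, 980)

# Number of orders of magnitude spanned by the data
orders <- diff(log10(range(counts)))
orders

# On a log10 axis, equal ratios occupy equal distances:
# 10 to 100 and 100 to 1000 are each one unit apart
equal_steps <- diff(log10(c(10, 100, 1000)))
equal_steps
```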

The first of two multiway charts encodes programs by rows and people by panels. Rows follow the factor level order on the y-axis, which increases from bottom to top. The as.table = FALSE argument lays out the panels in "graph order" as well, that is, increasing from left to right and from bottom to top. The panel median value is drawn as a vertical reference line in each panel.

#| label: fig01
#| fig-asp: 0.8
#| fig-cap: "Figure 1. Rows and columns ordered by median values."

# Two columns of panels
ggplot(DT_count, aes(x = graduates, y = program)) +
  facet_wrap(vars(people), ncol = 2, as.table = FALSE) +
  geom_vline(aes(xintercept = people_median), linetype = 2, color = ref_line_color) +
  common_scale_x_log10 +
  common_labs +
  geom_point()

The programs are assigned to rows such that the program medians increase from bottom to top. Industrial/Systems has the smallest median; Mechanical Engineering the largest.

We drew the chart above in two columns to illustrate the graph order of panels. Asian Female students have the smallest median number of graduates, followed by International Female, Other/Unknown Male, Black Male, etc.

When space permits, however, laying out the panels in a single column can be useful for seeing effects. Here, we redraw the panels in one column.

#| label: fig02
#| fig-asp: 1.3
#| fig-cap: "Figure 2. Redraw the panels in one column."

# Programs encoded by rows
ggplot(DT_count, aes(x = graduates, y = program)) +
  facet_wrap(vars(people), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = people_median), linetype = 2, color = ref_line_color) +
  common_scale_x_log10 +
  common_labs +
  geom_point()

Reading a multiway graph

We can more effectively compare values within a panel than between panels.
Because rows are ordered, one expects a generally increasing trend within a panel. A response greater or smaller than expected creates a visual asymmetry. The interesting stories are often in these visual anomalies.
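
The expectation of an increasing trend can also be checked numerically. The following base R sketch flags rows whose value falls below that of the preceding row in a single panel. It is a simple heuristic on hypothetical data, offered as an assumption about how one might screen for anomalies, not a procedure taken from Cleveland or from midfieldr.

```r
# Hypothetical panel: rows already ordered by increasing row median
panel <- data.frame(
  row = factor(c("R1", "R2", "R3", "R4"), levels = c("R1", "R2", "R3", "R4")),
  value = c(20, 55, 30, 90)
)

# Sort by row order, then look at successive differences
panel <- panel[order(panel$row), ]
steps <- diff(panel$value)

# A negative step marks a row that is smaller than the row ordered just before it
breaks_trend <- as.character(panel$row[-1][steps < 0])
breaks_trend
```

Here R3 is flagged because its value (30) is below that of R2 (55), the kind of departure that shows up as a visual asymmetry in the chart.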

For example, the White Female panel shows a clear separation between two groupings of majors, Mechanical and Civil compared to Electrical and Industrial/Systems.

However, this chart does not permit us to effectively compare the eight values for a given program. For that we create a second multiway chart in which we switch the aesthetic roles of the categories, in this example by encoding people by rows and programs by panels.

#| label: fig03
#| fig-asp: 1.1
#| fig-cap: "Figure 3. Switching the row and column assignments of categorical variables."

# People encoded by rows
ggplot(DT_count, aes(x = graduates, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_median), linetype = 2, color = ref_line_color) +
  common_scale_x_log10 +
  common_labs +
  geom_point()

In this chart, the visual asymmetry that stands out most is White Female graduates in Electrical Engineering, whose count is low given the group's overall rank.

Avoid alphabetical order

In the next figure, the same data are plotted in alphabetical order, which reveals none of the effects seen in the previous chart. An ordering scheme based on the values of the quantitative variable is necessary if a multiway chart is to reveal how the response is affected by the categories.

#| label: fig04
#| fig-asp: 1.1
#| fig-cap: "Figure 4. Alphabetical ordering conceals patterns in the data."

# Create alphabetical ordering
DT_alpha <- copy(DT)
DT_alpha[, people := factor(people, levels = sort(unique(people), decreasing = TRUE))]

# People encoded by rows, alphabetically
ggplot(DT_alpha, aes(x = graduates, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = TRUE) +
  common_scale_x_log10 +
  common_labs +
  geom_point()

Multiway superposition

To illustrate superposing data, we return to the data set with separate columns for race/ethnicity and sex. Let's use graduates as our quantitative variable and omit unnecessary variables.

# Select multiway variables with a superposed category
DT_count <- copy(DT)
DT_count <- DT_count[, .(program, race, sex, graduates)]
DT_count

The superposed category is sex. The multiway data to be conditioned are graduates, the quantitative variable, and program and race, the two categorical variables.

# Convert categories to factors ordered by median
DT_count <- order_multiway(DT_count,
  quantity = "graduates",
  categories = c("program", "race")
)
DT_count

In this example, program and race are factors ordered by median number of graduates, while sex remains an unordered character variable.

Using conventional ggplot syntax, the aesthetics include x and y as before. We superpose data markers for sex in rows by assigning color = sex inside the aes() function.

#| label: fig05
#| fig-asp: 0.8
#| fig-cap: "Figure 5. Using superposition to display three categories."

# Race/ethnicity encoded by rows, sex superposed
ggplot(DT_count, aes(x = graduates, y = race, color = sex)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_median), linetype = 2, color = ref_line_color) +
  common_scale_x_log10 +
  common_labs +
  geom_point(size = 2) +
  scale_color_manual(values = c("#004488", "#DDAA33"))

By superposing data by sex, we facilitate a direct comparison of Male and Female students within a program and by race.

Swapping rows and panels yields the next chart, in which we can directly compare Male and Female students within their race/ethnicity category across programs. Because men tend to outnumber women in engineering programs, this chart clearly shows clusters by sex.

#| label: fig06
#| fig-asp: 0.9
#| fig-cap: "Figure 6. Switching the row and column assignments of two categorical variables."

# Program encoded by rows, sex superposed
ggplot(DT_count, aes(x = graduates, y = program, color = sex)) +
  facet_wrap(vars(race), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = race_median), linetype = 2, color = ref_line_color) +
  common_scale_x_log10 +
  common_labs +
  geom_point(size = 2) +
  scale_color_manual(values = c("#004488", "#DDAA33"))

Percentage-ordered data

For persistence metrics such as stickiness or graduation rate, the quantitative variable is a ratio or percentage. Here, we return to the original case study results and select stickiness (stickiness) as the quantitative variable.

# Select multiway variables when quantity is a percentage
options(datatable.print.topn = 3)
DT_ratio <- copy(DT)
DT_ratio[, c("race", "sex") := NULL]
DT_ratio

Because stickiness is a ratio, we set method to "percent" and assign graduates and ever_enrolled to the ratio_of argument. order_multiway() then sums the ever_enrolled and graduates counts by category and produces grouped percentages to order the category levels.

# Convert categories to factors ordered by group percentages
DT_ratio <- order_multiway(DT_ratio,
  quantity = "stickiness",
  categories = c("program", "people"),
  method = "percent",
  ratio_of = c("graduates", "ever_enrolled")
)
DT_ratio

The function again converts the categories to factors and adds two columns (program_stickiness and people_stickiness) to display the computed percentages used to order the factors. In the percentage method, each new column name combines a category variable name (from categories) with the quantitative column name (from quantity).

For example, the results show that the stickiness of Civil Engineering (program_stickiness) is `r unique(DT_ratio[program == "Civil", (program_stickiness)])`%, and of Asian Females, `r unique(DT_ratio[people == "Asian Female", (people_stickiness)])`% (people_stickiness). We confirm these results by computing the group stickiness values independently.

The following values agree with those in the program_stickiness variable above:

# Verify order_multiway() output
temp <- DT[, lapply(.SD, sum), .SDcols = c("ever_enrolled", "graduates"), by = c("program")]
temp[, stickiness := round(100 * graduates / ever_enrolled, 1)]
temp

And the next result agrees with the values in people_stickiness.

# Verify order_multiway() output
temp <- DT[, lapply(.SD, sum), .SDcols = c("ever_enrolled", "graduates"), by = c("people")]
temp[, stickiness := round(100 * graduates / ever_enrolled, 1)]
temp
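
The grouped percentages checked above pool the counts before dividing, which is not the same as averaging the row percentages. A minimal sketch with hypothetical counts (not values from study_results) shows the difference.

```r
# Hypothetical counts for one category level across two rows
graduates <- c(10, 90)
ever_enrolled <- c(20, 100)

# Row-level stickiness, in percent
row_pct <- 100 * graduates / ever_enrolled

# Grouped percentage: sum the counts first, then divide
pooled_pct <- round(100 * sum(graduates) / sum(ever_enrolled), 1)

# Averaging the row percentages weights both rows equally
mean_pct <- mean(row_pct)

c(pooled = pooled_pct, mean_of_rows = mean_pct)
```

The pooled value is 100 * 100 / 120, about 83.3, while the mean of the row percentages is (50 + 90) / 2 = 70. Pooling gives larger groups proportionally more weight.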

Percentage-ordered charts

Here the quantitative variable is group stickiness. The first chart encodes programs by rows and people by panels. Row order is determined by program stickiness computed over all students; panel order is determined by people stickiness computed over all programs.

The order of rows and panels has changed from the earlier charts.

#| label: fig07
#| fig-asp: 1.3
#| fig-cap: "Figure 7. Rows and columns ordered by percentages."

# Programs encoded by rows
ggplot(DT_ratio, aes(x = stickiness, y = program)) +
  facet_wrap(vars(people), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = people_stickiness), linetype = 2, color = ref_line_color) +
  labs(x = "Stickiness", y = "", title = "Engineering stickiness") +
  geom_point()

The visual asymmetries in this chart that stand out are:

- Industrial/Systems, Asian Male, low stickiness given the program's overall rank.
- Civil, White Female, low stickiness given the program's overall rank.

Again, we cannot compare the eight values for a given program as effectively. This is done far better in the second chart that encodes people by rows and programs by panels.

#| label: fig08
#| fig-asp: 1.0
#| fig-cap: "Figure 8. Switching the row and column assignments of categorical variables."

# People encoded by rows
ggplot(DT_ratio, aes(x = stickiness, y = people)) +
  facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
  geom_vline(aes(xintercept = program_stickiness), linetype = 2, color = ref_line_color) +
  labs(x = "Stickiness", y = "", title = "Engineering stickiness") +
  geom_point()

This chart shows a lot of variability. The visual asymmetries that stand out are:

- Asian Female, Mechanical Engineering, high given the group's overall rank
- Asian Male and Female contrast, Civil

Tabulating counts

Readers and reviewers of charts often want to see the exact numbers represented by data markers. To serve that need, we tabulate multiway data after transforming it from block-record form (convenient for use with ggplot2) to row-record form---that is, from "long" to "wide" form.

To illustrate, let's tabulate the number of graduates by people and program. Start by selecting the desired variables only.

# Select the desired variables
tbl <- copy(DT)
tbl <- tbl[, .(program, people, graduates)]
tbl

Use dcast() to transform the block records to row records.

# Transform shape to row-record form
tbl <- dcast(tbl, people ~ program, value.var = "graduates")
tbl
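
Because the privacy filter can remove some combinations of levels, the wide table may contain gaps. The sketch below, on a hypothetical toy table, shows how dcast() represents an absent combination.

```r
library("data.table")

# Hypothetical long-form data; the (G2, P2) combination is absent,
# as it might be after the privacy filter removes a small group
long <- data.table(
  people = c("G1", "G1", "G2"),
  program = c("P1", "P2", "P1"),
  graduates = c(40L, 25L, 12L)
)

# One row per people level, one column per program level
wide <- dcast(long, people ~ program, value.var = "graduates")
wide

# The absent combination appears as NA in the wide table
is.na(wide[people == "G2", P2])
```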

Edit one column name and print the table.

# Edit column header
setnames(tbl, old = "people", new = "Group", skip_absent = TRUE)

#| echo: false
library(gt)
tbl |>
  gt() |>
  tab_caption("Table 1: Number of engineering graduates") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

Multiway data structure lends itself to tables of this type. The levels of one category are in the first column; the levels of the second category are in the table header; and the quantitative variable fills the cells---a response value for each combination of levels of the two categories.

Tabulating percentages

When tabulating percentages, readers and reviewers are likely to want the percentage values as well as the underlying ratios of integers. In this example, we suggest one way these values can be presented in a single table.

# Select the desired variables
tbl <- copy(DT)
tbl <- tbl[, .(program, people, graduates, ever_enrolled, stickiness)]
tbl

In this step, we build a character string for each cell: the number of students ever enrolled in parentheses, followed by the percentage stickiness, e.g., (16) 56.2%.

# Construct new cell values
# "\u00A0" is a non-breaking space between the count and the percentage
tbl[, results := paste0("(", ever_enrolled, ")", "\u00A0", round(stickiness, 1), "%")]
tbl
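
One detail to be aware of: pasting a rounded number drops a trailing zero, so a stickiness of exactly 50 prints as "50%" rather than "50.0%". If a fixed number of decimals is preferred, sprintf() is one alternative. The values below are hypothetical.

```r
# Hypothetical values: a fractional percentage and a whole-number one
ever_enrolled <- c(16L, 20L)
stickiness <- c(56.3, 50)

# paste0() with round() drops a trailing zero ("50%" rather than "50.0%")
with_round <- paste0("(", ever_enrolled, ")", "\u00A0", round(stickiness, 1), "%")

# sprintf() with "%.1f" always prints one decimal place;
# "%%" is a literal percent sign
with_sprintf <- sprintf("(%d)\u00A0%.1f%%", ever_enrolled, stickiness)

with_round
with_sprintf
```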

Now we can perform the transformation from block records to row records as we did above.

# Transform shape to row-record form
tbl <- dcast(tbl, people ~ program, value.var = "results", fill = NA_character_)
tbl

Edit one column name and print the table.

# Edit column header
setnames(tbl, old = "people", new = "Group", skip_absent = TRUE)

#| echo: false
tbl |>
  gt() |>
  tab_caption("Table 2: Four programs (N ever enrolled) percent stickiness") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )

References

MIDFIELDR/midfieldr documentation built on Jan. 28, 2025, 10:24 a.m.
