knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.height = 6,
  fig.width = 7,
#  fig.path = "fig/tidycat-",
  dev = "png",
  comment = "##"
)
library(MASS)
library(vcdExtra)

Frequency tables need some Tidy Love ❤️

Tidy methods for quantitative data & models have advanced considerably, but there hasn't been much development of similar ideas for "categorical data", by which I mean data that is often compactly represented as $n$-way frequency tables, cross classified by one or more discrete factors.

What would it take to implement a tidy framework for such data? These notes are, in effect, a call for participation in developing a tidyCat package for this purpose. Other possible names for this: tidyCDA, tidyfreq ...

I see three areas that could be developed here:

Constructing categorical data sets

Current non-tidy data forms and operations, following [@FriendlyMeyer:2016:DDAR] are described in the vignette Creating and manipulating frequency tables

It seems clear that the most flexible and general form, and one that most closely matches a tidy data frame is case form, because this allows for numeric variables as well. Thus:

Among these, tidy categorical data should best be represented as a tibble in case form. A tibble is convenient for its' printing.

Manipulating categorical data sets

The methods xtabs(), table() and expand.dft() described in that vignette allow conversion from one form to another. Tidy equivalents might be:

data("HairEyeColor")
hec.df <- as.data.frame(HairEyeColor)
head(hec.df)
# expand to case form
expand.dft(hec.df) |> head()
structable(Titanic)
structable(Sex + Class ~ Survived + Age, data = Titanic)

and there are a suite of methods for indexing and selecting parts of an $n$-way table.

Manipulating factor levels

Also needed:

data(Glass, package="vcdExtra")
str(Glass)
(glass.tab <- xtabs(Freq ~ father + son, data=Glass))

This can be reordered manually by indexing, to arrange the categories by status, giving an order Professional down to Unskilled:

# reorder by status
ord <- c(2, 1, 4, 3, 5) 
glass.tab[ord, ord]

A more general method is to permute the row and column categories in the order implied by correspondence analysis dimensions. This is implemented in the seriation package using the CA method of seriation::seriate() and applying permute() to the result.

library(seriation)
order <- seriate(glass.tab, method = "CA")
# the permuted row and column labels
rownames(glass.tab)[order[[1]]]

# reorder rows and columns
permute(glass.tab, order)

What are tidy ways to do these things?

Models

The standard analysis of frequency data is in the form of loglinear models fit by MASS::loglm() or with more flexible versions fit with glm() or gnm::gnm(). These are essentially linear models for the log of frequency. In vcdExtra, there are several methods for a list of such models, of class "glmlist" and "loglmlist", and these should be accommodated in tidy methods.

What is needed are broom methods for loglm models. The information required is accessible from standard functions, but not in a tidy form.

hec.indep <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.indep
# extract test statistics
summary(hec.indep)$tests
LRstats(hec.indep)
coef(hec.indep)
fitted(hec.indep)
residuals(hec.indep)

What about hatvalues? Not implemented, but shouldn't be too hard.

hatvalues(hec.indep)

Graphical methods

The most common graphical methods are those implemented in vcd: mosaic() association plots (assoc()), ..., which rely on vcd::strucplot() described in [@vcd:Meyer+Zeileis+Hornik:2006b].

Is there a tidy analog that might work with ggplot2? The ggmosaic package implements basic marimeko-style mosaic plots. They are not very general, in that they cannot do residual-based shading to show the patterns of association.

However, they are based on a productplots package by Hadley, which seems to provide some basic structure for constructing such displays of nested rectangles.

References



friendly/vcdExtra documentation built on June 12, 2025, 4:18 p.m.