```r
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.height = 6,
  fig.width = 7,
  # fig.path = "fig/tidycat-",
  dev = "png",
  comment = "##"
)
```
```r
library(MASS)
library(vcdExtra)
```
Frequency tables need some Tidy Love ❤️
Tidy methods for quantitative data & models have advanced considerably, but there hasn't been much development of similar ideas for "categorical data", by which I mean data that is often compactly represented as $n$-way frequency tables, cross classified by one or more discrete factors.
What would it take to implement a tidy framework for such data? These notes are, in effect, a call for participation in developing a `tidyCat` package for this purpose. Other possible names for this: `tidyCDA`, `tidyfreq`, ...
I see three areas that could be developed here:
Current non-tidy data forms and operations, following [@FriendlyMeyer:2016:DDAR], are described in the vignette *Creating and manipulating frequency tables*.
It seems clear that the most flexible and general form, and the one that most closely matches a tidy data frame, is case form, because this also allows for numeric variables. Thus:

Among these, tidy categorical data is best represented as a `tibble` in case form. A tibble is convenient for its printing.
The methods `xtabs()`, `table()` and `expand.dft()` described in that vignette allow conversion from one form to another.
Tidy equivalents might be:

- `as_table()`, `as_matrix()`, `as_array()` to convert from any form to table/array form.
- Similarly, perhaps `as_caseform()`, `as_freqform()` to convert to those forms. There is already `as.data.frame()` for `table` objects to convert to frequency form, and `expand.dft()` converts that to case form.
data("HairEyeColor") hec.df <- as.data.frame(HairEyeColor) head(hec.df) # expand to case form expand.dft(hec.df) |> head()
`vcd::structable()` produces a ‘flat’ representation of a high-dimensional contingency table, constructed by recursive splits (similar to the construction of mosaic displays). One can be constructed from a table or from a data frame with a formula method, and there is a suite of methods for indexing and selecting parts of an $n$-way table.

```r
structable(Titanic)
structable(Sex + Class ~ Survived + Age, data = Titanic)
```
The methods in the `plyr` package (now retired) provided a coherent set of tools for a split-apply-combine strategy that works nicely with multidimensional arrays. Perhaps there are some useful ideas for frequency tables that could be resurrected here.

There is also a role for `purrr` methods and thinking here: $n$-way tables as nested lists/arrays? The idea of mapping over these?
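For example, here is a rough sketch of the split-apply-combine idea on an $n$-way table, using base `asplit()` together with `purrr::map()`; this only illustrates the pattern, not a proposed `tidyCat` API:

```r
library(purrr)

# split the 3-way HairEyeColor table along the Sex margin into a list of
# 2-way Hair x Eye slices, then apply a function to each slice
HairEyeColor |>
  asplit(3) |>
  map(\(tab) round(prop.table(tab), 2))
```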
Also needed:

- methods for recoding and collapsing the levels of a factor: `forcats::fct_recode()`, `forcats::fct_collapse()` and `forcats::fct_lump_min()` are useful here.
- methods for reordering the levels of a factor, either manually or for some analysis purpose. For example, data from @Glass:54 gave this 5 x 5 table on the occupations of 3500 British fathers and their sons, where the occupational categories are listed in alphabetic order.
```r
data(Glass, package = "vcdExtra")
str(Glass)
(glass.tab <- xtabs(Freq ~ father + son, data = Glass))
```
This can be reordered manually by indexing, to arrange the categories by status, giving an order `Professional` down to `Unskilled`:
```r
# reorder by status
ord <- c(2, 1, 4, 3, 5)
glass.tab[ord, ord]
```
A more general method is to permute the row and column categories in the order implied by correspondence analysis dimensions. This is implemented in the `seriation` package, using the `CA` method of `seriation::seriate()` and applying `permute()` to the result.
```r
library(seriation)
order <- seriate(glass.tab, method = "CA")

# the permuted row and column labels
rownames(glass.tab)[order[[1]]]

# reorder rows and columns
permute(glass.tab, order)
```
What are tidy ways to do these things?
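One possibility, sketched below, is to keep the data in frequency (or case) form and apply `forcats` to the factor columns before tabulating; the same pattern would work for `fct_recode()` or `fct_collapse()`. This reproduces the manual status ordering above and assumes `dplyr` is available; it is only a sketch:

```r
library(dplyr)
library(forcats)

# reorder the occupational categories by status in the frequency-form
# data frame, then re-tabulate; ord is the status order used above
ord <- c(2, 1, 4, 3, 5)
glass_ordered <- Glass |>
  mutate(across(c(father, son), \(f) fct_relevel(f, levels(f)[ord])))

xtabs(Freq ~ father + son, data = glass_ordered)
```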
The standard analysis of frequency data is in the form of loglinear models fit by `MASS::loglm()`, or with more flexible versions fit with `glm()` or `gnm::gnm()`. These are essentially linear models for the `log` of frequency. In `vcdExtra`, there are several methods for a list of such models, of class `"glmlist"` and `"loglmlist"`, and these should be accommodated in tidy methods.
What is needed are `broom` methods for `loglm` models. The information required is accessible from standard functions, but not in a tidy form.
`glance.loglm()` -- model-level statistics. These are given in the output of the `print()` method, and available from `summary()`. A complication is that both LR and Pearson $\chi^2$ are reported, so these would need to be made to appear in separate columns. There are also related `LRstats()` functions in `vcdExtra`, which report `AIC` and `BIC`.

```r
hec.indep <- loglm(~ Hair + Eye + Sex, data = HairEyeColor)
hec.indep

# extract test statistics
summary(hec.indep)$tests
LRstats(hec.indep)
```
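A rough sketch of what a `glance.loglm()` method might look like, building on `summary()` and `LRstats()`. The column names follow `broom` conventions only loosely, and the positional indexing of the tests matrix is an assumption, not a documented API:

```r
# hedged sketch: rows of summary()$tests are assumed to be the LR and
# Pearson tests, with columns X^2, df and the p-value, in that order
glance.loglm <- function(x, ...) {
  tests <- summary(x)$tests
  stats <- LRstats(x)          # AIC and BIC, via vcdExtra
  tibble::tibble(
    statistic.LR      = tests[1, 1],
    statistic.Pearson = tests[2, 1],
    df                = tests[1, 2],
    p.value.LR        = tests[1, 3],
    p.value.Pearson   = tests[2, 3],
    AIC               = stats$AIC,
    BIC               = stats$BIC
  )
}

glance.loglm(hec.indep)
```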
`tidy.loglm()` -- coefficient-level statistics. These are available from `coef.loglm()`. They would need to be assembled into a long format. Standard errors & p-values might be a problem.

```r
coef(hec.indep)
```
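As a rough sketch (not a full `broom` method), the main-effect coefficients of the model above can be stacked into a long tibble; interaction terms, which `coef()` returns as arrays, would need extra handling:

```r
library(purrr)
library(tibble)

# stack the named main-effect coefficient vectors, dropping the intercept
coef(hec.indep)[-1] |>
  imap(\(est, term) tibble(term = term,
                           level = names(est),
                           estimate = as.vector(est))) |>
  list_rbind()
```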
`augment.loglm()` -- should give case-level statistics: fitted values, residuals, ...

```r
fitted(hec.indep)
residuals(hec.indep)
```
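A rough sketch of what `augment.loglm()` might return: one row per cell, with the factor levels, fitted frequencies and residuals. The column names, the use of `dplyr`, and the choice of residual type are assumptions:

```r
library(dplyr)

# melt the fitted array to one row per cell, then attach residuals;
# both arrays share the same dimension order, so as.vector() aligns them
as.data.frame(as.table(fitted(hec.indep)), responseName = ".fitted") |>
  mutate(.resid = as.vector(residuals(hec.indep))) |>
  head()
```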
What about `hatvalues()`? Not implemented, but shouldn't be too hard.

```r
hatvalues(hec.indep)
```
The most common graphical methods are those implemented in `vcd`: `mosaic()`, association plots (`assoc()`), ..., which rely on the `vcd::strucplot()` framework described in [@vcd:Meyer+Zeileis+Hornik:2006b].
Is there a tidy analog that might work with `ggplot2`? The `ggmosaic` package implements basic marimekko-style mosaic plots. These are not very general, in that they cannot do residual-based shading to show the patterns of association. However, `ggmosaic` is based on the `productplots` package by Hadley Wickham, which seems to provide some basic structure for constructing such displays of nested rectangles.
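For example, a minimal `ggmosaic` sketch of the Hair-Eye association, using case-form data; this shows the basic display only, with no residual-based shading:

```r
library(ggplot2)
library(ggmosaic)

# a basic marimekko-style mosaic of Hair by Eye from case-form data
hec_cases <- expand.dft(as.data.frame(HairEyeColor))
ggplot(hec_cases) +
  geom_mosaic(aes(x = product(Hair, Eye), fill = Eye))
```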