% !Rnw weave = Sweave
%\VignetteEngine{Sweave}
%\VignetteIndexEntry{Tutorial: Working with categorical data with R and the vcd package}
%\VignetteDepends{vcd,gmodels,ca}
%\VignetteKeywords{contingency tables, mosaic plots, sieve plots, categorical data, independence, conditional independence, R}
%\VignettePackage{vcdExtra}

\documentclass[10pt,twoside]{article}
\usepackage{Sweave}
\usepackage{bm}
\usepackage[toc]{multitoc} % for table of contents

% from Z.cls
\usepackage[authoryear,round,longnamesfirst]{natbib}
\bibpunct{(}{)}{;}{a}{}{,}

\usepackage{hyperref}
\usepackage{color} %% colors
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}
\hypersetup{%
  hyperindex = {true},
  colorlinks = {true},
%  linktocpage = {true},
  plainpages = {false},
  linkcolor = {Blue},
  citecolor = {Blue},
  urlcolor = {Red},
  pdfstartview = {Fit},
  pdfpagemode = {UseOutlines},
  pdfview = {XYZ null null null}
}
%\AtBeginDocument{
%  \hypersetup{%
%    pdfauthor = {Michael Friendly},
%    pdftitle = {Tutorial: Working with categorical data with R and the vcd package},
%    pdfkeywords = {contingency tables, mosaic plots, sieve plots, categorical data, independence, conditional independence, R}
%  }
%}

% math stuff
\newcommand{\given}{\ensuremath{\, | \,}}
\renewcommand{\vec}[1]{\ensuremath{\bm{#1}}}
\newcommand{\mat}[1]{\ensuremath{\bm{#1}}}
\newcommand{\trans}{\ensuremath{^\mathsf{T}}}
\newcommand{\diag}[1]{\ensuremath{\mathrm{diag} (#1)}}
\def\binom#1#2{{#1 \choose #2}}%
\newcommand{\implies}{ \ensuremath{\mapsto} }

\newenvironment{equation*}{\displaymath}{\enddisplaymath}%

\newcommand{\tabref}[1]{Table~\ref{#1}}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\newcommand{\secref}[1]{Section~\ref{#1}}
\newcommand{\loglin}{loglinear }

%\usepackage{thumbpdf}

% page dimensions
\addtolength{\hoffset}{-1.5cm}
\addtolength{\textwidth}{3cm}
\addtolength{\voffset}{-1cm}
\addtolength{\textheight}{2cm}

% Vignette examples
\newcommand*{\Example}{\fbox{Example:} }

% R stuff
\newcommand{\var}[1]{#1}
\newcommand{\data}[1]{#1}
\newcommand{\class}[1]{\textsf{"#1"}}
%% \code without `-' ligatures
\def\nohyphenation{\hyphenchar\font=-1 \aftergroup\restorehyphenation}
\def\restorehyphenation{\hyphenchar\font=`\-}
{\catcode`\-=\active%
\global\def\code{\bgroup%
\catcode`\-=\active \let-\codedash%
\Rdcode}}
\def\codedash{-\discretionary{}{}{}}
\def\Rdcode#1{\texttt{\nohyphenation#1}\egroup}

\newcommand{\codefun}[1]{#1()} \let\proglang=\textsf \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\Rpackage}[1]{{\textsf{#1}}}

%% almost as usual
\author{Michael Friendly\\York University, Toronto}
\title{Working with categorical data with R and the vcd and vcdExtra packages}
\date{\footnotesize{Using \Rpackage{vcdExtra} version \Sexpr{packageDescription("vcdExtra")[["Version"]]} and \Rpackage{vcd} version \Sexpr{packageDescription("vcd")[["Version"]]}; Date: \Sexpr{Sys.Date()}}}

%% for pretty printing and a nice hypersummary also set: %\Plainauthor{Michael Friendly} %% comma-separated %\Shorttitle{vcd tutorial} %% a short title (if necessary) %\Plaintitle{Tutorial: Working with categorical data with R and the vcd package}

%\SweaveOpts{engine=R,eps=TRUE,height=6,width=7,results=hide,fig=FALSE,echo=TRUE}
\SweaveOpts{engine=R,height=6,width=7,results=hide,fig=FALSE,echo=TRUE}
\SweaveOpts{prefix.string=fig/vcd-tut,eps=FALSE}
\SweaveOpts{keep.source=TRUE}
%\SweaveOpts{concordance=TRUE}
\setkeys{Gin}{width=0.7\textwidth}

set.seed(1071)
#library(vcd)
library(vcdExtra)
library(ggplot2)
#data(Titanic)
data(HairEyeColor)
data(PreSex)
data(Arthritis)
art <- xtabs(~Treatment + Improved, data = Arthritis)
if(!file.exists("fig")) dir.create("fig")

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\SweaveOpts{concordance=TRUE}

%% an abstract and keywords

This tutorial describes the creation and manipulation of frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. The framework is provided by the R package vcd, but other packages are used to help with various tasks.
The vcdExtra package extends the graphical and statistical methods provided by vcd.

The vcdExtra package is now the main support package for the book \emph{Discrete Data Analysis with R: Visualizing and Modeling Techniques for Categorical and Count Data} \citep{FriendlyMeyer:2016:DDAR}. The web page for the book, ddar.datavis.ca, gives further details.

%\keywords{contingency tables, mosaic plots, sieve plots, %categorical data, independence, conditional independence, generalized linear models, %R} %\Plainkeywords{contingency tables, mosaic plots, % sieve plots, categorical data, independence, % conditional independence, generalized linear models, R}

{\small
\tableofcontents
}

\section{Introduction}
\label{sec:intro}

%% Note: If there is markup in (sub)section, then it has to be escape as above.

This tutorial, part of the vcdExtra package, describes how to work with categorical data in the context of fitting statistical models in R and visualizing the results using the vcd and vcdExtra packages. It focuses first on methods and tools for creating and manipulating R data objects which represent frequency and contingency tables involving categorical variables.

Further sections describe some simple methods for calculating tests of independence and measures of association among categorical variables, and also methods for graphically displaying results.

There is much more to the analysis of categorical data than is described here, where the emphasis is on cross-tabulated tables of frequencies (``contingency tables''), statistical tests, associated loglinear models, and visualization of how variables are related.

A more general treatment of graphical methods for categorical data is contained in the book \emph{Discrete Data Analysis with R: Visualizing and Modeling Techniques for Categorical and Count Data} \citep{FriendlyMeyer:2016:DDAR}. An earlier book using SAS is \emph{Visualizing Categorical Data} \citep{vcd:Friendly:2000}, for which vcd is a partial R companion, covering topics not otherwise available in R.
On the other hand, the implementation of graphical methods in vcd is more general in many respects than what I provided in SAS. Statistical models for categorical data in R have been extended considerably with the gnm package for generalized nonlinear models. The vcdExtra package extends vcd methods to models fit using glm() and gnm().
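For instance, a loglinear independence model for the art table created in the setup chunk above can be fit with glm() and then visualized with the mosaic() method that vcdExtra provides for \class{glm} objects. This is a sketch; the argument choices are illustrative:

```r
# Sketch: fit the independence model [Treatment][Improved] with glm(),
# then plot it with vcdExtra's mosaic() method for "glm" objects
library(vcdExtra)
data(Arthritis)
art <- xtabs(~ Treatment + Improved, data = Arthritis)
indep <- glm(Freq ~ Treatment + Improved,
             family = poisson, data = as.data.frame(art))
mosaic(indep, main = "Arthritis: independence model")
```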

A more complete theoretical description of these statistical methods is provided in Agresti's \citeyearpar{vcd:Agresti:2002,Agresti:2013} \emph{Categorical Data Analysis}. For that text, see the S-PLUS/R companion by Laura Thompson, [http://www.stat.ufl.edu/~aa/cda/Thompson_manual.pdf], and Agresti's support web page, [http://www.stat.ufl.edu/~aa/cda/cda.html].

\section{Creating and manipulating frequency tables}
\label{sec:creating}

R provides many methods for creating frequency and contingency tables. Several are described below. In the examples below, we use some real examples and some anonymous ones, where the variables A, B, and C represent categorical variables, and X represents an arbitrary R data object.

The first thing you need to know is that categorical data can be represented in three different forms in R, and it is sometimes necessary to convert from one form to another, for carrying out statistical tests, fitting models or visualizing the results. Once a data object exists in R, you can examine its complete structure with the str() function, or view the names of its components with the names() function.

Example: The Arthritis data is available in case form in the vcd package. There are two explanatory factors: Treatment and Sex. Age is a numeric covariate, and Improved is the response---an ordered factor, with levels \Sexpr{paste(levels(Arthritis$Improved), collapse=' < ')}.
Excluding Age, we would have a $2 \times 2 \times 3$ contingency table for Treatment, Sex and Improved. %"None" < "Some" < "Marked".

names(Arthritis)      # show the variables
str(Arthritis)        # show the structure
head(Arthritis,5)     # first 5 observations, same as Arthritis[1:5,] 

Example: For small frequency tables, it is often convenient to enter them in frequency form using expand.grid() for the factors and c() to list the counts in a vector. The example below, from \citet{vcd:Agresti:2002}, gives results for the 1991 General Social Survey, with respondents classified by sex and party identification.

# Agresti (2002), table 3.11, p. 106
GSS <- data.frame(
  expand.grid(sex=c("female", "male"), 
              party=c("dem", "indep", "rep")),
  count=c(279,165,73,47,225,191))
GSS
names(GSS)
str(GSS)
sum(GSS$count)

Example: The HairEyeColor data is stored in table form in vcd.

str(HairEyeColor)                      # show the structure
sum(HairEyeColor)                      # number of cases
sapply(dimnames(HairEyeColor), length) # table dimension sizes

Example: Enter frequencies in a matrix, and assign dimnames, giving the variable names and category labels. Note that, by default, matrix() fills the result by columns from the elements supplied, unless you specify byrow=TRUE.

## A 4 x 4 table  Agresti (2002, Table 2.8, p. 57) Job Satisfaction
JobSat <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4)
dimnames(JobSat) = list(income=c("< 15k", "15-25k", "25-40k", "> 40k"),
                satisfaction=c("VeryD", "LittleD", "ModerateS", "VeryS"))
JobSat

JobSat is a matrix, not an object of class \class{table}, and some functions are happier with tables than matrices. You can coerce it to a table with as.table():

JobSat <- as.table(JobSat)
str(JobSat)

\subsection{Ordered factors and reordered tables}
\label{sec:ordered-factors}

In table form, the values of the table factors are ordered by their position in the table. Thus in the JobSat data, both income and satisfaction represent ordered factors, and the positions of the values in the rows and columns reflect their ordered nature.

Yet, for analysis, there are times when you need numeric values for the levels of ordered factors in a table, e.g., to treat a factor as a quantitative variable. In such cases, you can simply re-assign the dimnames attribute of the table variables. For example, here we assign numeric values to income as the middle of their ranges, and treat satisfaction as equally spaced with integer scores.

dimnames(JobSat)$income<-c(7.5,20,32.5,60)
dimnames(JobSat)$satisfaction<-1:4

For the HairEyeColor data, hair color and eye color are ordered arbitrarily. For visualizing the data using mosaic plots and other methods described below, it turns out to be more useful to assure that both hair color and eye color are ordered from dark to light. Hair colors are actually ordered this way already, and it is easiest to re-order eye colors by indexing. Again str() is your friend.

HairEyeColor <- HairEyeColor[, c(1,3,4,2), ]
str(HairEyeColor)

This is also the order for both hair color and eye color shown in the result of a correspondence analysis (\figref{fig:ca-haireye}) below.

With data in case form or frequency form, when you have ordered factors represented with character values, you must ensure that they are treated as ordered in R.% \footnote{In SAS, many procedures offer the option order = data | internal | formatted to allow character values to be ordered according to (a) their order in the data set, (b) sorted internal value, or (c) sorted formatted representation provided by a SAS format. }

Imagine that the Arthritis data was read from a text file.
By default, the levels of Improved will be ordered alphabetically: Marked, None, Some---not what we want. In this case, the function ordered() (and others) can be useful.

Arthritis <- read.csv("arthritis.txt",header=TRUE)
Arthritis$Improved <- ordered(Arthritis$Improved, levels=c("None", "Some", "Marked"))

With Improved (the response in this data) ordered in this way, a mosaic display of Treatment and Improved (\figref{fig:arthritis}) shows a clearly interpretable pattern.

mosaic(art, gp = shading_max, split_vertical = TRUE, main="Arthritis: [Treatment] [Improved]")

%\setkeys{Gin}{width=0.7\textwidth}
\begin{figure}[htb]
\centering
\includegraphics[width=0.7\textwidth]{fig/vcd-tut-Arthritis}
\caption{Mosaic plot for the Arthritis data, showing the marginal model of independence for Treatment and Improved. Age, a covariate, and Sex are ignored here.}
\label{fig:arthritis}
\end{figure}

Finally, there are situations where, particularly for display purposes, you want to re-order the dimensions of an $n$-way table, or change the labels for the variables or levels. This is easy when the data are in table form: aperm() permutes the dimensions, and assigning to names and dimnames changes variable names and level labels respectively. We will use the following version of UCBAdmissions in \secref{sec:mantel} below.% \footnote{ Changing Admit to Admit? might be useful for display purposes, but is dangerous--- because it is then difficult to use that variable name in a model formula. See \secref{sec:tips} for options labeling_args and set_labels to change variable and level names for displays in the strucplot framework. }

UCB <- aperm(UCBAdmissions, c(2, 1, 3))
dimnames(UCB)[[2]] <- c("Yes", "No")
names(dimnames(UCB)) <- c("Sex", "Admit?", "Department")
ftable(UCB)


\subsection{structable()}
\label{sec:structable}

For 3-way and larger tables the structable() function in vcd provides a convenient and flexible tabular display. The variables assigned to the rows and columns of a two-way display can be specified by a model formula.

structable(HairEyeColor)                   # show the table: default
structable(Hair+Sex ~ Eye, HairEyeColor)   # specify col ~ row variables

It also returns an object of class "structable" which may be plotted with mosaic() (not shown here).

HSE <- structable(Hair+Sex ~ Eye, HairEyeColor)    # save structable object
mosaic(HSE)                                        # plot it

\subsection{table() and friends}
\label{sec:table}

You can generate frequency tables from factor variables using the table() function, tables of proportions using the prop.table() function, and marginal frequencies using margin.table().

 n=500
 A <- factor(sample(c("a1","a2"), n, rep=TRUE))
 B <- factor(sample(c("b1","b2"), n, rep=TRUE))
 C <- factor(sample(c("c1","c2"), n, rep=TRUE))
 mydata <- data.frame(A,B,C)
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B)   # A will be rows, B will be columns
mytable                 # print table

margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)

prop.table(mytable)    # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages

table() can also generate multidimensional tables based on 3 or more categorical variables. In this case, use the ftable() or structable() function to print the results more attractively.

# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)

table() ignores missing values by default. To include NA as a category in counts, include the table option exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor using \code{newfactor <- factor(oldfactor, exclude=NULL)}.
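A small sketch of these options (the vector x here is hypothetical):

```r
# hypothetical character vector with a missing value
x <- c("a", "b", NA, "a")
table(x)                  # NA is dropped by default
table(x, exclude = NULL)  # NA counted as its own category (vector case)

# for a factor, first rebuild it so that NA becomes a level
f <- factor(x)
table(factor(f, exclude = NULL))
```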

\subsection{xtabs()}
\label{sec:xtabs}

The xtabs() function allows you to create cross-tabulations of data using formula style input. This typically works with case-form data supplied in a data frame or a matrix. The result is a contingency table in array format, whose dimensions are determined by
the terms on the right side of the formula.

# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable)    # print table
summary(mytable)   # chi-square test of independence

If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies (useful if the data have already been tabulated in frequency form).

(GSStab <- xtabs(count ~ sex + party, data=GSS))
summary(GSStab)

\subsection{Collapsing over table factors: \codefun{aggregate}, \codefun{margin.table} and \codefun{apply}}

It sometimes happens that we have a data set with more variables or factors than we want to analyse, or else, having done some initial analyses, we decide that certain factors are not important, and so should be excluded from graphic displays by collapsing (summing) over them. For example, mosaic plots and fourfold displays are often simpler to construct from versions of the data collapsed over the factors which are not shown in the plots.

The appropriate tools to use again depend on the form in which the data are represented--- a case-form data frame, a frequency-form data frame (aggregate()), or a table-form array or table object (margin.table() or apply()).

When the data are in frequency form, and we want to produce another frequency data frame, aggregate() is a handy tool, using the argument FUN=sum to sum the frequency variable over the factors not mentioned in the formula.

Example: The data frame DaytonSurvey in the vcdExtra package represents a $2^5$ table giving the frequencies of reported use (``ever used?'') of alcohol, cigarettes and marijuana in a sample of high school seniors, also classified by sex and race.

data(DaytonSurvey, package = "vcdExtra")
str(DaytonSurvey)
head(DaytonSurvey)

To focus on the associations among the substances, we want to collapse over sex and race. The right-hand side of the formula used in the call to aggregate() gives the factors to be retained in the new frequency data frame, Dayton.ACM.df.

# data in frequency form
# collapse over sex and race
Dayton.ACM.df <- aggregate(Freq ~ cigarette+alcohol+marijuana, 
                           data=DaytonSurvey, FUN=sum)
Dayton.ACM.df

When the data are in table form, and we want to produce another table, apply() with FUN=sum can be used in a similar way to sum the table over dimensions not mentioned in the MARGIN argument. margin.table() is just a wrapper for apply() using the sum() function.

Example: To illustrate, we first convert the DaytonSurvey to a 5-way table using xtabs(), giving Dayton.tab.

# in table form
Dayton.tab <- xtabs(Freq~cigarette+alcohol+marijuana+sex+race, data=DaytonSurvey)
structable(cigarette+alcohol+marijuana ~ sex+race, data=Dayton.tab)

Then, use apply() on Dayton.tab to give the 3-way table Dayton.ACM.tab summed over sex and race. The elements in this new table are the column sums for Dayton.tab shown by structable() just above.

# collapse over sex and race
Dayton.ACM.tab <- apply(Dayton.tab, MARGIN=1:3, FUN=sum)
Dayton.ACM.tab <- margin.table(Dayton.tab, 1:3)   # same result
structable(cigarette+alcohol ~ marijuana, data=Dayton.ACM.tab)

Many of these operations can be performed using the **ply() functions in the plyr package. For example, with the data in a frequency form data frame, use ddply() to collapse over unmentioned factors, and plyr::summarise() as the function to be applied to each piece.

library(plyr)
Dayton.ACM.df <- plyr::ddply(DaytonSurvey, 
                             .(cigarette, alcohol, marijuana), 
                             plyr::summarise, Freq=sum(Freq))

Dayton.ACM.df

\subsection{Collapsing table levels: collapse.table()}

A related problem arises when we have a table or array and for some purpose we want to reduce the number of levels of some factors by summing subsets of the frequencies. For example, we may have initially coded Age in 10-year intervals, and decide that, either for analysis or display purposes, we want to reduce Age to 20-year intervals. The collapse.table() function in vcdExtra was designed for this purpose.

Example: Create a 3-way table, and collapse Age from 10-year to 20-year intervals. First, we generate a $2 \times 6 \times 3$ table of random counts from a Poisson distribution with mean of 100.

# create some sample data in frequency form
sex <- c("Male", "Female")
age <- c("10-19", "20-29",  "30-39", "40-49", "50-59", "60-69")
education <- c("low", 'med', 'high')
data <- expand.grid(sex=sex, age=age, education=education)
counts <- rpois(36, 100)   # random Poisson cell frequencies
data <- cbind(data, counts)

# make it into a 3-way table
t1 <- xtabs(counts ~ sex + age + education, data=data)
structable(t1)

Now collapse age to 20-year intervals, and education to 2 levels. In the arguments, levels of age and education given the same label are summed in the resulting smaller table.

# collapse age to 3 levels, education to 2 levels
t2 <- collapse.table(t1, 
         age=c("10-29", "10-29",  "30-49", "30-49", "50-69", "50-69"),
         education=c("<high", "<high", "high"))
structable(t2)

\subsection{Converting among frequency tables and data frames}

As we've seen, a given contingency table can be represented equivalently in different forms, but some R functions were designed for one particular representation. \tabref{tab:convert} shows some handy tools for converting from one form to another.

\begin{table}[htb]
\caption{Tools for converting among different forms for categorical data}
\label{tab:convert}
\begin{center}
\begin{tabular}{llll}
\hline
               & \multicolumn{3}{c}{To this} \\
From this      & Case form            & Frequency form                    & Table form \\
\hline
Case form      & noop                 & \code{xtabs(\textasciitilde A+B)} & \code{table(A,B)} \\
Frequency form & \code{expand.dft(X)} & noop                              & \code{xtabs(count\textasciitilde A+B)} \\
Table form     & \code{expand.dft(X)} & \code{as.data.frame(X)}           & noop \\
\hline
\end{tabular}
\end{center}
\end{table}

A contingency table in table form (an object of class \class{table}) can be converted to a data.frame with as.data.frame().% \footnote{ Because R is object-oriented, this is actually a short-hand for the function as.data.frame.table(). }
The resulting data.frame contains columns representing the classifying factors and the table entries (as a column named by the responseName argument, defaulting to Freq). This is the inverse of xtabs().

Example: Convert the GSStab in table form to a data.frame in frequency form.

as.data.frame(GSStab)

Example: Convert the Arthritis data in case form to a 3-way table of Treatment $\times$ Sex $\times$ Improved. Note the use of with() to avoid having to use Arthritis\$Treatment etc. within the call to table().% \footnote{ table() does not allow a data argument to provide an environment in which the table variables are to be found. In the examples in \secref{sec:table} I used attach(mydata) for this purpose, but attach() leaves the variables in the global environment, while with() just evaluates the table() expression in a temporary environment of the data. }

Art.tab <-with(Arthritis, table(Treatment, Sex, Improved))
str(Art.tab)
ftable(Art.tab)

There may also be times that you will need an equivalent case form data.frame with factors representing the table variables rather than the frequency table. For example, the mca() function in package MASS only operates on data in this format. Marc Schwartz provided code for expand.dft() on the R-help mailing list for converting a table back into a case form data.frame. This function is included in vcdExtra.

Example: Convert the Arthritis data in table form (Art.tab) back to a data.frame in case form, with factors Treatment, Sex and Improved.

Art.df <- expand.dft(Art.tab)
str(Art.df)

\subsection{A complex example}
\label{sec:complex}

If you've followed so far, you're ready for a more complicated example. The data file, tv.dat, represents a 4-way table of size $5 \times 11 \times 5 \times 3$ where the table variables (unnamed in the file) are read as V1 -- V4, and the cell frequency is read as V5. The file, stored in the extdata directory of vcdExtra, can be read as follows:

tv.data<-read.table(system.file("extdata","tv.dat", package="vcdExtra"))
head(tv.data,5)

For a local file, just use read.table() in this form:

tv.data<-read.table("C:/R/data/tv.dat")

The data tv.dat came from the initial implementation of mosaic displays in R by Jay Emerson. In turn, they came from the initial development of mosaic displays \citep{vcd:Hartigan+Kleiner:1984}, which illustrated the method with data on a large sample of TV viewers whose behavior had been recorded for the Nielsen ratings. This data set contains sample television audience data from Nielsen Media Research for the week starting November 6, 1995.

The table variables are:
\begin{description}
\item[V1] values 1:5 correspond to the days Monday--Friday;
\item[V2] values 1:11 correspond to the quarter-hour times 8:00PM through 10:30PM;
\item[V3] values 1:5 correspond to ABC, CBS, NBC, Fox, and non-network choices;
\item[V4] values 1:3 correspond to transition states: turn the television Off, Switch channels, or Persist in viewing the current channel.
\end{description}

We are interested just in the cell frequencies, and rely on the facts that (a) the table is complete---there are no missing cells, so nrow(tv.data) = \Sexpr{nrow(tv.data)}; (b) the observations are ordered so that V1 varies most rapidly and V4 most slowly. From this, we can just extract the frequency column and reshape it into an array.

TV <- array(tv.data[,5], dim=c(5,11,5,3))                                        
dimnames(TV) <- list(c("Monday","Tuesday","Wednesday","Thursday","Friday"), 
                c("8:00","8:15","8:30","8:45","9:00","9:15","9:30",         
                  "9:45","10:00","10:15","10:30"),                            
                c("ABC","CBS","NBC","Fox","Other"), c("Off","Switch","Persist"))
names(dimnames(TV))<-c("Day", "Time", "Network", "State")

More generally (even if there are missing cells), we can use xtabs() (or plyr::daply()) to do the cross-tabulation, using V5 as the frequency variable. Here's how to do this same operation with xtabs():

TV <- xtabs(V5 ~ ., data=tv.data)
dimnames(TV) <- list(Day=c("Monday","Tuesday","Wednesday","Thursday","Friday"), 
                Time=c("8:00","8:15","8:30","8:45","9:00","9:15","9:30",         
                       "9:45","10:00","10:15","10:30"),                            
                Network=c("ABC","CBS","NBC","Fox","Other"), 
                State=c("Off","Switch","Persist"))

But this 4-way table is too large and awkward to work with. Among the networks, Fox and Other occur infrequently. We can also cut it down to a 3-way table by considering only viewers who persist with the current station.% \footnote{This relies on the fact that indexing an array drops dimensions of length 1 by default, using the argument drop=TRUE; the result is coerced to the lowest possible dimension. }

TV <- TV[,,1:3,]     # keep only ABC, CBS, NBC
TV <- TV[,,,3]       # keep only Persist -- now a 3 way table
structable(TV)

Finally, for some purposes, we might want to collapse the 11 times into a smaller number. Here, we use as.data.frame.table() to convert the table back to a data frame, levels() to re-assign the values of Time, and finally, xtabs() to give a new, collapsed frequency table.

TV.df <- as.data.frame.table(TV)
levels(TV.df$Time) <- c(rep("8:00-8:59",4),rep("9:00-9:59",4), rep("10:00-10:44",3))
TV2 <- xtabs(Freq ~ Day + Time + Network, TV.df)
structable(Day ~ Time+Network,TV2)

Whew! See \figref{fig:TV-mosaic} for a mosaic plot of the TV2 data.

\section{Tests of Independence}

\subsection{CrossTable}

OK, now we're ready to do some analyses. For tabular displays, the CrossTable() function in the gmodels package produces cross-tabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS. It has a wealth of options for the quantities that can be shown in each cell.

# 2-Way Cross Tabulation
library(gmodels)
CrossTable(GSStab,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE)

There are options to report percentages (row, column, cell), specify decimal places, produce Chi-square, Fisher, and McNemar tests of independence, report expected and residual values (pearson, standardized, adjusted standardized), include missing values as valid, annotate with row and column titles, and format as SAS or SPSS style output! See help(CrossTable) for details.
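For example, a sketch of a few of these options (the arguments shown are documented in help(CrossTable)):

```r
# chi-square test, expected counts, and SPSS-style formatting
library(gmodels)
CrossTable(GSStab, expected = TRUE, chisq = TRUE, format = "SPSS")
```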

\subsection{Chi-square test}

For 2-way tables you can use chisq.test() to test independence of the row and column variable. By default, the $p$-value is calculated from the asymptotic chi-squared distribution of the test statistic. Optionally, the $p$-value can be derived via Monte Carlo simulation.

(HairEye <- margin.table(HairEyeColor, c(1, 2)))
chisq.test(HairEye)
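The Monte Carlo option looks like this; B = 2000 replicates is an arbitrary illustrative choice:

```r
# p-value estimated by Monte Carlo simulation rather than the
# asymptotic chi-squared distribution
chisq.test(HairEye, simulate.p.value = TRUE, B = 2000)
```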

\subsection{Fisher Exact Test}
\label{sec:Fisher}

fisher.test(X) provides an exact test of independence. X must be a two-way contingency table in table form. Another form, fisher.test(X, Y) takes two categorical vectors of the same length.
For tables larger than $2 \times 2$ the method can be computationally intensive (or can fail) if the frequencies are not small.

fisher.test(GSStab)
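The two-vector form can be sketched with the Arthritis case-form data (an illustrative call, testing Treatment against Improved):

```r
# fisher.test(X, Y): two categorical vectors of the same length
with(Arthritis, fisher.test(Treatment, Improved))
```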

But fisher.test() does not work for the HairEye data, whose total frequency is $n=592$; an exact test is unnecessary in this case anyway. The call fails with the error shown below.

fisher.test(HairEye)


Error in fisher.test(HairEye) : FEXACT error 6. LDKEY is too small for this problem. Try increasing the size of the workspace.

\subsection{Mantel-Haenszel test and conditional association}
\label{sec:mantel}

Use the mantelhaen.test(X) function to perform a Cochran-Mantel-Haenszel $\chi^2$ test of the null hypothesis that two nominal variables are conditionally independent, $A \perp B \given C$, in each stratum, assuming that there is no three-way interaction. X is a 3-dimensional contingency table, where the last dimension refers to the strata.

The UCBAdmissions data serves as an example of a $2 \times 2 \times 6$ table, with Dept as the stratifying variable.

## UC Berkeley Student Admissions
mantelhaen.test(UCBAdmissions)

The results show no evidence for association between admission and gender when adjusted for department. However, we can easily see that the assumption of equal association across the strata (no 3-way association) is probably violated. For $2 \times 2 \times k$ tables, this can be examined from the odds ratios for each $2 \times 2$ table (oddsratio()), and tested by using woolf_test() in vcd.


oddsratio(UCBAdmissions, log=FALSE)
lor <- oddsratio(UCBAdmissions)  # capture log odds ratios
summary(lor)
woolf_test(UCBAdmissions) 

We can visualize the odds ratios of Admission for each department with fourfold displays using fourfold(). The cell frequencies $n_{ij}$ of each $2 \times 2$ table are shown as a quarter circle whose radius is proportional to $\sqrt{n_{ij}}$, so that its area is proportional to the cell frequency. Confidence rings for the odds ratio allow a visual test of the null of no association; the rings for adjacent quadrants overlap iff the observed counts are consistent with the null hypothesis. In the extended version (the default), brighter colors are used where the odds ratio is significantly different from 1. The following lines produce \figref{fig:fourfold1}.% \footnote{The color values col[3:4] were modified from their default values to show a greater contrast between significant and insignificant associations here.}

col <- c("#99CCFF", "#6699CC", "#F9AFAF", "#6666A0", "#FF0000", "#000080")
fourfold(UCB,mfrow=c(2,3), color=col)

%\setkeys{Gin}{width=0.8\textwidth}
\begin{figure}[htb]
\centering
\includegraphics[width=0.8\textwidth,trim=80 50 80 50]{fig/vcd-tut-fourfold1}
\caption{Fourfold display for the UCBAdmissions data. Where the odds ratio differs significantly from 1.0, the confidence bands do not overlap, and the circle quadrants are shaded more intensely.}
\label{fig:fourfold1}
\end{figure}

Another vcd function, cotabplot(), provides a more general approach to visualizing conditional associations in contingency tables, similar to trellis-like plots produced by coplot() and lattice graphics. The panel argument supplies a function used to render each conditional subtable. The following gives a display (not shown) similar to \figref{fig:fourfold1}.

cotabplot(UCB, panel = cotab_fourfold)

When we want to view the conditional probabilities of a response variable (e.g., Admit) in relation to several factors, an alternative visualization is a doubledecker() plot. This plot is a specialized version of a mosaic plot, which highlights the levels of a response variable (plotted vertically) in relation to the factors (shown horizontally). The following call produces \figref{fig:doubledecker}, where we use indexing on the first factor (Admit) to make Admitted the highlighted level.

In this plot, the association between Admit and Gender is shown where the heights of the highlighted conditional probabilities do not align. The excess of females admitted in Dept A stands out here.

doubledecker(Admit ~ Dept + Gender, data=UCBAdmissions[2:1,,])

[htb]

\includegraphics[width=0.9\textwidth]{fig/vcd-tut-doubledecker} \caption{Doubledecker display for the UCBAdmissions data. The heights of the highlighted bars show the conditional probabilities of Admit, given Dept and Gender.} {#fig:doubledecker}

Finally, there is a plot() method for oddsratio objects. By default, it shows the 95% confidence interval for the log odds ratio. \figref{fig:oddsratio} is produced by:

plot(lor, xlab="Department", ylab="Log Odds Ratio (Admit | Gender)")

\setkeys{Gin}{width=0.5\textwidth} [htb]

plot(lor, xlab="Department", ylab="Log Odds Ratio (Admit | Gender)")

\caption{Log odds ratio plot for the UCBAdmissions data.} {#fig:oddsratio}

Cochran-Mantel-Haenszel tests for ordinal factors {#sec:CMH}

The standard $\chi^2$ tests for association in a two-way table treat both table factors as nominal (unordered) categories. When one or both factors of a two-way table are quantitative or ordinal, more powerful tests of association may be obtained by taking ordinality into account, using row and/or column scores to test for linear trends or differences in row or column means.

More general versions of the CMH tests (Landis et al., 1978) are provided by assigning numeric scores to the row and/or column variables. For example, with two ordinal factors (assumed to be equally spaced), assigning integer scores 1:R and 1:C tests the linear $\times$ linear component of association. This is statistically equivalent to the Pearson correlation between the integer-scored table variables, with $\chi^2 = (n-1) r^2$ on only 1 $df$, rather than the $(R-1)\times(C-1)$ $df$ of the test of general association.
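As a rough check of this equivalence, the correlation statistic can be computed by hand. The sketch below uses a small hypothetical $3 \times 3$ table (the counts are made up purely for illustration) and base R only:

```r
## Linear-by-linear (correlation) CMH component by hand,
## using integer scores 1:R and 1:C on a hypothetical 3 x 3 table
tab <- as.table(matrix(c(10, 5, 2,
                          6, 8, 4,
                          3, 7, 9), nrow = 3, byrow = TRUE))
dat <- as.data.frame(tab)                 # convert to frequency form
rsc <- as.numeric(dat$Var1)               # row scores 1:R
csc <- as.numeric(dat$Var2)               # column scores 1:C
n   <- sum(tab)
## frequency-weighted Pearson correlation of the scored variables
r   <- cov.wt(cbind(rsc, csc), wt = dat$Freq, cor = TRUE)$cor[1, 2]
(n - 1) * r^2                             # chi-square statistic on 1 df
```

The value should agree with the `cor` component reported by CMHtest() when integer scores are used for both variables.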

When only one table variable is ordinal, these general CMH tests are analogous to an ANOVA, testing whether the row mean scores or column mean scores are equal, again consuming fewer $df$ than the test of general association.

The CMHtest() function in vcdExtra now calculates these various CMH tests for two possibly ordered factors, optionally stratified by other factor(s).

Example: Recall the $4 \times 4$ table, JobSat introduced in \secref{sec:creating},

JobSat

Treating the satisfaction levels as equally spaced, but using midpoints of the income categories as row scores gives the following results:

CMHtest(JobSat, rscores=c(7.5,20,32.5,60))

Note that with the relatively small cell frequencies, the test of general association gives no evidence of association. However, the cor test for linear $\times$ linear association on 1 df is nearly significant. The coin package contains the functions cmh_test() and lbl_test() for CMH tests of general association and linear $\times$ linear association, respectively.

Measures of Association

There are a variety of statistical measures of strength of association for contingency tables--- similar in spirit to $r$ or $r^2$ for continuous variables. With a large sample size, even a small degree of association can show a significant $\chi^2$, as in the example below for the GSS data.

The assocstats() function in vcd calculates the $\phi$ coefficient, the contingency coefficient, and Cramer's V for an $r \times c$ table. The input must be in table form, a two-way $r \times c$ table.
It won't work with GSS in frequency form, but by now you should know how to convert.

assocstats(GSStab)
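For intuition, Cramer's V can also be computed directly from the Pearson $\chi^2$. The sketch below uses a hypothetical $2 \times 2$ table (not GSStab) and base R only:

```r
## Cramer's V = sqrt( chi^2 / (n * (min(r, c) - 1)) ), by hand
tab <- matrix(c(30, 70,
                60, 40), nrow = 2, byrow = TRUE)   # made-up counts
X2  <- chisq.test(tab, correct = FALSE)$statistic  # Pearson chi-square
n   <- sum(tab)
sqrt(X2 / (n * (min(dim(tab)) - 1)))               # Cramer's V
```

For a $2 \times 2$ table this reduces to $|\phi| = \sqrt{\chi^2 / n}$.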

For tables with ordinal variables, like JobSat, some people prefer the Goodman-Kruskal $\gamma$ statistic (\citet[\S 2.4.3]{vcd:Agresti:2002}) based on a comparison of concordant and discordant pairs of observations in the case-form equivalent of a two-way table.

GKgamma(JobSat)
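To make the concordant/discordant idea concrete, here is a minimal by-hand version of $\gamma = (C - D)/(C + D)$, where $C$ and $D$ count concordant and discordant pairs of observations in a table with ordered rows and columns:

```r
## Goodman-Kruskal gamma by counting concordant (C) and
## discordant (D) pairs of observations in a two-way table
gamma_hand <- function(tab) {
  R <- nrow(tab); S <- ncol(tab)
  C <- D <- 0
  for (i in seq_len(R)) for (j in seq_len(S)) {
    if (i < R && j < S)                    # pairs below and to the right
      C <- C + tab[i, j] * sum(tab[(i + 1):R, (j + 1):S])
    if (i < R && j > 1)                    # pairs below and to the left
      D <- D + tab[i, j] * sum(tab[(i + 1):R, 1:(j - 1)])
  }
  (C - D) / (C + D)
}
gamma_hand(JobSat)    # compare with GKgamma(JobSat)
```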

A web article by Richard Darlington, [http://www.psych.cornell.edu/Darlington/crosstab/TABLE0.HTM] gives further description of these and other measures of association.

Measures of Agreement

The Kappa() function in the vcd package calculates Cohen's $\kappa$ and weighted $\kappa$ for a square two-way table with the same row and column categories [@Cohen:60].% \footnote{ Don't confuse this with kappa() in base R that computes something entirely different (the condition number of a matrix). } Normal-theory $z$-tests are obtained by dividing $\kappa$ by its asymptotic standard error (ASE). A confint() method for Kappa objects provides confidence intervals.

(K <- Kappa(SexualFun))
confint(K)
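The definition $\kappa = (P_o - P_e)/(1 - P_e)$ is easy to verify by hand. This sketch uses a hypothetical $2 \times 2$ agreement table (not SexualFun):

```r
## Unweighted Cohen's kappa by hand for a square agreement table
tab <- matrix(c(40, 10,
                 5, 45), nrow = 2, byrow = TRUE)   # made-up ratings
p  <- prop.table(tab)
Po <- sum(diag(p))                       # observed agreement
Pe <- sum(rowSums(p) * colSums(p))       # chance-expected agreement
(Po - Pe) / (1 - Pe)                     # kappa = 0.7 for this table
```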

A visualization of agreement, both unweighted and weighted for degree of departure from exact agreement is provided by the agreementplot() function. \figref{fig:agreesex} shows the agreementplot for the SexualFun data, produced as shown below. The Bangdiwala measures represent the proportion of the shaded areas of the diagonal rectangles, using weights $w_1$ for exact agreement, and $w_2$ for partial agreement one step from the main diagonal.

agree <- agreementplot(SexualFun, main="Is sex fun?")
unlist(agree)

%\setkeys{Gin}{width=0.5\textwidth} [htb]

\includegraphics[width=0.4\textwidth,trim=50 25 50 25]{fig/vcd-tut-agreesex} \caption{Agreement plot for the SexualFun data.} {#fig:agreesex}

In other examples, the agreement plot can help to show sources of disagreement. For example, when the shaded boxes are above or below the diagonal (red) line, a lack of exact agreement can be attributed in part to different frequency of use of categories by the two raters---lack of marginal homogeneity.

Correspondence analysis

Use the ca package for correspondence analysis, a method for visually exploring relationships between the rows and columns of contingency tables. For an $r \times c$ table, the method provides a breakdown of the Pearson $\chi^2$ for association in up to $M = \min(r-1, c-1)$ dimensions, and finds scores for the row ($x_{im}$) and column ($y_{jm}$) categories such that the observations have the maximum possible correlations.% \footnote{ Related methods are the non-parametric CMH tests using assumed row/column scores (\secref{sec:CMH}), the analogous glm() model-based methods (\secref{sec:glm}), and the more general RC models which can be fit using gnm(). Correspondence analysis differs in that it is primarily a descriptive/exploratory method (no significance tests), but it is directly tied to informative graphic displays of the row/column categories. }

Here, we carry out a simple correspondence analysis of the HairEye data. The printed results show that nearly 99% of the association between hair color and eye color can be accounted for in 2 dimensions, of which the first dimension accounts for 90%.

library(ca)
ca(HairEye)

The resulting ca object can be plotted simply with its plot() method, giving the result in \figref{fig:ca-haireye}. plot.ca() does not allow labels for dimensions; these can be added with title(). It can be seen that most of the association is accounted for by the ordering of both hair color and eye color along Dimension 1, a dark to light dimension.

plot(ca(HairEye), main="Hair Color and Eye Color")
title(xlab="Dim 1 (89.4%)", ylab="Dim 2 (9.5%)")

\setkeys{Gin}{width=0.7\textwidth} [htb]

plot(ca(HairEye), main="Hair Color and Eye Color")
title(xlab="Dim 1 (89.4%)", ylab="Dim 2 (9.5%)")

\caption{Correspondence analysis plot for the HairEye data.} {#fig:ca-haireye}

Loglinear Models {#sec:loglin}

You can use the loglm() function in the MASS package to fit log-linear models. Equivalent models can also be fit (from a different perspective) as generalized linear models with the glm() function using the family='poisson' argument, and the gnm package provides a wider range of generalized nonlinear models, particularly for testing structured associations. The visualization methods for these models were originally developed for models fit using loglm(), so this approach is emphasized here. Some extensions of these methods for models fit using glm() and gnm() are contained in the vcdExtra package and illustrated in \secref{sec:glm}.

Assume we have a 3-way contingency table based on variables A, B, and C.
The possible different forms of loglinear models for a 3-way table are shown in \tabref{tab:loglin-3way}. The Model formula column shows how to express each model for loglm() in R.% \footnote{ For glm() or gnm(), with the data in the form of a frequency data.frame, the same model is specified in the form glm(Freq $\sim$ ..., family="poisson"), where Freq is the name of the cell frequency variable and ... specifies the Model formula. } In the Interpretation column, the symbol "$\perp$" is to be read as "is independent of," and "$\given$" means "conditional on," "adjusting for," or just "given".

[htb] \caption{Log-linear Models for Three-Way Tables} {#tab:loglin-3way}

{llll}
\hline
Model & Model formula & Symbol & Interpretation \\
\hline\hline
Mutual independence      & ~A + B + C       & $[A][B][C]$    & $A \perp B \perp C$ \\
Joint independence       & ~A*B + C         & $[AB][C]$      & $(A \: B) \perp C$ \\
Conditional independence & ~(A+B)*C         & $[AC][BC]$     & $(A \perp B) \given C$ \\
All two-way associations & ~A*B + A*C + B*C & $[AB][AC][BC]$ & homogeneous association \\
Saturated model          & ~A*B*C           & $[ABC]$        & 3-way association \\
\hline

For example, the formula ~A + B + C specifies the model of mutual independence with no associations among the three factors. In standard notation for the expected frequencies $m_{ijk}$, this corresponds to

\log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C \equiv `A + B + C`

The parameters $\lambda_i^A , \lambda_j^B$ and $\lambda_k^C$ pertain to the differences among the one-way marginal frequencies for the factors A, B and C.
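To make this concrete, the expected frequencies under mutual independence can be computed directly from the one-way margins, since $m_{ijk} = n \, p_{i\cdot\cdot} \, p_{\cdot j\cdot} \, p_{\cdot\cdot k}$. The sketch below uses the HairEyeColor table, which is fit with loglm() in \secref{sec:loglm}:

```r
## Expected frequencies under mutual independence:
## m_ijk = n * p_i * p_j * p_k, from the one-way margins
n  <- sum(HairEyeColor)
pH <- margin.table(HairEyeColor, 1) / n   # P(Hair)
pE <- margin.table(HairEyeColor, 2) / n   # P(Eye)
pS <- margin.table(HairEyeColor, 3) / n   # P(Sex)
m  <- n * outer(outer(pH, pE), pS)        # 4 x 4 x 2 array of expected counts
all.equal(sum(m), n)                      # expected counts sum to n
```

These expected counts agree with fitted(hec.1) for the independence model fit below.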

Similarly, the model of joint independence, $(A \: B) \perp C$, allows an association between A and B, but specifies that C is independent of both of these and their combinations,

\log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} \equiv `A * B + C`

where the parameters $\lambda_{ij}^{AB}$ pertain to the overall association between A and B (collapsing over C).

In the literature and textbooks, you will often find these models expressed in a shorthand symbolic notation, using brackets [ ] to enclose the high-order terms in the model. Thus, the joint independence model can be denoted [AB][C], as shown in the Symbol column of \tabref{tab:loglin-3way}.

Models of conditional independence allow (and fit) two of the three possible two-way associations. There are three such models, depending on which variable is conditioned upon. For a given conditional independence model, e.g., [AB][AC], the given variable is the one common to all terms, so this example has the interpretation $(B \perp C) \given A$.

Fitting with loglm() {#sec:loglm}

For example, we can fit the model of mutual independence among hair color, eye color and sex in HairEyeColor as

library(MASS)
## Independence model of hair and eye color and sex.  
hec.1 <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.1

Similarly, the models of conditional independence and joint independence are specified as

## Conditional independence
hec.2 <- loglm(~(Hair + Eye) * Sex, data=HairEyeColor)
hec.2
## Joint independence model.  
hec.3 <- loglm(~Hair*Eye + Sex, data=HairEyeColor)
hec.3

Note that printing the model gives a brief summary of the goodness of fit. A set of models can be compared using the anova() function.

anova(hec.1, hec.2, hec.3)


Fitting with glm() and gnm() {#sec:glm}

The glm() approach, and extensions of this in the gnm package allows a much wider class of models for frequency data to be fit than can be handled by loglm(). Of particular importance are models for ordinal factors and for square tables, where we can test more structured hypotheses about the patterns of association than are provided in the tests of general association under loglm(). These are similar in spirit to the non-parametric CMH tests described in \secref{sec:CMH}.

Example: The data Mental in the vcdExtra package gives a two-way table in frequency form classifying young people by their mental health status and parents' socioeconomic status (SES), where both of these variables are ordered factors.

str(Mental)
xtabs(Freq ~ mental+ses, data=Mental)   # display the frequency table

Simple ways of handling ordinal variables involve assigning scores to the table categories; the simplest cases use integer scores, either for the row variable (``column effects'' model), the column variable (``row effects'' model), or both (``uniform association'' model).

indep <- glm(Freq ~ mental + ses, family = poisson, data = Mental)  # independence model

To fit more parsimonious models than general association, we can define numeric scores for the row and column categories

# Use integer scores for rows/cols 
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental) 

Then, the row effects model, the column effects model, and the uniform association model can be fit as follows:

# column effects model (ses)
coleff <- glm(Freq ~ mental + ses + Rscore:ses, family = poisson, data = Mental)

# row effects model (mental)
roweff <- glm(Freq ~ mental + ses + mental:Cscore, family = poisson, data = Mental)

# linear x linear association
linlin <- glm(Freq ~ mental + ses + Rscore:Cscore, family = poisson, data = Mental)

The LRstats() function in vcdExtra provides a nice, compact summary of the fit statistics for a set of models, collected into a glmlist object. Smaller is better for AIC and BIC.

# compare models using AIC, BIC, etc
vcdExtra::LRstats(glmlist(indep, roweff, coleff, linlin))

For specific model comparisons, we can also carry out tests of nested models with anova() when those models are listed from smallest to largest. Here, there are two separate paths from the most restrictive (independence) model through the model of uniform association, to those that allow only one of row effects or column effects.

anova(indep, linlin, coleff, test="Chisq")  
anova(indep, linlin, roweff, test="Chisq")  

The model of linear by linear association seems best on all accounts. For comparison, one might try the CMH tests on these data:

CMHtest(xtabs(Freq~ses+mental, data=Mental))

Non-linear terms

The strength of the gnm package is that it handles a wide variety of models with non-linear terms, in which the parameters enter the model beyond a simple linear function. The simplest example is the Goodman RC(1) model, which allows a multiplicative term to account for the association of the table variables. In the notation of generalized linear models with a log link, this can be expressed as

\log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i} \delta_{j}

where the row-multiplicative effect parameters $\gamma_i$ and corresponding column parameters $\delta_j$ are estimated from the data.% \footnote{ This is similar in spirit to a correspondence analysis with a single dimension, but as a statistical model. } Similarly, the RC(2) model adds two multiplicative terms to the independence model,

\log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i1} \delta_{j1} + \gamma_{i2} \delta_{j2}

In the gnm package, these models may be fit using Mult() to specify the multiplicative term, and instances() to specify several such terms.

Example: For the Mental data, we fit the RC(1) and RC(2) models, and compare these with the independence model.

RC1 <- gnm(Freq ~ mental + ses + Mult(mental,ses), data=Mental, 
             family=poisson, verbose=FALSE)
RC2 <- gnm(Freq ~ mental+ses + instances(Mult(mental,ses),2), data=Mental, 
             family=poisson, verbose=FALSE)
anova(indep, RC1, RC2, test="Chisq")

Mosaic plots {#sec:mosaic}

Mosaic plots provide an ideal method both for visualizing contingency tables and for visualizing the fit--- or more importantly--- lack of fit of a loglinear model. For a two-way table, mosaic() fits a model of independence, $[A][B]$ or ~A+B as an R formula. For $n$-way tables, mosaic() can fit any loglinear model, and can also be used to plot a model fit with loglm(). See @vcd:Friendly:1994,vcd:Friendly:1999 for the statistical ideas behind these uses of mosaic displays in connection with loglinear models.

The essential idea is to recursively sub-divide a unit square into rectangular "tiles" for the cells of the table, such that the area of each tile is proportional to the cell frequency. For a given loglinear model, the tiles can then be shaded in various ways to reflect the residuals (lack of fit) for that model. The pattern of residuals can then be used to suggest a better model or to understand where a given model fits or does not fit.
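As a minimal sketch of this geometry for a two-way table (a made-up $2 \times 2$ table here), the first split gives tile widths proportional to the marginal proportions of the first variable, and the second split gives tile heights proportional to the conditional proportions of the second variable:

```r
## Mosaic tile geometry by hand for a 2 x 2 table:
## area of tile (i, j) = widths[i] * heights[i, j] = n_ij / n
tab <- matrix(c(20, 10,
                 5, 15), nrow = 2, byrow = TRUE)
widths  <- rowSums(tab) / sum(tab)      # first split: marginal proportions
heights <- prop.table(tab, margin = 1)  # second split: conditional proportions
widths[1] * heights[1, ]                # areas of the first row of tiles
tab[1, ] / sum(tab)                     # ... agree with n_1j / n
```

Since widths$\times$heights reproduces the observed cell proportions exactly, any shading then carries the information about lack of fit of a model.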

mosaic() provides a wide range of options for the directions of splitting, the specification of shading, labeling, spacing, legend and many other details. It is actually implemented as a special case of a more general class of displays for $n$-way tables called strucplot, including sieve diagrams, association plots, double-decker plots as well as mosaic plots. For details, see help(strucplot) and the "See also" links, and also @vcd:Meyer+Zeileis+Hornik:2006b, which is available as an R vignette via vignette("strucplot", package="vcd").

\figref{fig:arthritis}, showing the association between Treatment and Improved was produced with the following call to mosaic().

mosaic(art, gp = shading_max, split_vertical = TRUE, 
       main="Arthritis: [Treatment] [Improved]")

Note that the residuals for the independence model were not large (as shown in the legend), yet the association between Treatment and Improved is highly significant.

summary(art)

In contrast, one of the other shading schemes, from @vcd:Friendly:1994 (use: gp = shading_Friendly), uses fixed cutoffs of $\pm 2, \pm 4$, to shade cells which are individually significant at approximately $\alpha = 0.05$ and $\alpha = 0.001$ levels, respectively. The right panel below uses gp = shading_Friendly.

\setkeys{Gin}{width=0.5\textwidth}

mosaic(art, gp = shading_max, split_vertical = TRUE, 
       main="Arthritis: gp = shading_max")
mosaic(art, gp = shading_Friendly, split_vertical = TRUE, 
       main="Arthritis: gp = shading_Friendly")

Mosaics for loglinear models {#sec:mosaic-llm}

When you have fit a loglinear model using loglm(), and saved the result (as a loglm object) the simplest way to display the results is to use the plot() method for the loglm object. Calling mosaic(loglm.object) has the same result. In \secref{sec:loglm} above, we fit several different models to the HairEyeColor data. We can produce mosaic displays of each just by plotting them:

# mosaic plots, using plot.loglm() method
plot(hec.1, main="model: [Hair][Eye][Sex]")
plot(hec.2, main="model: [HairSex][EyeSex]")
plot(hec.3, main="model: [HairEye][Sex]")

\setkeys{Gin}{width=0.32\textwidth}

plot(hec.1, main="model: [Hair][Eye][Sex]")
plot(hec.2, main="model: [HairSex][EyeSex]")
plot(hec.3, main="model: [HairEye][Sex]")

Alternatively, you can supply the model formula to mosaic() with the expected argument. This is passed to loglm(), which fits the model, and returns residuals used for shading in the plot.

For example, here we examine the TV2 constructed in \secref{sec:complex} above. The goal is to see how Network choice depends on (varies with) Day and Time. To do this:

* We fit a model of joint independence of Network on the combinations of Day and Time, with the model formula ~Day:Time + Network.
* To make the display more easily read, we place Day and Time on the vertical axis and Network on the horizontal.
* The Time values overlap on the right vertical axis, so we abbreviate them by assigning shorter dimnames. mosaic() also supports a more sophisticated set of labeling functions; instead of changing the data table, we could have used labeling_args = list(abbreviate = c(Time = 2)) for a similar effect.

The following call to mosaic() produces \figref{fig:TV-mosaic}:

dimnames(TV2)$Time <- c("8", "9", "10")     # re-level for mosaic display
mosaic(~ Day + Network + Time, data=TV2, expected=~Day:Time + Network, 
         legend=FALSE, gp=shading_Friendly)

\setkeys{Gin}{width=0.75\textwidth} [htb]

dimnames(TV2)$Time <- c("8", "9", "10")     # re-level for mosaic display
mosaic(~ Day + Network + Time, data=TV2, expected=~Day:Time + Network, 
         legend=FALSE, gp=shading_Friendly)

\caption{Mosaic plot for the TV data showing the model of joint independence, ~Day:Time + Network.} {#fig:TV-mosaic}

From this, it is easy to read from the display how network choice varies with day and time. For example, CBS dominates in all time slots on Monday; ABC and NBC dominate on Tuesday, particularly in the later time slots; Thursday is an NBC day, while on Friday, ABC gets the greatest share.

In interpreting this mosaic and other plots, it is important to understand that associations included in the model---here, that between day and time---are not shown in the shading of the cells, because they have been fitted (taken into account) in the loglinear model.

For comparison, you might want to try fitting the model of homogeneous association. This allows all pairs of factors to be associated, but asserts that each pairwise association is the same across the levels of the remaining factor. The resulting plot displays the contributions to a 3-way association, but is not shown here.

mosaic(~ Day + Network + Time, data=TV2, 
         expected=~Day:Time + Day:Network + Time:Network, 
         legend=FALSE, gp=shading_Friendly)

Mosaics for glm() and gnm() models {#sec:mosglm}

The vcdExtra package provides an additional method, mosaic.glm() for models fit with glm() and gnm().% \footnote{ Models fit with gnm() are of class = c("gnm", "glm", "lm"), so all *.glm methods apply, unless overridden in the gnm package. } These are not restricted to the Poisson family, but only apply to cases where the response variable is non-negative.

Example: Here, we plot the independence and the linear-by-linear association model for the Mental health data from \secref{sec:glm}. These examples illustrate some of the options for labeling (variable names and residuals printed in cells). Note that the formula supplied to mosaic() for glm objects refers to the order of factors displayed in the plot, not the model.

long.labels <- list(set_varnames = c(mental="Mental Health Status", ses="Parent SES"))
mosaic(indep, ~ses+mental, residuals_type="rstandard",  
     labeling_args = long.labels, labeling=labeling_residuals,
     main="Mental health data: Independence")

mosaic(linlin, ~ses+mental, residuals_type="rstandard", 
     labeling_args = long.labels, labeling=labeling_residuals, suppress=1, 
     gp=shading_Friendly, main="Mental health data: Linear x Linear")

\setkeys{Gin}{width=0.49\textwidth}

long.labels <- list(set_varnames = c(mental="Mental Health Status", ses="Parent SES"))
mosaic(indep, ~ses+mental, residuals_type="rstandard",  
     labeling_args = long.labels, labeling=labeling_residuals,
     main="Mental health data: Independence")
mosaic(linlin, ~ses+mental, residuals_type="rstandard", 
     labeling_args = long.labels, labeling=labeling_residuals, suppress=1, 
     gp=shading_Friendly, main="Mental health data: Linear x Linear")

The gnm package also fits a wide variety of models with nonlinear terms or terms for structured associations of table variables. In the following, we fit the RC(1) model

\log ( m_{ij} ) = \mu + \lambda_i^A + \lambda_j^B + \phi \mu_i \nu_j

This is similar to the linear by linear model, except that the row parameters ($\mu_i$) and column parameters ($\nu_j$) are estimated from the data rather than assigned fixed, equally spaced scores. The multiplicative term is specified by Mult().

Mental$mental <- C(Mental$mental, treatment)
Mental$ses <- C(Mental$ses, treatment)
RC1model <- gnm(Freq ~ mental + ses + Mult(mental, ses),
                family = poisson, data = Mental)
mosaic(RC1model, residuals_type="rstandard", labeling_args = long.labels, 
       labeling=labeling_residuals, suppress=1, gp=shading_Friendly,
       main="Mental health data: RC(1) model")

Other forms of nonlinear terms are provided for the inverse of a predictor (Inv()) and the exponential of a predictor (Exp()). You should read vignette("gnmOverview", package="gnm") for further details.

Mosaic tips and techniques {#sec:tips}

The vcd package implements an extremely general collection of graphical methods for $n$-way frequency tables within the strucplot framework, which includes mosaic plots (mosaic()), association plots (assoc()), sieve diagrams (sieve()), and tabular displays (structable()).

The graphical methods in vcd support a wide range of options that control almost all of the details of the plots, but it is often difficult to determine from the help pages what arguments you need to supply to achieve a given effect. As a first step, you should read vignette("strucplot") in vcd to understand the overall structure of these plot methods. The notes below describe a few useful things that may not be obvious, or that can be done in different ways.

Changing the labels for variables and levels

With data in contingency table form or as a frequency data frame, it often happens that the variable names and/or the level values of the factors, while suitable for analysis, are less than adequate when used in mosaic plots and other strucplot displays.

For example, we might prefer that a variable named ses appear as "Socioeconomic Status", or a factor with levels c("M", "F") be labeled using c("Male", "Female") in a plot. Or, sometimes we start with a factor whose levels are fully spelled out (e.g., c("strongly disagree", "disagree", "neutral", "agree", "strongly agree")), only to find that the level labels overlap in graphic displays.

The strucplot framework in vcd provides an extremely large variety of functions and options for controlling almost all details of text labels in mosaics and other plots. See help(labelings) for an overview.

For example, in \secref{sec:ordered-factors} we showed how to rearrange the dimensions of the UCBAdmissions table, change the names of the table variables, and relabel the levels of one of the table variables. The code below changes the actual table for plotting purposes, but we pointed out that these changes can create other problems in analysis.

UCB <- aperm(UCBAdmissions, c(2, 1, 3))
names(dimnames(UCB)) <- c("Sex", "Admit?", "Department")
dimnames(UCB)[[2]] <- c("Yes", "No")

The same effects can be achieved without modifying the data using the set_varnames and set_labels options in mosaic() as follows:

vnames <- list(set_varnames = c(Admit="Admission", Gender="Sex", Dept="Department"))

lnames <- list(Admit = c("Yes", "No"),
              Gender = c("Males", "Females"),
              Dept = LETTERS[1:6])

mosaic(UCBAdmissions, labeling_args=vnames, set_labels=lnames)

In some cases, it may be sufficient to abbreviate (or clip, or rotate) level names to avoid overlap. For example, the statements below produce another version of \figref{fig:TV-mosaic} with days of the week abbreviated to their first three letters. Section 4 in the vignette("strucplot") provides many other examples.

dimnames(TV2)$Time <- c("8", "9", "10")     # re-level for mosaic display
mosaic(~ Day + Network + Time, data=TV2, expected=~Day:Time + Network, 
         legend=FALSE, gp=shading_Friendly, 
         labeling_args=list(abbreviate=c(Day=3)) )


Continuous predictors {#sec:contin}

When continuous predictors are available---and potentially important---in explaining a categorical outcome, models for that outcome include logistic regression (binary response), the proportional odds model (ordered polytomous response), and multinomial (generalized) logistic regression. Many of these are special cases of the generalized linear model using the "poisson" or "binomial" family and their relatives.

Spine and conditional density plots {#sec:spine}

I don't go into fitting such models here, but I would be remiss not to illustrate some visualizations in vcd that are helpful here. The first of these is the spine plot or spinogram [@vcd:Hummel:1996] (produced with spine()). These are special cases of mosaic plots with specific spacing and shading to show how a categorical response varies with a continuous or categorical predictor.

They are also a generalization of stacked bar plots, in which not the heights but the widths of the bars correspond to the relative frequencies of x. The heights of the bars then correspond to the conditional relative frequencies of y in each x group.

Example: For the Arthritis data, we can see how Improved varies with Age as follows. spine() takes a formula of the form y ~ x with a single dependent factor y and a single explanatory variable x (a numeric variable or a factor). The range of a numeric x is divided into intervals based on the breaks argument, and stacked bars are drawn to show the distribution of y as x varies. As shown below, the discretized table that is visualized is returned by the function.

(spine(Improved ~ Age, data = Arthritis, breaks = 3))
(spine(Improved ~ Age, data = Arthritis, breaks = "Scott"))

\setkeys{Gin}{width=0.49\textwidth}

(spine(Improved ~ Age, data = Arthritis, breaks = 3))
(spine(Improved ~ Age, data = Arthritis, breaks = "Scott"))

The conditional density plot [@vcd:Hofmann+Theus] is a further generalization. This visualization technique is similar to spinograms, but uses a smoothing approach rather than discretizing the explanatory variable. As well, it uses the original x axis and not a distorted one.

\setkeys{Gin}{width=0.6\textwidth} [htb]

cdplot(Improved ~ Age, data = Arthritis)
with(Arthritis, rug(jitter(Age), col="white", quiet=TRUE))

\caption{Conditional density plot for the Arthritis data showing the variation of Improved with Age.} {#fig:cd-plot}

In such plots, it is useful to also see the distribution of the observations across the horizontal axis, e.g., with a rug() plot. \figref{fig:cd-plot} uses cdplot() from the graphics package rather than cd_plot() from vcd, and is produced with

cdplot(Improved ~ Age, data = Arthritis)
with(Arthritis, rug(jitter(Age), col="white", quiet=TRUE))

From \figref{fig:cd-plot} it can be easily seen that the proportion of patients reporting Some or Marked improvement increases with Age, but there are some peculiar bumps in the distribution. These may be real or artifactual, but they would be hard to see with most other visualization methods. When we switch from non-parametric data exploration to parametric statistical models, such effects are easily missed.

Model-based plots: effect plots and ggplot2 plots {#sec:modelplots}

The nonparametric conditional density plot uses smoothing methods to convey the distributions of the response variable, but displays that are simpler to interpret can often be obtained by plotting the predicted response from a parametric model.

For complex glm() models with interaction effects, the effects package provides the most useful displays, plotting the predicted values for a given term, averaging over other predictors not included in that term. I don't illustrate this here, but see @effects:1,effects:2 and help(package="effects").

Here I just briefly illustrate the capabilities of the ggplot2 package for model-smoothed plots of categorical responses in glm() models.

Example: The Donner data frame in vcdExtra gives details on the survival of 90 members of the Donner party, a group of people who attempted to migrate to California in 1846. They were trapped by an early blizzard on the eastern side of the Sierra Nevada mountains, and before they could be rescued, nearly half of the party had died. What factors affected who lived and who died?

data(Donner, package="vcdExtra")
str(Donner)
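Before fitting any model, a quick cross-tabulation (an assumed exploratory step, not in the original text) shows the marginal relation of survival to sex:

```r
## Marginal table of survival (0 = died, 1 = survived) by sex
data(Donner, package = "vcdExtra")
xtabs(~ sex + survived, data = Donner)
```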

A model of interest is the logistic regression for $Pr(survived)$ as a function of age, allowing separate fits for males and females. The key here is the stat_smooth() function, with method = "glm" and method.args = list(family = binomial); formula = y ~ x specifies a linear fit on the logit scale (\figref{fig:donner3}, left).

# separate linear fits on age for M/F
ggplot(Donner, aes(age, survived, color = sex)) +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  stat_smooth(method = "glm", method.args = list(family = binomial), formula = y ~ x,
           alpha = 0.2, size=2, aes(fill = sex))

Alternatively, we can allow a quadratic relation with age by specifying formula = y ~ poly(x,2) (\figref{fig:donner3}, right).

# separate quadratics
ggplot(Donner, aes(age, survived, color = sex)) +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  stat_smooth(method = "glm", method.args = list(family = binomial), formula = y ~ poly(x,2),
           alpha = 0.2, size=2, aes(fill = sex))

\setkeys{Gin}{width=0.49\textwidth}

ggplot(Donner, aes(age, survived, color = sex)) +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  stat_smooth(method = "glm", method.args = list(family = binomial), formula = y ~ x,
           alpha = 0.2, size=2, aes(fill = sex))
# separate quadratics
ggplot(Donner, aes(age, survived, color = sex)) +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  stat_smooth(method = "glm", method.args = list(family = binomial), formula = y ~ poly(x,2),
           alpha = 0.2, size=2, aes(fill = sex))

\caption{Logistic regression plots for the Donner data showing survival vs. age, by sex. Left: linear logistic model; right: quadratic model.} {#fig:donner3}

These plots very nicely show (a) the fitted $Pr(survived)$ for males and females; (b) confidence bands around the fitted curves; and (c) the individual observations, as points jittered around 0 and 1 for those who died and survived, respectively.
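The smoothed curves in these plots correspond to models that can also be fit directly with glm(). Because stat_smooth() fits each sex separately, full-interaction models give the same fitted curves, and a likelihood-ratio test with anova() checks whether the quadratic term is warranted. A sketch (the model names are mine):

```r
## Fit the linear and quadratic logistic models directly.
## The age*sex interactions reproduce the separate per-sex fits in the figure.
data(Donner, package = "vcdExtra")
donner.lin  <- glm(survived ~ age * sex,          data = Donner, family = binomial)
donner.quad <- glm(survived ~ poly(age, 2) * sex, data = Donner, family = binomial)
anova(donner.lin, donner.quad, test = "Chisq")   # LR test of the quadratic terms
```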

References


