# Categorical Data Analysis {#cha-categorical-data-analysis}

##    IPSUR: Introduction to Probability and Statistics Using R
##    Copyright (C) 2018  G. Jay Kerns
#
##    Chapter: Categorical Data Analysis
#
##    This file is part of IPSUR.
#
##    IPSUR is free software: you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation, either version 3 of the License, or
##    (at your option) any later version.
#
##    IPSUR is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
#
##    You should have received a copy of the GNU General Public License
##    along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.
knitr::opts_chunk$set(echo = TRUE)
set.seed(42)
library(binom)
library(prob)
library(reshape)
library(vcd)
Dataset = structure(list(School = structure(c(3L, 2L, 1L, 4L, 3L, 2L, 1L, 
4L, 3L, 2L, 1L, 4L), .Label = c("Adequate", "Good", "Most desirable", 
"Undesirable"), class = "factor"), Rating = structure(c(2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("Average", 
"Outstanding", "Poor"), class = "factor"), Frequency = c(21, 
3, 14, 10, 20, 25, 8, 7, 4, 36, 2, 6)), row.names = c(NA, -12L
), .Names = c("School", "Rating", "Frequency"), class = "data.frame")

library(prob)
A = gen2wayTable(pmatrix = matrix(c(1,3,4,2), nrow = 2), 
                 addmargins = FALSE,
                 as.df = TRUE,
                 dmnames = list(gender = c("female","male"),
                                politics = c("dem", "rep")))
B = gen2wayTable(pmatrix = matrix(c(1,3,6,2,4,5), nrow = 2), 
                 addmargins = FALSE,
                 as.df = TRUE,
                 dmnames = list(gender = c("female","male"),
                                politics = c("dem", "ind", "rep")))
C = genIndepTable(prow = 1:2, pcol = 1:3, 
                 addmargins = FALSE,
                 as.df = TRUE,
                 dmnames = list(gender = c("female","male"),
                                politics = c("dem", "ind", "rep")))

## Inference for One Proportion

For this section we observe a simple random sample of size $n$ of successes and failures (that is, Bernoulli trials), and the parameter of interest is $\pi = \Pr(\text{Success})$ on any one trial.

### Asymptotic, or Wald interval (Laplace, 1812)

Let $y = \text{number of successes}$ and $n = \text{sample size}$, and set $\hat{\pi} = y/n$. The $100(1-\alpha)\%$ confidence interval for $\pi$ is
$$\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}.$$

library(binom)
binom.confint(2, 10, methods = "asymptotic")

This interval has poor performance unless $n$ is very large. The coverage probability often falls below $1 - \alpha$ and can fall very low when $\pi$ is near 0 or 1.
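
As a quick check of the formula, here is a minimal by-hand sketch of the Wald interval for the same data ($y = 2$ successes in $n = 10$ trials); it should reproduce the asymptotic interval from binom.confint above (the variable names are just for illustration).

y = 2; n = 10
pihat = y/n
z = qnorm(0.975)   # z_{alpha/2} for a 95% interval
pihat + c(-1, 1) * z * sqrt(pihat * (1 - pihat) / n)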

### Wilson, or Score interval (E. B. Wilson, 1927)

Again take $\hat{\pi} = y/n$. The confidence interval is
$$\left.\left\{\left(\hat{\pi}+\frac{z_{\alpha/2}^{2}}{2n}\right)\pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}+\frac{z_{\alpha/2}^{2}}{(2n)^{2}}}\right\}\right/ \left(1+\frac{z_{\alpha/2}^{2}}{n}\right).$$

binom.confint(2, 10, methods = "wilson")

This is also the confidence interval reported by the prop.test function when correct = FALSE (see below).

This interval outperforms the Wald interval, but has poor coverage (less than $1 - \alpha$) when $\pi$ is near 0 or 1.

### Agresti-Coull interval (1998)

Let $\tilde{y} = y + 0.5z_{\alpha/2}^2$ and $\tilde{n} = n + z_{\alpha/2}^2$, and set $\tilde{\pi} = \tilde{y}/\tilde{n}$. The confidence interval is
$$\tilde{\pi} \pm z_{\alpha/2}\sqrt{\frac{\tilde{\pi}(1 - \tilde{\pi})}{\tilde{n}}}.$$

binom.confint(2, 10, methods = "ac")

For a 95% interval, this is the famous "add two successes and two failures" rule of thumb that yields
$$\tilde{\pi} \approx \frac{y + 2}{n + 4}.$$
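
Here is a sketch computing the adjusted counts and interval directly from the formula above; $\tilde{y}$ and $\tilde{n}$ are roughly $y + 2$ and $n + 4$ for a 95% interval, and the result should agree with the ac interval from binom.confint (names are illustrative).

y = 2; n = 10
z = qnorm(0.975)
ytilde = y + z^2/2        # roughly y + 2
ntilde = n + z^2          # roughly n + 4
pitilde = ytilde/ntilde
pitilde + c(-1, 1) * z * sqrt(pitilde * (1 - pitilde) / ntilde)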

This interval performs much better than the Wald interval, and contains the Score interval. It does better than the Score interval when $\pi$ is near 0 or 1.

### Clopper-Pearson interval (1934)

The formula for this interval is difficult to write down compactly, but it comes from inverting the exact binomial test which is based on the binomial distribution.

binom.confint(2, 10, methods = "exact")

This interval always has coverage $\geq 1 - \alpha$, but it is often too conservative.

### Comparison

There are actually 11 different confidence intervals for a population proportion supported by the R package binom, and they all have their advantages and disadvantages. Below are all 11 intervals for an observed $y = 2$ successes in $n = 10$ independent trials.

binom.confint(2, 10, methods = "all")

### Hypothesis tests

You can always construct a confidence interval and check whether it covers the null hypothesized value, so the 11 intervals above give us 11 different hypothesis tests at our disposal. The confidence interval method does not generate a p-value, however. The prop.test function with correct = FALSE performs what is known as the score test and will generate a p-value.

prop.test(x = 2, n = 10, p = 0.5, correct = FALSE)

Notice that the confidence interval reported by prop.test when correct = FALSE matches the Wilson confidence interval reported above.
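
To use the confidence interval approach described above, we can check whether the hypothesized value $\pi = 0.5$ falls inside each of the 11 intervals; a small sketch (the column name excludes_half is just a label chosen here):

ci = binom.confint(2, 10, methods = "all")
# TRUE means the interval excludes 0.5, i.e., that method "rejects" H0: pi = 0.5
data.frame(method = ci$method, excludes_half = ci$lower > 0.5 | ci$upper < 0.5)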

## Two-way Contingency Tables

A two-way table looks like this:

xtabs(Frequency ~ School + Rating, data = Dataset)

There are two variables: a row variable and a column variable (hence the name). The numbers in the body of the table are counts of observations falling into the respective row/column combinations.

### Enter by hand

If you already have a two-way table (say, from a textbook or website) then you can enter it into the computer manually like this:

x = matrix(c(1,2,3,4,5,6), nrow = 2,
           dimnames = list(gender = c("female","male"),
                           politics = c("dem","ind","rep")))
x

Technically speaking this is not a table object but a matrix; it will suit our purposes, though.
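
If some later function insists on a genuine table object, the matrix can be converted with as.table; a one-line sketch reusing the x entered above:

xt = as.table(x)   # xt is now a table object with the same counts
class(xt)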

### From a data frame with xtabs

Most of the datasets from this chapter's exercises look like this after importing:

head(Dataset)

The important thing about this is that there are two columns which represent the categorical variables, and then a third column Frequency which gives the counts for each row/column combination. We make a two-way table from such a data frame like this:

x = xtabs(Frequency ~ School + Rating, data = Dataset)
x

Sometimes, though, the data frame does not have a Frequency column, and instead it has a bunch of repeated rows, one row for each observation falling in that row/column combination of the table, like this:

head(A)

(This is the case we usually have with data that we collect ourselves.) We still use xtabs to make the table, but we leave the left hand side of the formula specification blank:

xtabs(~ politics, data = A)
xtabs(~ gender, data = A)
xtabs(~ gender + politics, data = A)

### From a table with xtabs

In the examples below we start with a multi-way contingency table (3 or 4 dimensions) and construct a two-way table for analysis using the same xtabs function. We just need to remember that, when starting with a table, we put Freq on the left hand side of the formula specification. For example, using the built-in table HairEyeColor:

HairEyeColor

Now we can make a two-way table comparing, say, Hair and Eye.

xtabs(Freq ~ Hair + Eye, data = HairEyeColor)

Or we could do Sex and Eye.

xtabs(Freq ~ Sex + Eye, data = HairEyeColor)

### Visualization

The most basic visual display of a two-way table is with a bar graph, constructed via the barplot function. Note that in every two-way table you need a legend to denote the values of the second variable.

The following is a stacked bar graph.

x = xtabs(Freq ~ Sex + Eye, data = HairEyeColor)
barplot(x, legend = TRUE)

Next is a side-by-side bar graph.

barplot(x, beside = TRUE, legend = TRUE)

We can swap the row/column variables via the transpose function t().

barplot(t(x), beside = TRUE, legend = TRUE)

A mosaic plot is important and can be used for even more than two-way tables (three-way, multi-way...).

library(vcd)
mosaic(x)

If you just have a one-way table then you can make a bar graph and you do not need a legend.

barplot(xtabs(Freq ~ Hair, data = HairEyeColor))

Pie graphs are often a bad idea for reasons discussed in class, but if you are hell-bent on making one then use the pie function.

pie(xtabs(Freq ~ Hair, data = HairEyeColor))

### Stuff to do with tables

We can calculate sums in the margin with the addmargins function.

x = xtabs(Freq ~ Eye, data = HairEyeColor)
addmargins(x)

The prop.table function will convert a frequency table to a relative frequency table of proportions.

prop.table(x)

Now we will do the same things for a two-way table such as this one:

y = xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
y

We calculate row sums, column sums, both at once, and a table of proportions.

rowSums(y)
colSums(y)
addmargins(y)
prop.table(y)

We can add just the row sums and calculate the row conditional distributions:

addmargins(y, margin = 2)
prop.table(y, margin = 1)

Notice the rows sum to one in the above table. We can add just the column sums and calculate the column conditional distributions:

addmargins(y, margin = 1)
prop.table(y, margin = 2)

Notice the columns sum to one in this table.

### Reorder rows and/or columns

Original table:

z = xtabs(Freq ~ Sex + Hair, data = HairEyeColor)
z

Reorder rows:

z[c(2,1), ]

Reorder columns:

z[ , c(3,1,4,2)]

Reorder rows and columns:

z[c(2,1), c(3,1,4,2)]

This is useful for all sorts of things, and one of them is to improve visual displays by ordering categories in decreasing/increasing frequency, like this:

z = xtabs(Freq ~ Hair, data = HairEyeColor)
z

We can make improved pie/bar graphs like this (notice the side-by-side display):

par(mfrow = c(1,2))
pie(z[c(2,4,1,3)])
barplot(z[c(2,4,1,3)])
par(mfrow = c(1,1))

There, that's better.

### Untable

Maybe you start with a table (that you entered by hand, say), and you would like to have a data frame to analyze with the R Commander or any of our other aforementioned methods. First convert to a data frame:

Y = as.data.frame(y)
head(Y)

Then untable the data frame. The [, -3] and row.names parts are just there to make the end result look prettier.

library(reshape)
Ydf = untable(Y, Y$Freq)[ , -3]
row.names(Ydf) <- NULL
head(Ydf)

Now we can set Ydf as the Active Dataset in the R Commander and proceed as usual.
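
If the reshape package is not available, the same untabling can be sketched in base R by repeating each row of Y according to its Freq count (Ydf2 is just an illustrative name):

Ydf2 = Y[rep(seq_len(nrow(Y)), times = Y$Freq), -3]  # drop the Freq column
row.names(Ydf2) <- NULL
head(Ydf2)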

## Two Proportions

The model looks like this:

$$\begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\\ \text{Group 1} & \pi_{1} & 1 - \pi_{1} & 1 \\ \text{Group 2} & \pi_{2} & 1 - \pi_{2} & 1 \end{array}$$

We observe $n_{1}$ observations from Group 1 with $y_{1}$ successes, and we observe $n_{2}$ observations from Group 2 with $y_{2}$ successes, like this:

$$\begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\\ \text{Group 1} & y_{1} & n_{1} - y_{1} & n_{1}\\ \text{Group 2} & y_{2} & n_{2} - y_{2} & n_{2} \end{array}$$

We wish to test (for example, two-sided hypotheses)
$$H_{0}: \pi_{1} = \pi_{2}\text{ versus } H_{a}:\pi_{1}\neq \pi_{2}.$$

### Asymptotic test

Consider the following 2x2 table:

y = xtabs(Freq ~ Sex + Survived, data = Titanic)
y
addmargins(y, margin = 2)
prop.table(y, margin = 1)

We perform the hypothesis test with prop.test just like before with correct = FALSE.

prop.test(y, correct = FALSE)

There is a statistically significant difference between the probabilities of death for the two genders in the Titanic data.
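
As a check on the asymptotic test, here is a by-hand sketch of the pooled two-proportion $z$ statistic using the table y from above; its square should match the X-squared value that prop.test reports when correct = FALSE (the scratch names y1, n1, and so on are just for illustration).

y1 = y["Male", "No"];   n1 = sum(y["Male", ])
y2 = y["Female", "No"]; n2 = sum(y["Female", ])
p1 = y1/n1; p2 = y2/n2
pp = (y1 + y2)/(n1 + n2)                          # pooled proportion under H0
z = (p1 - p2)/sqrt(pp * (1 - pp) * (1/n1 + 1/n2))
z^2                                               # compare with X-squared above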

### Fisher's Exact Test

The lady in question (Dr. Muriel Bristol) claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. The test calculates the probability of her correctly identifying at least as many cups as she did by chance alone.

LadyTastingTea = matrix(c(4,0,0,4), nrow = 2,
                dimnames = list(pouredFirst = c("milk","tea"),
                guessedFirst = c("milk","tea")))
LadyTastingTea
fisher.test(LadyTastingTea, alternative = "greater")

These data provide evidence that Dr. Bristol really could tell which was added first.
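
Because she was given four cups of each kind and guessed four of each, the one-sided p-value can also be checked directly from the hypergeometric distribution; it should agree with the fisher.test p-value above.

dhyper(4, m = 4, n = 4, k = 4)   # chance of identifying all four milk-first cups correctly
1/choose(8, 4)                   # the same value: one favorable arrangement out of choose(8, 4)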

### McNemar Test for Matched Pairs

For this type of problem we have matched pairs of data; for example, we may ask a single respondent the same survey question twice, once before and once after a treatment of some kind. The hypotheses of interest are
$$H_{0}: \pi_{1+} = \pi_{+1}\text{ versus } H_{a}:\pi_{1+}\neq \pi_{+1},$$
which after some algebra is equivalent to the hypotheses
$$H_{0}: \pi_{12} = \pi_{21}\text{ versus } H_{a}:\pi_{12}\neq \pi_{21}.$$

The test statistic is
$$\chi^{2}= \frac{(n_{12}- n_{21})^{2}}{n_{12}+n_{21}},$$
and when the null hypothesis is true the statistic has an asymptotic chi-square distribution with one degree of freedom. There is a continuity correction proposed by Edwards[^ed] that looks like this:
$$\chi^{2}= \frac{\left(\vert n_{12}- n_{21}\vert - 1 \right)^2}{n_{12}+n_{21}}.$$

[^ed]: Edwards, A. (1948). "Note on the 'correction for continuity' in testing the significance of the difference between correlated proportions." Psychometrika, 13, 185–187. doi:10.1007/bf02289261

The following example data, from Agresti, concern a panel of 1600 voting-age British citizens. They were asked in a survey whether they approved or disapproved of the Prime Minister's performance, and then those same 1600 people were asked again six months later.

Performance <- matrix(c(794, 86, 150, 570), nrow = 2,
  dimnames = list("1st Survey" = c("Approve", "Disapprove"),
                 "2nd Survey" = c("Approve", "Disapprove")))
Performance

We do the test like this:

mcnemar.test(Performance)

Here the proportion who approve after six months is significantly different from the proportion who approved before.
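
As a check, here is the continuity-corrected statistic from the formula above computed by hand from the Performance table; it should agree with the value reported by mcnemar.test, which applies Edwards' correction by default (n12 and n21 are scratch names).

n12 = Performance["Approve", "Disapprove"]    # approve then disapprove
n21 = Performance["Disapprove", "Approve"]    # disapprove then approve
stat = (abs(n12 - n21) - 1)^2 / (n12 + n21)   # Edwards-corrected chi-square
stat
pchisq(stat, df = 1, lower.tail = FALSE)      # p-value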

## Chi-square Tests

All chi-square tests are based on a test statistic $\chi^{2}$ which takes the general form
$$\chi^2 = \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}},$$
where $O_{i}$ is an observed count (an entry in the table) and $E_{i}$ is an expected count, that is, the count we would expect if the null hypothesis were true. When $H_{0}$ is true the test statistic has an (asymptotic) chi-square distribution with some degrees of freedom (which depend on the specific hypotheses being tested).

### Chi-square goodness of fit

For this class of problems we have a univariate categorical variable (like gender or politics) and a hypothesized model for the probabilities of the different categories. We would like to see how well the model fits the data.

Suppose the variable $X$ has $k$ categories, and we will write our hypothesized model this way:
$$\Pr(X = \text{category }i) = \pi_{i},\ \text{for }i=1,2,\ldots,k.$$
We observe a random sample of size $n$. Here $O_{i}$ is the number of observations in category $i$, so that $O_{1}+O_{2}+\cdots+O_{k} = n$. If the model truly holds then we would expect $E_{i}= n\pi_{i}$ observations in category $i$. The test statistic is
$$\chi^2 = \sum_{i = 1}^k \frac{(O_{i} - E_{i})^2}{E_{i}},$$
and when $H_{0}$ is true, the test statistic has a chi-square distribution with $k - 1$ degrees of freedom.

To do the test with R we enter the observed counts in a vector observed (could also be a one-way table) and the hypothesized probabilities in a vector probs:

observed = c(160, 20, 10, 10)
probs = c(0.50, 0.25, 0.10, 0.15)

Then perform the test this way:

chisq.test(observed, p = probs)

In this case we reject the null hypothesis; the theoretical model does not fit these data well at all.
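
To see where the statistic comes from, here is a by-hand sketch of the expected counts and the $\chi^2$ value, reusing observed and probs from above; it should match the chisq.test output.

n = sum(observed)
expected = n * probs                            # E_i = n * pi_i
chisq = sum((observed - expected)^2 / expected)
chisq
pchisq(chisq, df = length(observed) - 1, lower.tail = FALSE)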

### Chi-square tests of Independence and Homogeneity

The model is

$$\begin{array}{cccccc} & \text{Response 1} & \text{Response 2} & \cdots & \text{Response }c & \text{Total}\\ \text{Group 1} & \pi_{11} & \pi_{12} & \cdots & \pi_{1c} & \pi_{1+} \\ \text{Group 2} & \pi_{21} & \pi_{22} & \cdots & \pi_{2c} & \pi_{2+} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \text{Group }r & \pi_{r1} & \pi_{r2} & \cdots & \pi_{rc} & \pi_{r+} \\ \text{Total} & \pi_{+1} & \pi_{+2} & \cdots & \pi_{+c} & 1 \end{array}$$

We observe $n_{ij}$ observations from Group $i$ with Response $j$, with $n$ being the total sample size, like this:

$$\begin{array}{cccccc} & \text{Response 1} & \text{Response 2} & \cdots & \text{Response }c & \text{Total}\\ \text{Group 1} & n_{11} & n_{12} & \cdots & n_{1c} & n_{1+} \\ \text{Group 2} & n_{21} & n_{22} & \cdots & n_{2c} & n_{2+} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \text{Group }r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r+} \\ \text{Total} & n_{+1} & n_{+2} & \cdots & n_{+c} & n \end{array}$$

Here, the test statistic takes the general form
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
but the hypotheses are different, so the expected counts are calculated differently, and the degrees of freedom are now $(r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.

The conditions to use this test are that the observations are independent and that the expected counts are not too small (a common rule of thumb asks for all $E_{ij} \geq 5$).

If expected counts are too small then we can combine sparse categories, or use an exact or simulation-based test instead (for example, Fisher's Exact Test, or chisq.test with simulate.p.value = TRUE).

#### Independence

Here we observe a single random sample and we place observations in the table depending on their classification status. The hypotheses of interest are
$$H_{0}: \pi_{ij} = \pi_{i+}\pi_{+j}\text{ versus } H_{a}:\pi_{ij}\neq \pi_{i+}\pi_{+j},$$
for $i=1,\ldots,r$ and $j=1,\ldots,c$. The expected counts are
$$E_{ij} = n\hat{\pi}_{i+}\hat{\pi}_{+j} = n \frac{n_{i+}}{n}\frac{n_{+j}}{n} = \frac{n_{i+}n_{+j}}{n}.$$
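
As an illustration of the expected count formula, here is a sketch using the Hair and Eye table that reappears below; the matrix E (an illustrative name) should match the expected component returned by chisq.test.

y = xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
E = outer(rowSums(y), colSums(y)) / sum(y)    # E_ij = n_{i+} n_{+j} / n
E
chisq.test(y)$expected                        # should be the same matrix
sum((y - E)^2 / E)                            # the chi-square statistic by hand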

#### Homogeneity

Here we have a collection of $r$ subpopulations and we observe simple random samples from each subpopulation (it does not matter whether rows or columns represent the different subpopulations). The hypotheses of interest are now
$$H_{0}: \text{the row distributions } (\pi_{i1}, \pi_{i2},\ldots,\pi_{ic}) \text{ are all the same,}$$
for $i=1,\ldots,r$, versus
$$H_{a}: \text{at least one row is different from the others.}$$
The expected counts are calculated with the same formula, and the test is conducted the same way, but the interpretation is different.

### How to do the tests

We have our data from above:

x = xtabs(Freq ~ Sex + Eye, data = HairEyeColor)
x

We can compare the row distributions with the following sequence of steps.

addmargins(x, margin = 2)
prop.table(x, margin = 1)

The row distributions look similar to each other. Now we compare the column distributions.

addmargins(x, margin = 1)
prop.table(x, margin = 2)

Again, the column distributions look very similar. We expect the chi-square test not to be significant.

chisq.test(x)

Here is a visual display of the same.

mosaic(x, shade = TRUE)
assoc(x, shade = TRUE)

Now let us see if hair color is related to eye color. Here are the data.

y = xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
addmargins(y, margin = 2)
prop.table(y, margin = 1)

The row distributions are not similar at all (maybe brown and red, but not the other rows).

addmargins(y, margin = 1)
prop.table(y, margin = 2)

Neither are the column distributions. We expect the chi-square statistic to be significant.

chisq.test(y)

And here is a visual display of the same. Blue means a higher observed frequency than would be expected under independence; red means a lower than expected frequency.

mosaic(y, shade = TRUE)
assoc(y, shade = TRUE)

## Odds and Odds Ratios

### Odds of an event

The odds in favor of an event $A$ are
$$\text{Odds in favor of event }A = \frac{\Pr(A)}{1 - \Pr(A)}.$$
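
For example, if $\Pr(A) = 0.75$ then the odds in favor of $A$ are $0.75/0.25 = 3$, or 3 to 1:

p = 0.75
p / (1 - p)   # odds in favor of A, here 3 (that is, 3 to 1)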

### Odds ratio

Here we have an underlying model that looks like this:
$$\begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\\ \text{Group 1} & \pi_{11} & \pi_{12} & \pi_{1+}\\ \text{Group 2} & \pi_{21} & \pi_{22} & \pi_{2+}\\ \text{Total} & \pi_{+1} & \pi_{+2} & 1 \end{array}$$

The odds ratio associated with this model is
$$\psi = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{21}\pi_{12}},$$
and it represents the odds of Success for Group 1 divided by the odds of Success for Group 2. We observe a sample and construct a sample contingency table, which we write like this:
$$\begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\\ \text{Group 1} & n_{11} & n_{12} & n_{1+}\\ \text{Group 2} & n_{21} & n_{22} & n_{2+}\\ \text{Total} & n_{+1} & n_{+2} & n \end{array}$$
Then the sample odds ratio (we usually just call it the odds ratio) is
$$\hat{\psi} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{21}n_{12}}.$$

We can compute the odds ratio by hand, or use the oddsratio function in the vcd package. Here is an example table:

x = xtabs(Freq ~ Sex + Survived, data = Titanic)
x
library(vcd)
oddsratio(x, log = FALSE)

This means that, in the Titanic disaster, the odds in favor of death for males were more than 10 times the odds for females.
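
A by-hand sketch of the sample odds ratio from the 2 x 2 counts in x above; it should agree closely with the value from oddsratio (vcd may apply a small-sample adjustment, so a tiny difference is possible).

(x["Male", "No"] * x["Female", "Yes"]) / (x["Male", "Yes"] * x["Female", "No"])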

We can make a visual display with a fourfold plot.

fourfold(x)

### Asymptotic distribution of $\ln(\hat{\psi})$

The distribution of $\ln(\hat{\psi})$ is asymptotically normal:
$$\ln(\hat{\psi}) \overset{\cdot}{\sim} N\left( \ln(\psi),\, \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}\right),$$
for $n$ sufficiently large.

This information can be used to construct a large-sample confidence interval for $\psi$.

confint(oddsratio(x, log = FALSE), level = 0.95)
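
Here is a sketch of that large-sample interval computed directly from the formula: take $\ln\hat{\psi} \pm z_{\alpha/2}$ standard errors and exponentiate the endpoints. It should be close to the interval reported by confint above (psihat and se are scratch names; x is the table from above).

psihat = (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
se = sqrt(sum(1/x))                              # sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
exp(log(psihat) + c(-1, 1) * qnorm(0.975) * se)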

## Cochran-Mantel-Haenszel (CMH) Test Statistic

For this class of problems we have a binary predictor $X$ (like treatment/placebo) and a categorical response $Y$ (like success/failure), but we also have another categorical variable $Z$ (like democrat, republican, independent...) that we would like to control for. We assume that the odds ratios between $X$ and $Y$ are the same at each level of $Z$, and we would like to know whether that common odds ratio $\psi$ differs from 1.

The CMH test statistic is
$$\text{CMH} = \frac{\left[ \sum_{k}\left(n_{11k} - \frac{n_{1+k}n_{+1k}}{n_{++k}} \right)\right]^{2}}{\sum_{k} \frac{n_{1+k}n_{2+k}n_{+1k}n_{+2k}}{n_{++k}^{2}(n_{++k} - 1)}}.$$

The test statistic has an asymptotic chi-square distribution with one degree of freedom when the common odds ratio $\psi = 1$.

Mantel and Haenszel (1959) proposed a continuity correction in their original paper; if it is used, the p-value better approximates an exact conditional test, but it tends to be conservative (Agresti).

Note that if there is evidence that the odds ratio is different for the different levels of $Z$, then the CMH test is not appropriate. We can check for evidence of this with a fourfold plot and/or the woolf_test, both from package vcd.

The following data are from the Titanic tragedy.

x = xtabs(Freq ~ Sex + Age + Class, data = Titanic)
ftable(x)

We check for similarity of odds ratios. Here are the different odds ratios:

oddsratio(x, log = FALSE)

The estimates vary in some cases, but let us look to see if that variation is significant.

library(vcd)
woolf_test(x)

No problems there, and here is a graphical display.

fourfold(x)

The graph looks good. We go on to perform the CMH test.

mantelhaen.test(x)

The common odds ratio of Sex versus Age, controlling for Class, is significantly different from one (it is estimated to be approximately `r mantelhaen.test(x)$estimate`). We estimate the odds that a male is a child to be approximately half the odds that a female is a child, consistently across social classes.
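
As a check on the formula above, here is a by-hand sketch of the CMH statistic, using the 2 x 2 x 4 table x built earlier (parts and CMH are scratch names). It omits the continuity correction that mantelhaen.test applies by default to 2 x 2 x K tables, so the value should be close to, but not exactly equal to, the one reported above.

parts = apply(x, 3, function(tab) {
  n = sum(tab)
  c(diff = tab[1, 1] - sum(tab[1, ]) * sum(tab[, 1]) / n,             # n11k - n1+k n+1k / n++k
    var  = prod(rowSums(tab)) * prod(colSums(tab)) / (n^2 * (n - 1))) # stratum variance term
})
CMH = sum(parts["diff", ])^2 / sum(parts["var", ])
CMH
pchisq(CMH, df = 1, lower.tail = FALSE)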


