## IPSUR: Introduction to Probability and Statistics Using R ## Copyright (C) 2018 G. Jay Kerns # ## Chapter: Categorical Data Analysis # ## This file is part of IPSUR. # ## IPSUR is free software: you can redistribute it and/or modify ## it under the terms of the GNU General Public License as published by ## the Free Software Foundation, either version 3 of the License, or ## (at your option) any later version. # ## IPSUR is distributed in the hope that it will be useful, ## but WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ## GNU General Public License for more details. # ## You should have received a copy of the GNU General Public License ## along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
knitr::opts_chunk$set(echo = TRUE) set.seed(42) library(binom) library(prob) library(reshape) library(vcd)
Dataset = structure(list(School = structure(c(3L, 2L, 1L, 4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L, 4L), .Label = c("Adequate", "Good", "Most desirable", "Undesirable"), class = "factor"), Rating = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("Average", "Outstanding", "Poor"), class = "factor"), Frequency = c(21, 3, 14, 10, 20, 25, 8, 7, 4, 36, 2, 6)), row.names = c(NA, -12L ), .Names = c("School", "Rating", "Frequency"), class = "data.frame") library(prob) A = gen2wayTable(pmatrix = matrix(c(1,3,4,2), nrow = 2), addmargins = FALSE, as.df = TRUE, dmnames = list(gender = c("female","male"), politics = c("dem", "rep"))) B = gen2wayTable(pmatrix = matrix(c(1,3,6,2,4,5), nrow = 2), addmargins = FALSE, as.df = TRUE, dmnames = list(gender = c("female","male"), politics = c("dem", "ind", "rep"))) C = genIndepTable(prow = 1:2, pcol = 1:3, addmargins = FALSE, as.df = TRUE, dmnames = list(gender = c("female","male"), politics = c("dem", "ind", "rep")))
For this section we observe a simple random sample of size (n) of successes and failures (that is, Bernoulli trials), and the parameter of interest is (\pi = \Pr(\text{Success})) on any one trial.
Let (y = \text{number of successes}) and (n = \text{sample size}), and set (\hat{\pi}= y/n). The (100(1-\alpha)\%) confidence interval for (\pi) is [ \hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}. ]
library(binom) binom.confint(2, 10, methods = "asymptotic")
This interval has poor performance unless (n) is very large. The coverage probability often falls below (1 - \alpha) and can fall very low when (\pi) is near 0 or 1.
Again take (\hat{\pi}= y/n). The confidence interval is [ \left.\left{\left(\hat{\pi}+\frac{z_{\alpha/2}^{2}}{2n}\right)\pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}+\frac{z_{\alpha/2}^{2}}{(2n)^{2}}}\right}\right/ \left(1+\frac{z_{\alpha/2}^{2}}{n}\right). ]
binom.confint(2, 10, methods = "wilson")
This is also the confidence interval reported from the prop.test
function when correct = FALSE
(see below).
This interval outperforms the Wald interval, but has poor coverage (less than (1 - \alpha)) when (\pi) is near 0 or 1.
Let (\tilde{y} = y + 0.5z_{\alpha/2}^2) and (\tilde{n} = n + z_{\alpha/2}^2), and set (\tilde{\pi}= \tilde{y}/\tilde{n}). The confidence interval is [ \tilde{\pi} \pm z_{\alpha/2}\sqrt{\frac{\tilde{\pi}(1 - \tilde{\pi})}{n}}. ]
binom.confint(2, 10, methods = "ac")
For a 95% interval, this is the famous "add two successes and two failures" rule of thumb that yields [ \tilde{\pi} \approx \frac{y + 2}{n + 4}. ]
This interval performs much better than the Wald interval, and contains the Score interval. It does better than the Score interval when (\pi) is near 0 or 1.
The formula for this interval is difficult to write down compactly, but it comes from inverting the exact binomial test which is based on the binomial distribution.
binom.confint(2, 10, methods = "exact")
This interval always has coverage (\geq 1 - \alpha), but it is often too conservative.
There are actually 11 different confidence intervals for a population proportion that R supports in package binom
, and they all have their advantages and disadvantages. Below are all 11 intervals for an observed y = 2
successes in n = 10
independent trials.
binom.confint(2, 10, methods = "all")
You can always construct a confidence interval and see if it covers the null hypothesized value, so that means using the above 11
intervals there are now 11
different hypothesis tests at our disposal. But the confidence interval method does not generate a p-value. The prop.test
function with correct = FALSE
is also known as the "Score Test" and will generate a p-value.
prop.test(x = 2, n = 10, p = 0.5, correct = FALSE)
Notice that the confidence interval reported by prop.test
when correct = FALSE
matches the wilson
confidence interval reported above.
A two-way table looks like this:
xtabs(Frequency ~ School + Rating, data = Dataset)
There are two variables: a row variable and a column variable (hence the name). The numbers in the body of the table are counts of observations falling into the respective row/column combinations.
If you already have a two-way table (say, from a textbook or website) then you can enter it into the computer manually like this:
x = matrix(c(1,2,3,4,5,6), nrow = 2, dimnames = list(gender = c("female","male"), politics = c("dem","ind","rep"))) x
Technically speaking this is not a table
object, it is a matrix
, but it will suit our purposes.
xtabs
Most of the datasets from this chapter's exercises look like this after importing:
head(Dataset)
The important thing about this is that there are two columns which represent the categorical variables, and then a third column Frequency
which gives the counts for each row/column combination. We make a two-way table from such a data frame like this:
x = xtabs(Frequency ~ School + Rating, data = Dataset) x
Sometimes, though, the data frame does not have a Frequency
column, and instead it has a bunch of repeated rows, one row for each observation falling in that row/column combination of the table, like this:
head(A)
(This is the case we usually have in data that we collect ourselves). We still use xtabs
to make the table, but we leave the left hand side of the formula specification blank:
xtabs(~ politics, data = A) xtabs(~ gender, data = A) xtabs(~ gender + politics, data = A)
xtabs
In the examples below we start with a multi-way contingency table (3 or 4 dimensions) and we construct a two-way table for analysis using the same xtabs
function. We just need to remember that when starting with a table we need to add a Freq
to the left hand side of the formula specification, like this, for example using the built-in table HairEyeColor
:
HairEyeColor
Now we can make a two-way table comparing, say, Hair
and Eye
.
xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
Or we could do Sex
and Eye
.
xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
The most basic visual display of a two-way table is with a bar graph, constructed via the barplot
function. Note that in every two-way table you need a legend
to denote the values of the second variable.
The following is a stacked bar graph.
x = xtabs(Freq ~ Sex + Eye, data = HairEyeColor) barplot(x, legend = TRUE)
Next is a side-by-side bar graph.
barplot(x, beside = TRUE, legend = TRUE)
We can swap the row/column variables via the transpose function t()
.
barplot(t(x), beside = TRUE, legend = TRUE)
A mosaic
plot is important and can be used for even more than two-way tables (three-way, multi-way...).
library(vcd) mosaic(x)
If you just have a one-way table then you can make a bar graph and you do not need a legend
.
barplot(xtabs(Freq ~ Hair, data = HairEyeColor))
Pie graphs are often a bad idea for reasons discussed in class, but if you are hellbent on making one then use the pie
function.
pie(xtabs(Freq ~ Hair, data = HairEyeColor))
We can calculate sums in the margin with the addmargins
function.
x = xtabs(Freq ~ Eye, data = HairEyeColor) addmargins(x)
The prop.table
function will convert a frequency table to a relative frequency table of proportions.
prop.table(x)
Now we will do the same things for a two-way table such as this one:
y = xtabs(Freq ~ Hair + Eye, data = HairEyeColor) y
We calculate row sums, column sums, both at once, and a table of proportions.
rowSums(y) colSums(y) addmargins(y) prop.table(y)
We can add just the row sums and calculate the row conditional distributions:
addmargins(y, margin = 2) prop.table(y, margin = 1)
Notice the rows sum to one in the above table. We can add just the column sums and calculate the column conditional distributions:
addmargins(y, margin = 1) prop.table(y, margin = 2)
Notice the columns sum to one in this table.
Original table:
z = xtabs(Freq ~ Sex + Hair, data = HairEyeColor) z
Reorder rows:
z[c(2,1), ]
Reorder columns:
z[ , c(3,1,4,2)]
Reorder rows and columns:
z[c(2,1), c(3,1,4,2)]
This is useful for all sorts of things, and one of them is to improve visual displays by ordering categories in decreasing/increasing frequency, like this:
z = xtabs(Freq ~ Hair, data = HairEyeColor) z
We can make improved pie/bar graphs like this (notice the side-by-side display):
par(mfrow = c(1,2)) pie(z[c(2,4,1,3)]) barplot(z[c(2,4,1,3)]) par(mfrow = c(1,1))
There, that's better.
Maybe you start with a table (that you entered by hand, say), and you would like to have a data frame to analyze with the R Commander or any of our other aforementioned methods. First convert to a data frame:
Y = as.data.frame(y) head(Y)
Then untable
the data frame. The [, -3]
and row.names
parts are just there to make the end result look more pretty.
library(reshape) Ydf = untable(Y, Y$Freq)[ , -3] row.names(Ydf) <- NULL head(Ydf)
Now we can set Ydf
as the Active Dataset in the R Commander and proceed as usual.
The model looks like this:
[ \begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\ \text{Group 1} & \pi_{1} & 1 - \pi_{1} & 1 \ \text{Group 2} & \pi_{2} & 1 - \pi_{2} & 1 \end{array} ]
We observe $n_{1}$ observations from Group 1 with $y_{1}$ successes, and we observe $n_{2}$ observations from Group 2 with $y_{2}$ successes, like this:
[ \begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\ \text{Group 1} & y_{1} & n_{1} - y_{1} & n_{1}\ \text{Group 2} & y_{2} & n_{2} - y_{2} & n_{2} \end{array} ]
We wish to test (for example, two-sided hypotheses): [ H_{0}: \pi_{1} = \pi_{2}\text{ versus } H_{a}:\pi_{1}\neq \pi_{2}. ]
Consider the following 2x2 table:
y = xtabs(Freq ~ Sex + Survived, data = Titanic) y
addmargins(y, margin = 2)
prop.table(y, margin = 1)
We perform the hypothesis test with prop.test
just like before with correct = FALSE
.
prop.test(y, correct = FALSE)
There is a statistically significant difference between the probability of death for the two genders in the Titanic
data.
The lady in question (Dr. Muriel Bristol) claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. This test calculates the probability of her getting the specific number of cups she identified correct by chance alone.
LadyTastingTea = matrix(c(4,0,0,4), nrow = 2, dimnames = list(pouredFirst = c("milk","tea"), guessedFirst = c("milk","tea"))) LadyTastingTea
fisher.test(LadyTastingTea, alternative = "greater")
These data provide evidence that Dr. Bristol really could tell which was added first.
For this type of problem we have matched pairs of data, for example, we may ask a single respondent on the same survey twice, one before and one after a treatment of some kind. The hypotheses of interest are [ H_{0}: \pi_{1+} = \pi_{+1}\text{ versus } H_{a}:\pi_{1+}\neq \pi_{+1} ] which after some algebra is equivalent to the hypotheses [ H_{0}: \pi_{12} = \pi_{21}\text{ versus } H_{a}:\pi_{12}\neq \pi_{21}. ]
The test statistic is [ \chi^{2}= \frac{n_{12}- n_{21}}{n_{12}+n_{21}}, ] and when the null hypothesis is true the statistic has a chi-square distribution with one degree of freedom. There is a continuity correction proposed by Edwards[^ed] that looks like this: [ \chi^{2}= \frac{\left(\vert n_{12}- n_{21}\vert - 1 \right)^2}{n_{12}+n_{21}}, ]
[^ed]: Edwards, A (1948). "Note on the "correction for continuity" in testing the significance of the difference between correlated proportions". Psychometrika. 13: 185–187. doi:10.1007/bf02289261
The following example data from Agresti were about a panel of 1600 voting-age British citizens. They were given a survey about whether they approved or disapproved of the Prime Minister's performance, and then those same 1600 people were asked again six months later.
Performance <- matrix(c(794, 86, 150, 570), nrow = 2, dimnames = list("1st Survey" = c("Approve", "Disapprove"), "2nd Survey" = c("Approve", "Disapprove"))) Performance
We do the test like this:
mcnemar.test(Performance)
Here the proportion of those that approve after six months is significantly different from the proportion that approved before.
All chi-square tests are based on a test statistic (\chi^{2}) which takes the general form [ \chi^2 = \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}, ] where (O_{i}) is an observed count (an entry in the table) and (E_{i}) is an expected value, that is, the value we would expect if the null hypothesis is true. When (H_{0}) is true the test statistic has an (asymptotic) chi-square distribution with some degrees of freedom (which depends on the specific hypotheses being tested).
For this class of problems we have a univariate categorical variable (like gender
or politics
) and a hypothesized model for the probabilities of the different categories. We would like to see how well the model fits the data.
Suppose the variable $X$ has $k$ categories, and we will write our hypothesized model this way: [ \Pr(X = \text{category }i) = \pi_{i},\ \text{for }i=1,2,\ldots,k. ] We observe a random sample of size (n). Here (O_{i}) is the number of observations in category (i), so that (O_{1}+O_{2}+\cdots+O_{k} = n). If the model truly holds then we would expect (E_{i}= n\pi_{i}) observations in category (i). The test statistic is [ \chi^2 = \sum_{i = 1}^k \frac{(O_{i} - E_{i})^2}{E_{i}} ] and when (H_{0}) is true, the test statistic has a chi-square distribution with $k - 1$ degrees of freedom.
To do the test with R we enter the observed counts in a vector observed
(could also be a one-way table) and the hypothesized probabilities in a vector probs
:
observed = c(160, 20, 10, 10) probs = c(0.50, 0.25, 0.10, 0.15)
Then perform the test this way:
chisq.test(observed, p = probs)
In this case we reject the null hypothesis and the theoretical model does not fit these data very well (at all).
The model is
[ \begin{array}{cccccc} & \text{Response 1} & \text{Response 2} & & \text{Response }c & \text{Total}\ \text{Group 1} & \pi_{11} & \pi_{12} & \cdots & \pi_{1c} & \pi_{1+} \ \text{Group 2} & \pi_{21} & \pi_{22} & \cdots & \pi_{2c} & \pi_{2+} \ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \ \text{Group }r & \pi_{r1} & \pi_{r2} & \cdots & \pi_{rc} & \pi_{r+} \ \text{Total} & \pi_{+1} & \pi_{+2} & \cdots & \pi_{+c} & 1 \ \end{array} ]
We observe $n_{ij}$ observations from Group $i$ with Response $j$, with $n$ being the total sample size, like this:
[ \begin{array}{cccccc} & \text{Response 1} & \text{Response 2} & & \text{Response }c & \text{Total}\ \text{Group 1} & n_{11} & n_{12} & \cdots & n_{1c} & n_{1+} \ \text{Group 2} & n_{21} & n_{22} & \cdots & n_{2c} & n_{2+} \ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \ \text{Group }r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r+} \ \text{Total} & n_{+1} & n_{+2} & \cdots & n_{+c} & n \ \end{array} ]
Here, the test statistic takes the general form [ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, ] but the hypotheses are different so the expected counts are calculated differently and the degrees of freedom are now ((r - 1)(c - 1)) where (r) is the number of rows and (c) is the number of columns.
The conditions to use this test are
If expected counts are too small then we can
Here we observe a single random sample and we place observations in the table depending on their classification status. The hypotheses of interest are [ H_{0}: \pi_{ij} = \pi_{i+}\pi_{+j}\text{ versus } H_{a}:\pi_{ij}\neq \pi_{i+}\pi_{+j}, ] for (i=1,\ldots,r) and (j=1,\ldots,c). The expected counts are [ E_{ij} = n\hat{\pi}{i+}\hat{\pi}{+j} = n \frac{n_{i+}}{n}\frac{n_{+j}}{n} = \frac{n_{i+}n_{+j}}{n}. ]
Here we have a collection of $r$ subpopulations and we observe simple random samples from each subpopulation (it does not matter whether rows or columns represent the different subpopulations). The hypotheses of interest to us now are are [ H_{0}: \text{the row distributions } (\pi_{i1}, \pi_{i2},\ldots\pi_{ic}) \text{ are all the same,} ] for (i=1,\ldots,r), versus [ H_{a}: \text{at least one row is different from the others.} ] The expected counts are calculated with the same formula, and the test is conducted the same way, but the interpretation is different.
We have our data from above
x = xtabs(Freq ~ Sex + Eye, data = HairEyeColor) x
We can compare the row distributions with the following sequence of steps.
addmargins(x, margin = 2) prop.table(x, margin = 1)
The row distributions look similar to each other. Now we compare the column distributions.
addmargins(x, margin = 1) prop.table(x, margin = 2)
Again, the column distributions look very similar. We expect the chi-square test not to be significant.
chisq.test(x)
Here is a visual display of same.
mosaic(x, shade = TRUE)
assoc(x, shade = TRUE)
Now let us see if hair color is related to eye color. Here are the data.
y = xtabs(Freq ~ Hair + Eye, data = HairEyeColor) addmargins(y, margin = 2) prop.table(y, margin = 1)
The row distributions are not similar at all (maybe brown and red, but not the other rows).
addmargins(y, margin = 1) prop.table(y, margin = 2)
Neither are the column distributions. We expect the chi-square statistic to be significant.
chisq.test(y)
And here is a visual display of the same. Blue means higher observed frequency than would be expected under independence, red means lower than expected frequency.
mosaic(y, shade = TRUE)
assoc(y, shade = TRUE)
Odds in favor of (A) is [ \text{Odds in favor of event }A = \frac{\Pr(A)}{1 - \Pr(A)} ]
Here we have an underlying model that looks like this: [ \begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\ \text{Group 1} & \pi_{11} & \pi_{12} & \pi_{1+}\ \text{Group 2} & \pi_{21} & \pi_{22} & \pi_{2+}\ \text{Total} & \pi_{+1} & \pi_{+2} & 1 \end{array} ]
The odds ratio associated with this model is [ \psi = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{21}\pi_{12}}, ] and it represents the odds of Success for Group 1 divided by the odds of Success for Group 2. We observe a sample and construct a sample contingency table which we write like this: [ \begin{array}{cccc} & \text{Success} & \text{Failure} & \text{Total}\ \text{Group 1} & n_{11} & n_{12} & n_{1+}\ \text{Group 2} & n_{21} & n_{22} & n_{2+}\ \text{Total} & n_{+1} & n_{+2} & n \end{array} ] Then the sample odds ratio (we usually just call it the odds ratio) is [ \hat{\psi} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{21}n_{12}}. ]
We can compute the odds ratio by hand, or use the oddsratio
function in the vcd
package. Here is an example table:
x = xtabs(Freq ~ Sex + Survived, data = Titanic) x
library(vcd) oddsratio(x, log = FALSE)
This means that the odds in favor of death were over 10 times higher for males than the odds for females (for the Titanic disaster).
We can make a visual display with a fourfold
plot.
fourfold(x)
The distribution of (\ln(\hat{\psi})) is asymptotically normal. [ \ln(\hat{\psi}) \overset{\cdot}{\sim} N\left( \ln(\psi),\, \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}\right), ] for $n$ sufficiently large.
This information can be used to construct a large sample confidence interval for (\psi).
confint(oddsratio(x, log = FALSE), level = 0.95)
For this class of problems we have a binary predictor $X$ (like treatment/placebo) and a categorical response $Y$ (like success/failure) but we also have another categorical variable $Z$ (like democrat, republican, independent...) against which we would like to control. We assume that the odds ratios between $X$ and $Y$ are the same at each level of $Z$, and we would like to know whether that common odds ratio (\psi \neq 1).
The CMH test statistic is [ \text{CMH} = \frac{\left[ \sum_{k}\left(n_{11k} - \frac{n_{1+k}n_{+1k}}{n_{++k}} \right)\right]^{2}}{\sum_{k} \frac{n_{1+k}n_{2+k}n_{+1k}n_{+2k}}{n_{++k}^{2}(n_{++k} - 1)}}. ]
The test statistic has an asymptotic chi-square distribution with one degree of freedom when the common odds ratio (\psi = 1).
MH (1959) proposed a continuity correction in their original paper, and if used, the p-value better approximates an exact conditional test, but it tends to be conservative (Agresti).
Note that if there is evidence that the odds ratio is different for the different levels of $Z$, then the CMH test is not appropriate. We can check for evidence of this with a fourfold
plot and/or the woolf_test
, both from package vcd
.
The following data are from the Titanic tragedy.
x = xtabs(Freq ~ Sex + Age + Class, data = Titanic) ftable(x)
We check for similarity of odds ratios. Here are the different odds ratios:
oddsratio(x, log = FALSE)
The estimates vary in some cases, but let us look to see if that variation is significant.
library(vcd) woolf_test(x)
No problems there, and here is a graphical display.
fourfold(x)
The graph looks good. We go on to perform the CMH test.
mantelhaen.test(x)
The marginal odds ratio of Sex
versus Age
is significantly different from one (the common odds ratio is estimated to be approximately R mantelhaen.test(x)$estimate
). We estimate the odds of being a male child at approximately half the odds of being a female child, consistently across social status.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.