knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE, fig.width = 7, fig.height = 5, fig.align = 'center') required <- c("vcd") if (!all(unlist(lapply(required, function(pkg) requireNamespace(pkg, quietly = TRUE))))) knitr::opts_chunk$set(eval = FALSE)
This vignette provides some R code that is related to some of the content of Chapter 8 of the STAT0002 notes, namely to Contingency tables. It also contains some technical information about classes of R objects and the way in which this affects what R does when we call functions to operate on an object. If this interests you then that's great, but otherwise focus on what the code below does rather than exactly how it works.
We return to data that we considered briefly in the
Chapter 3: Probability article. The object berkeley
is a 3-dimensional array that contains information about applicants to graduate school at UC Berkeley in 1973 for the six largest departments. Use ?UCBAdmissions
for more information. The 3 dimensions of the array correspond to the gender of the applicant (dimension named Gender
), whether or not they were admitted (named Admit
) and a letter code for the department to which they applied (named Dept
). A given entry in berkeley
gives the total number of applicants in the corresponding (Admit
, Gender
, Dept
) category.
In Chapter 3, we viewed these data are relating to a population containing the 4526 people who applied to graduate school at Berkeley in 1973 and did not seek to generalise beyond this population. In other words, we treated the relative frequencies in the various categories as known probabilities. Now, we view these data as a sample of data that may help us to make inferences about the application process at Berkeley in general. We will explore associations between the categorical variables (Admit
, Gender
, Dept
), or perhaps just two of these variables. Note that, following standard statistical terminology, R refers to categorical variables as factors and the possible values of these factors as levels.
library(stat0002)
The berkeley
dataset is a $2 \times 2 \times 6$ contingency table. For each of the 6 departments involved there is a $2 \times 2$ table for variables Admit
and Gender
.
# Find the dimensions of the data dim(berkeley) # What type of R object is berkeley? class(berkeley) # Print the data berkeley
Many objects in R have a class
attribute that that contains a (character) vector of names, perhaps just one name, that describes what type of object it is. This is useful because for some types of object have standard methods
are provided, to perform common tasks like printing, summarising and plotting. In the code above, instead of typing berkeley
we could have typed print(berkeley)
. When we do this R searches for an appropriate way to print to the R Console the object berkeley
, which has class "table"
. R looks for, and finds, a function print.table
to use to do this printing. We will come back to this later, for example when we consider producing plots of contingency table data.
We collapse the 3-way table to a 2-way table by ignoring the values of one of the 3 categorical variables. One way to do this is to use the xtabs
function in the stats
package, which comes as standard when you install R. To use xtabs
we first modify the structure of the data from a table to a data frame using the function as.data.frame
. Then the following call to xtabs
creates a 2-way table in which frequency Freq
is classified by Gender
and Admit
. We have aggregated (summed) Freq
within each Gender-Admit
category, over all the values of Dept
. Similarly, we could ignore Gender
or Admit
to produce a 2-way table for the remaining two variables.
berkdf <- as.data.frame(berkeley) berkdf ga <- xtabs(Freq ~ Gender + Admit, berkdf) ga
If we think about how the variables in this 2-way table may be related we might imagine that Gender
could affect the value of Admit
, that is, the gender of the applicant could affect the probability that the applicant is admitted. That is, Gender
could have a causal effect on Admit
, which is the main variable of interest in this example. In cases like this, we call Admit
the response variable and Gender
the explanatory variable because Gender
may explain variation in the response variable Admit
. It is natural to consider conditional probabilities of the levels of Admit
given the value of Gender
and we may wish to create our plots with this in mind.
Functions are available for calculating the totals and proportions that appear in Section 8.1 of the notes and in the Berkeley example in Chapter 3 of the notes.
# Total number of applicants marginSums(ga) # Number of males and females marginSums(ga, "Gender") # Number of admitted and rejected applicants marginSums(ga, "Admit") # Add the marginal totals to the table addmargins(ga) # Calculate proportions (relative frequencies) proportions(ga) # Row proportions (sum to 1 across the rows) proportions(ga, "Gender") # Column proportions (sum to 1 down the columns) proportions(ga, "Admit")
What class does ga
have?
class(ga)
The object ga
has 2 things in its vector class names: "xtabs"
and "table"
. This means that we have available to us any methods functions that have been created for use on objects of class "xtabs"
or class "table"
. If, for example, we use the code plot(ga)
then, because "xtabs"
appears first in the vector of class names, R looks first for a function called plot.xtabs
. If it does not find a function with this name then it looks for plot.table
. If it finds neither then it uses the function plot.default
. The function plot.default
definitely exists, but because it has not been designed for a specific input object then it might not work. In this case, there is no function plot.xtabs
but there is a function plot.table
. If the table has at least 2 factors then plot.table
produces a plot using the function mosaicplot
in the graphics
package, which also comes as standard when you install R. Let's see what happens if we do this.
plot(ga, main = "Observed frequencies", color = TRUE)
A mosaic plot is produced. First, the plot area is first split into two parts vertically, with the sizes of the parts reflecting the marginal distribution of the first variable (Gender
here). That is, the widths of the rectangles are proportional to the numbers of males and females respectively. We can see that there are more males in the data than females. Then similar splits are made horizontally within each of the vertical parts, with the sizes of the parts determined by the conditional distribution of the second variable (Admit
) conditional on the value of the first variable, that is, conditional on Gender = Male
and Gender = Female
.
Recall that we said earlier that it made sense to consider conditioning on Gender
and this is what has been done in this plot. Therefore, this plot is pretty much as we would like it to be. We may prefer to display the conditional distribution of Admit
given Gender
horizontally in the plot, rather than vertically. The following code achieves this, using the argument dir
, which determines whether we split first in the horizontal or vertical direction. The information in the plot is the same, but cosmetically it is slightly different.
plot(ga, main = "Observed frequencies", color = TRUE, dir = c("h", "v"))
We can see that more applicants are rejected than admitted and that the proportion of males that are admitted is greater than the proportion of females that are admitted.
If we had placed the variables in the data frame in the other order then, unless we make an adjustment, the mosaic plot is produced by conditioning on Admit
first, producing the following plot.
ag <- xtabs(Freq ~ Admit + Gender, berkdf) ag # Alternatively, we could have transposed ga t(ga) plot(ag, main = "Observed frequencies", color = TRUE)
This is not wrong, but it concentrates on conditional probabilities of Gender
given Admit
, which is not what we want. We can use the argument sort
to reverse the order in which the mosaic plots takes the variables and reproduce our preferred plot.
plot(ag, main = "Observed frequencies", color = TRUE, sort = 2:1)
We estimate expected frequencies under the assumption that the variables Gender
and Admit
are independent. We could use R to calculate these for ourselves, using the outer
function below. Look at ?outer
to see what this does. Alternatively, can use the function chisq.test
in the stats
package. We will come back to this function later, but for the moment we only want the values of the estimated expected frequencies that it calculates. We also produce a mosaic plot of the estimated expected frequencies
efreq <- outer(marginSums(ga, "Gender"), marginSums(ga, "Admit")) / marginSums(ga) efreq # Check using chisq.test efreq <- chisq.test(ga)$expected efreq # Trick R into using plot.table class(efreq) <- "table" # Plot estimated expected frequencies plot(efreq, main = "Estimated expected frequencies", color = TRUE, dir = c("h", "v"))
As we expect, the relative sizes of the estimated expected frequencies for Admitted
and Rejected
are the same for males and females. This provides us with a visual illustration of what mosaic plots of observed frequencies should look approximately like if the variables concerned are independent. The horizontal and vertical gaps between the rectangles should look approximately like a grid in which all the horizontal gaps and the vertical gaps are approximately lined up.
The assocplot
function in the graphics
package produces an association plot that summarises how the (Pearson) residuals vary between the combinations of the categories. The vertical extent of a rectangle is proportional to the corresponding Pearson residual and the width is proportional to the square root of the estimated expected frequency. Therefore, the area of a box is proportional to the corresponding (raw) residual, that is, this difference between the observed and estimated expected frequency. Look at the definitions of these residuals in Section 8.1.1 to see how this works.
assocplot(ga, col = c("black", "grey"))
Comparing the rectangles with the horizontal dashed line, we see that more men are admitted than is expected if Gender
and Admit
are independent.
The function chisq.test
in the stats
package can be used to perform the chi-squared outlined at the end of Section 8.1.1 of the notes. We specify correct = FALSE
to produce the same result given in the notes, that is, we do not use the Yates's correction for continuity.
chisq.test(ga, correct = FALSE)
The value of the test statistic $92.205$ is very much larger than expected under the hypothesis that Gender
and Admit
are independent. Therefore, we would reject hypothesis.
vcd
packageThe vcd
package (@vcd) provides various functions to summarise, visualise and make inferences using categorical data. Its functions mosaic
and assoc
produce plots that are equivalent to those produced by plot.table
and assocplot
above. It deals more easily with some aspects of tables of dimension greater than 2 than the functions in base R and offers some more features.
One extra feature of the functions mosaic
and assoc
is the ability to shade the rectangles in an association plot with colours that reflect the size of a residual. This can draw our attention to cells of the contingency table with large residuals and help us to spot patterns. By default, these functions base the shading on the values of the Pearson residuals. This reflects the contributions to the test statistic in the chi-squared test.
We might also like the option to shade based on the values of the standardised Pearson residuals. If the variables Gender
and Admit
are independent then these residuals should look approximately as if they have been sampled from a standard normal distribution. Therefore, values that are greater than 2 in magnitude are unusual - they have an approximate probability of $5%$ occurring - and values that are greater than 4 in magnitude are very surprising. We do this in the code below by providing the values of the standardised Pearson residuals to the functions in the vcd
package, using the argument residuals
. For the mosaic plot this just effects the numbers on the colour key. For the association plots using the standardisation of the Pearson residuals changes the vertical extents of the rectangles in the plots, so now the area of a rectangle is not proportional its residual. In this $2 \times 2$ case where all standardised Pearson residuals have the same magnitude all these vertical extents are equal.
For a technical reason, to do with wanting to change the label on the legend of the plot using residuals_type
, we call the vcd
function strucplot
, which is the plotting function underlying the function assoc
. The function mosaic
also allows us to colour the parts of a moasic plot based on the values of residuals.
library(vcd) # Extract the standardised Pearson residuals x2test <- chisq.test(ga) # Raw residuals x2test$observed - x2test$expected # Pearson residuals x2test$residuals # Standardised Pearson residuals x2test$stdres # Association plot of residuals with Pearson residual shading assoc(ga, shade = TRUE, margins = c(2.25, 1, 1, 2.5)) # Association plot of residuals with standardised Pearson residual shading strucplot(ga, shade = TRUE, residuals = x2test$stdres, margins = c(2.25, 1, 1, 2.5), residuals_type = "Standardised\nPearson\nresiduals", core = struc_assoc, keep_aspect_ratio = FALSE, legend_width = 6) # Mosaic plot with standardised Pearson residual shading mosaic(ga, shade = TRUE, residuals = x2test$stdres, residuals_type = "Standardised Pearson", margins = c(0, 0, 0, 0))
These plots indicate that the (effectively one) standardised Pearson residual has a magnitude ($9.6$) that is very much larger than we expect if Gender
and Admit
are independent. In larger contingency tables, with more cells (combinations of the factors) and/or more than 2 dimensions, we can use colouring like this to draw our attention to patterns in the data and departures from independence.
If we have 3 (or more) variables then there are many possible associations that we could examine. See Section 8.2 of the notes for details.
We produce a moasic plot based on all three variables in the berkeley
dataset. To avoid cluttering the plot with text, we create a new object x
in which the levels of Admit
have been abbreviated.
b <- berkeley dimnames(b)$Admit <- c("A", "R") dimnames(b)$Gender <- c("M", "F") plot(b, main = "Observed frequencies", sort = 3:1, color = c(1, 8))
This plot tells us quite a lot. If we look individually at the parts of the plot relating to departments B to F then we find that Gender
and Admit
look to be approximately independent within each of these departments. Only in department A does Admit
seem to depend on Gender
with a larger proportion of females who apply to this department being admitted than the males.
As we noted in Section 8.2.1 there seems little point in asking whether the 3 variables Gender
, Admit
and Dept
are independent when we have already concluded that Gender
and Admit
are not independent. However, we perform a chi-squared test in any case, this time using the generic function summary
. For an object of class table
this function calls chisq.test
to perform the test and also includes a very basic summary of the table in the output.
summary(berkeley, correction = FALSE)
We examine the association between Dept
and Gender
and then between Admit
and Dept
. We create our own function assoc2
that takes the contingency table tab
and standardised residuals residuals
as arguments, so that we can shorten the code needed to create an association plot with rectangles shaded based on the standardised Pearson residuals.
assoc2 <- function(tab, residuals, ...) { strucplot(tab, shade = TRUE, residuals = residuals, residuals_type = "Standardised Pearson", core = struc_assoc, ...) }
gd <- xtabs(Freq ~ Gender + Dept, berkdf) x2test <- chisq.test(gd, correct = FALSE) assoc2(gd, residuals = x2test$stdres, margins = c(0, 0, 0, 0))
There are some strong differences in the preferences of females and males concerning the departments to which they apply. Males prefer departments A and B and females the other departments, particularly departments C and E.
ad <- xtabs(Freq ~ Admit + Dept, berkdf) x2test <- chisq.test(ad, correct = FALSE) assoc2(ad, residuals = x2test$stdres, margins = c(0, 0, 0, 0))
This plot suggests that the probability of admittance is much greater in departments A and B, which are the departments to which males like to apply. Departments E and F have a relatively low probability of acceptance and females are more likely than males to apply to these departments.
The plots that we have seen provide a possible explanation for the fact that the overall probability of admittance is lower for females than males: the males were more likely to apply to the departments that had the higher probability of admittance.
Now we explore how successful females and males are at being admitted to Berkeley within each of the departments A to F, that is, we condition on the variable Dept
. We use the vcd
function cotabplot
to to produce an association plot for each of the departments A to F. The shading of the rectangles is based on the Pearson residuals.
cotabplot(~ Admit + Gender | Dept, data = berkeley, layout = 3:2, shade = TRUE, panel = cotab_assoc)
We see that only in department A is there a substantial difference between the admittance probability of females and males, with the females doing better than the males. Finally, we focus on department A and change the shading so that it is based on the standardised Pearson residuals, mainly to see that their common magnitude is $4.15$.
# A function to produce an association plot within a given department deptplot <- function(dept) { temp <- xtabs(Freq ~ Admit + Gender, berkdf, subset = berkdf[, "Dept"] == dept) x2test <- chisq.test(temp, correct = FALSE) assoc2(temp, residuals = x2test$stdres, main = paste("Dept ", dept)) return(x2test$stdres) }
deptplot("A")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.