This document was generated from a file in the Rmd format. Both the source and target files are under the CC BY-SA license. The source file is versioned online.
This tutorial aims at explaining contrasts in R using the one-way ANOVA in a completely balanced setting. It draws heavily on the references listed at the end of the document.
In an example, yield
as the response variable is regressed on an unordered factor, genotype
.
A small data set will be simulated, then a first version of the statistical model will be described, then a second, leading to the motivation behind the usage of contrasts.
Finally, two different contrasts will be described in details.
Computations are made using R base functions as well as "by hand".
But it is also shown how to obtain the required information (incidence matrix, contrast estimates, standard errors, test statistics and p values) using an external package, emmeans
:
suppressPackageStartupMessages(library(emmeans))
G <- 3 # number of genotypes R <- 5 # number of replicates per genotype
(lev.genos <- paste0("var", 1:G)) (lev.reps <- LETTERS[1:R]) dat <- data.frame(geno=rep(lev.genos, each=R), rep=rep(lev.reps, G), yield=NA, stringsAsFactors=FALSE) dat
set.seed(12345) # to make the simulation reproducible N <- nrow(dat) # total number of observations stopifnot(G == 3) (true.geno.means <- setNames(c(2, 4, 6), lev.genos)) (true.mean <- mean(true.geno.means)) for(i in 1:N){ true.geno.mean <- true.geno.means[dat$geno[i]] dat$yield[i] <- true.geno.mean } sd.error <- 0.8 epsilons <- rnorm(n=N, mean=0, sd=sd.error) dat$yield <- dat$yield + epsilons dat$geno <- as.factor(dat$geno) dat$rep <- as.factor(dat$rep) dat
str(dat) (grand.mean <- mean(dat$yield)) (geno.means <- tapply(dat$yield, dat$geno, mean))
hist(dat$yield, col="grey", border="white", las=1) abline(v=grand.mean, lty=2) legend("topright", legend="grand mean", lty=2, bty="n")
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE, main="Boxplots of dat$yield") abline(h=grand.mean, lty=2) legend("topleft", legend="grand mean", lty=2, bty="n")
The goal of the statistical models here will be to explain the mean of the response variable using the explanatory factor.
As a first model, one can choose to explain the mean of the responses directly by the genotype means.
$G$: total number of genotypes
$R$: total number of replicates (here, the same number for each genotype)
$g$: index of the genotype
$r$: index of the replicate
$y_{gr}$: yield for replicate $r$ of genotype $g$
$m_g$: mean yield of genotype $g$
$\epsilon_{gr}$: error for replicate $r$ of genotype $g$
$\sigma^2$: variance of the errors
[ \forall g \in {1,\ldots,G} \text{ and } r \in {1,\ldots,R}, \; y_{gr} = m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) ]
A few other notations:
$N$: total number of observations (here, $N = G \times R$)
$i \in {1,\ldots,N}$: index of the observation
$\boldsymbol{y}$: $N$-vector of observations
$X_1$: $N \times G$ design/incidence matrix relating observations to genotypes
$\boldsymbol{m} = {m_1,\ldots,m_G}$: $G$-vector of parameters explaining the mean of the response variable
$\mathcal{N}_N$: multivariate Normal distribution of dimension $N$
$\text{I}_N$: $N \times N$ identity matrix
[ \boldsymbol{y} = X_1 \boldsymbol{m} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) ]
which is equivalent to:
[ \boldsymbol{y} \, | \, X_1, \boldsymbol{m}, \sigma^2 \; \sim \; \mathcal{N}_N(X_1 \boldsymbol{m}, \sigma^2 \text{I}_N) ]
Note that this model has $G$ parameters for the expectation of $\boldsymbol{y}$.
Under our assumptions, by minimizing the squared errors or by maximizing the likelihood, one ends up with the same estimator of $\boldsymbol{m}$, noted $\hat{\boldsymbol{m}}$ (see the Gauss-Markov theorem):
[ X_1^T \, X_1 \; \hat{\boldsymbol{m}} \; = \, X_1^T \, \boldsymbol{y} ]
Without going into too many mathematical details, this system of equations can be easily resolved (by inverting $X_1^T X_1$) if the columns of $X_1$ are independent from each other:
[ \hat{\boldsymbol{m}} \; = \, (X_1^T X_1)^{-1} \, X_1^T \, \boldsymbol{y} ]
Importantly, to obtain the estimates of the parameters for the expectation, one first needs to specify a coding scheme for $X_1$. At this stage, it helps writing down the insides of the model vectors/matrices. Here it is using $R=2$:
[ \begin{bmatrix} y_{11} \ y_{12} \ y_{21} \ y_{22} \ \vdots \ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & ... \ 1 & 0 & 0 & ... \ 0 & 1 & 0 & ... \ 0 & 1 & 0 & ... \ \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} m_1 \ m_2 \ \vdots \ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \ \epsilon_{12} \ \epsilon_{21} \ \epsilon_{22} \ \vdots \ \epsilon_{GR} \end{bmatrix} ]
Using the rule of matrix multiplication, one can easily retrieve the following equations:
$y_{11} = m_1 + \epsilon_{11}$
$y_{12} = m_1 + \epsilon_{12}$
$y_{21} = m_2 + \epsilon_{21}$
...
Here is an example in R:
## example with G=3 genotypes and R=2 reps y <- dat$yield[dat$rep %in% c("A","B")] (X_1 <- matrix(c(1,0,0, 1,0,0, 0,1,0, 0,1,0, 0,0,1, 0,0,1), nrow=6, ncol=3, byrow=TRUE)) tXX_1 <- t(X_1) %*% X_1 inv_tXX_1 <- solve(tXX_1) (m.hat <- inv_tXX_1 %*% t(X_1) %*% y) (epsilons.hat <- y - X_1 %*% m.hat) mean(epsilons.hat)
Obtaining an estimate of the parameters is important, but quantifying our uncertainty about it is better.
Here is the variance of the estimator $\hat{\boldsymbol{m}}$:
[ \mathbb{V}(\hat{\boldsymbol{m}}) = (X_1^T X_1)^{-1} \sigma^2 ]
To compute it, one hence needs to estimate the error variance. The estimate of $\sigma^2$ will be noted $S^2$:
[ \mathbb{E}(S^2) = \frac{RSS}{N - r} ]
where $RSS$ is the residual sum of squares ($\sum_i (y_i - \hat{y_i})^2$) and $r$ is the number of independent columns of $X_1$ (called the rank of $X_1$).
Say our null hypothesis is "$m_g = a$". One can reject it at level $l$ if:
[ \frac{|m_g - a|}{\sqrt{\mathbb{V}(\hat{m_g})}} \ge t_{1-l/2, N-r} ]
where $t$ is the Student's distribution.
The "cell-mean" model (v1 above) is intuitive, but it can be re-parametrize into another, equivalent model with an intercept.
A few other notations:
[ \forall g \in {1,\ldots,G} \text{ and } r \in {1,\ldots,R}, \; y_{gr} = \mu + m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) ]
A few other notations:
$X_2$: $N \times (G+1)$ design/incidence matrix relating observations (the $y_{gr}$'s) to genotypes
$\boldsymbol{\theta} = {\mu,m_1,\ldots,m_G}$: $(G+1)$-vector of parameters explaining the mean of the response variable
[ \boldsymbol{y} = X_2 \boldsymbol{\theta} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) ]
Note that this model has $G+1$ parameters for the expectation of $\boldsymbol{y}$.
Let us make the same exercise as before, that is, writing the insides of the model vectors/matrices:
[ \begin{bmatrix} y_{11} \ y_{12} \ y_{21} \ y_{22} \ \vdots \ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & ... \ 1 & 1 & 0 & 0 & ... \ 1 & 0 & 1 & 0 & ... \ 1 & 0 & 1 & 0 & ... \ \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} \mu \ m_1 \ m_2 \ \vdots \ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \ \epsilon_{12} \ \epsilon_{21} \ \epsilon_{22} \ \vdots \ \epsilon_{GR} \end{bmatrix} ]
Again as above, using the rule of matrix multiplication, one can easily retrieve the following equations:
$y_{11} = \mu + m_1 + \epsilon_{11}$
$y_{12} = \mu + m_1 + \epsilon_{12}$
$y_{21} = \mu + m_2 + \epsilon_{21}$
...
However, note that the first column of $X_2$ (corresponding to $\mu$) is equal to the sum of all its other columns.
Here is an example in R:
## example with G=3 genotypes and R=2 reps y <- dat$yield[dat$rep %in% c("A","B")] (X_2 <- matrix(c(1,1,0,0, 1,1,0,0, 1,0,1,0, 1,0,1,0, 1,0,0,1, 1,0,0,1), nrow=6, ncol=3+1, byrow=TRUE)) kappa(X_2) # the higher, the more singular X is tXX_2 <- t(X_2) %*% X_2 eigen(tXX_2)$values # the matrix is singular if at least one eigenvalue is zero try(inv_tXX_2 <- solve(tXX_2))
This lack of independence between the columns of $X_2$ (such a matrix is said to be singular) means that one needs to add constraints (only one in here) to estimate the parameters for the expectation of $y$.
Still a few other notations:
$\boldsymbol{\alpha}$: $G$-vector of regression coefficients
the first element corresponds to the intercept: $\alpha_0 := \mu$
the others correspond to a $(G-1)$-vector noted $\boldsymbol{\alpha}_\star$
${c_0,\ldots,c_{G-1}}$: set of $G$ real numbers
As said above, to deal with the non-independence between columns of the incidence matrix $X$, one needs to add constraint(s). Typically, a constraint takes the form of a linear combination of the regression coefficients:
\begin{equation} \label{eq:contrast_lin_comb} \sum_{j=0}^{G-1} c_j \, \boldsymbol{\alpha}j \; = \; c_0 \times \alpha_0 + c_1 \times \alpha_1 + \ldots + c{G-1} \times \alpha_{G-1} \end{equation}
But it has a specificity, the sum of the coefficients of this linear combination equals zero: $\sum_{j=0}^{G-1} c_j = 0$. Such a linear combination is called a contrast, and this is what will be estimated.
Obviously, one want to be able to go from $\boldsymbol{m}$ to $\boldsymbol{\alpha}$, et reciprocally. Such transformations can be easily seen using matrix calculus:
[ \boldsymbol{\alpha} = C \, \boldsymbol{m} \; \Leftrightarrow \; \boldsymbol{m} = B \, \boldsymbol{\alpha} ]
where $B = C^{-1}$ and both are $G \times G$ matrices.
One typically chooses $B$ so that its first column is made of 1's: $B = [ \boldsymbol{1}G \; B\star ]$.
$\Rightarrow$ The $B_\star$ matrix is $G \times (G-1)$ and is called a coding matrix.
This leads to:
[ \boldsymbol{m} = B \boldsymbol{\alpha} = [ \boldsymbol{1}G \; B\star ] [\alpha_0 \ldots \alpha_{G-1}]^T = \boldsymbol{1}G \mu + B\star \boldsymbol{\alpha}_\star ]
Similarly, $C$ will be partitioned by separating the first row: $C = \begin{bmatrix} \boldsymbol{c}0^T \ C\star^T \end{bmatrix}$.
$\Rightarrow$ The $C_\star$ matrix is $(G-1) \times G$ and is called a contrast matrix.
Its rows are the contrasts we want to estimate: they sum to zero.
CAUTION: in R, the contr.treatment
, contr.sum
, contr.helmert
, contr.poly
, contr.diff
, etc functions take factor levels as input and return a coding matrix ($B_\star$), not a contrast matrix ($C_\star$)!
Several possible coding schemes exist, each leading to different contrasts, and the interpretation of the contrasts depends on the coding used to obtain them.
As a contrast is a linear combination of parameters, say $C_g \, \boldsymbol{m}$ (where $C_g$ is the $g^\text{th}$ row of $C$), one can easily compute its estimate:
[ C_g \, \hat{\boldsymbol{m}} ]
The variance of a given contrast will be: [ C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T ]
The null hypothesis "$C_g \, \boldsymbol{m} = a$" is rejected a level $l$ if:
[ \frac{|C_g \, \boldsymbol{m} - a|}{\sqrt{C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T}} \ge t_{1-l/2, N-r} ]
contr.treatment
)In R, this coding is obtained with contr.treatment
, which also happens to be the default coding for unordered factors.
getOption("contrasts") contrasts(dat$geno) (Bstar.ct <- contr.treatment(levels(dat$geno)))
As explained above, the columns of Bstar
($B_\star$) are clearly not contrasts as they don't sum to zero.
But obtaining the contrast matrix is easy:
add a column of 1's to obtain the $B$ matrix,
then inverse it to obtain the $C$ matrix,
and remove the first row to obtain the $C_\star$ matrix.
(B.ct <- cbind(intercept=1, Bstar.ct)) (C.ct <- solve(B.ct)) (Cstar.ct <- C.ct[-1,])
As explained above, the rows of Cstar
($C_\star$) are contrasts and they sum to zero:
apply(Cstar.ct, 1, sum)
Given that $\boldsymbol{\alpha} = C \boldsymbol{m}$, these rows show how to interpret the regression coefficients (the $\alpha$'s) in terms of the factor level means (the $m$'s):
$\alpha_0 = 1 \times m_1 + 0 \times m_2 + 0 \times m_3 = m_1$
$\alpha_1 = -1 \times m_1 + 1 \times m_2 + 0 \times m_3 = m_2 - m_1$
$\alpha_2 = -1 \times m_1 + 0 \times m_2 + 1 \times m_3 = m_3 - m_1$
The intercept is the mean of a factor level chosen arbitrarily (the first level by default). It usually makes sense to choose this coding when one level is a natural reference for the others, such as a control.
fit.ct <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.treatment")) summary(fit.ct) (alpha.ct.hat <- coef(fit.ct))
Note that all three estimates are deemed significantly different than 0.
Residual sum of squares and estimate of error variance:
(tmp <- anova(fit.ct)) (RSS.ct <- tmp["Residuals", "Sum Sq"]) summary(fit.ct)$sigma summary(fit.ct)$sigma^2 (r.ct <- fit.ct$rank) (S2.ct <- RSS.ct / (N - r.ct))
$\alpha_0 = \mu = m_1$
alpha.ct.hat[1+0] geno.means[1] C.ct[1,,drop=FALSE] %*% geno.means
$\alpha_1 = m_2 - m_1$
alpha.ct.hat[1+1] geno.means[2] - geno.means[1] C.ct[2,,drop=FALSE] %*% geno.means
$\alpha_2 = m_3 - m_1$
alpha.ct.hat[1+2] geno.means[3] - geno.means[1] C.ct[3,,drop=FALSE] %*% geno.means
alpha.ct.hat
C.ct %*% geno.means
Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all three regression coefficients are significantly different than zero:
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE, main="Boxplots of dat$yield") abline(h=alpha.ct.hat[1], lty=2) legend("topleft", legend="intercept", lty=2, bty="n")
As a matrix multiplication:
geno.means
B.ct %*% alpha.ct.hat
contr.sum
)In R, this coding is obtained with contr.sum
.
options(contrasts=c("contr.sum", "contr.poly")) getOption("contrasts") contrasts(dat$geno) (Bstar.cs <- contr.sum(levels(dat$geno)))
As explained above, the columns of Bstar
($B_\star$) are not contrasts (even though here they sum to zero!).
But obtaining the contrast matrix is easy:
add a column of 1's to obtain the $B$ matrix,
then inverse it to obtain the $C$ matrix,
and remove the first row to obtain the $C_\star$ matrix.
(B.cs <- cbind(intercept=1, Bstar.cs)) (C.cs <- solve(B.cs)) (Cstar.cs <- C.cs[-1,])
As explained above, the rows of Cstar
($C_\star$) are contrasts and they sum to zero:
apply(Cstar.cs, 1, sum)
Given that $\boldsymbol{\alpha} = C \boldsymbol{m}$, these rows show how to interpret the regression coefficients (the $\alpha$'s) in terms of the factor level means (the $m$'s):
$\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g$
$\alpha_1 = m_1 - \mu$
$\alpha_2 = m_2 - \mu$
The intercept is the mean of all the factor level means. It usually makes sense to choose this coding when no level is a natural reference for the others.
Even though it is not returned by the lm
function below, the contrast of the last factor level is of interest and can be deduced from the others:
fit.cs <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.sum")) summary(fit.cs) (alpha.cs.hat <- coef(fit.cs))
Note that only the estimates of the intercept and the first contrast are deemed significantly different than 0.
Residual sum of squares and estimate of error variance:
(tmp <- anova(fit.cs)) (RSS.cs <- tmp["Residuals", "Sum Sq"]) summary(fit.cs)$sigma summary(fit.cs)$sigma^2 (r.cs <- fit.cs$rank) (S2.cs <- RSS.cs / (N - r.cs))
$\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g$
alpha.cs.hat[1+0] mean(geno.means)
$\alpha_1 = m_1 - \mu$
alpha.cs.hat[1+1] geno.means[1] - mean(geno.means)
$\alpha_2 = m_2 - \mu$
alpha.cs.hat[1+2] geno.means[2] - mean(geno.means)
$\alpha_3 = m_3 - \mu$
geno.means[3] - mean(geno.means)
alpha.cs.hat
C.cs %*% geno.means
Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all three regression coefficients are significantly different than zero except the one corresponding to var2
:
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE, main="Boxplots of dat$yield") abline(h=alpha.cs.hat[1], lty=2) legend("topleft", legend="intercept", lty=2, bty="n")
As a matrix multiplication:
geno.means
B.cs %*% alpha.cs.hat
lm
In this section, we show how to obtain all the information we want using only low-level functions (incidence matrix, contrast estimates, standard errors, test statistics and p values).
The model.matrix
function returns the design/incidence matrix:
X.cs <- model.matrix(yield ~ geno, data=dat, contrasts.arg=list(geno="contr.sum")) print(X.cs)
But we can also compute it as $X_1 B$:
X.1 <- model.matrix(yield ~ 0 + geno, data=dat, contrasts.arg=list(geno="contr.sum")) ## X.1 <- matrix(data=0, nrow=nrow(dat), ncol=nlevels(dat$geno)) ## colnames(X.1) <- levels(dat$geno) ## for(j in 1:ncol(X.1)) ## X.1[,j] <- as.numeric(dat$geno == colnames(X.1)[j]) X.1 B.cs (X.cs <- X.1 %*% B.cs)
Estimates of regression coefficients:
tXX_cs <- t(X.cs) %*% X.cs inv_tXX_cs <- solve(tXX_cs) (alpha.cs.hat <- inv_tXX_cs %*% t(X.cs) %*% dat$yield)
Residual sum of squares and estimate of error variance:
epsilon.cs.hat <- dat$yield - X.cs %*% alpha.cs.hat (RSS.cs <- t(epsilon.cs.hat) %*% epsilon.cs.hat) (S2.cs <- RSS.cs / (N - r.cs)) S2.cs <- S2.cs[1,1]
Variance of estimates of regression coefficients:
(var.alpha.cs.hat <- inv_tXX_cs * S2.cs) (se.alpha.cs.hat <- sqrt(diag(var.alpha.cs.hat)))
Test statistics:
(test.stats.cs <- alpha.cs.hat / se.alpha.cs.hat)
p values:
l <- 0.05 (pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, ncp=1-l/2, lower.tail=FALSE)) (pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, lower.tail=FALSE))
Now that we know how to compute the same contrast estimates by hand as with lm
, we can decide to compute the estimates for the contrasts corresponding to, say, the second and third factor level, omitting the first one.
This is done by carefully making the $B_\star$ matrix, noted here Bstar.cs2
(second version of the coding matrix corresponding to the sum contrast, and then all the same equations apply:
Bstar.cs2 <- matrix(data=c(-1, -1, 1, 0, 0, 1), byrow=TRUE, nrow=G, ncol=G-1) rownames(Bstar.cs2) <- lev.genos (B.cs2 <- cbind(intercept=1, Bstar.cs2)) X.cs2 <- X.1 %*% B.cs2
As expected, we obtain the same estimate for the intercept and the contrast corresponding to the second factor level, but we also now obtain the estimate for the last factor level which we didn't obtain with the usual call of lm
above:
tXX_cs2 <- t(X.cs2) %*% X.cs2 inv_tXX_cs2 <- solve(tXX_cs2) (alpha.cs2.hat <- inv_tXX_cs2 %*% t(X.cs2) %*% dat$yield)
The rest of the information (standard errors, test statistics and p values) are also obtained with the same equations as above:
epsilon.cs2.hat <- dat$yield - X.cs2 %*% alpha.cs2.hat (RSS.cs2 <- t(epsilon.cs2.hat) %*% epsilon.cs2.hat) (S2.cs2 <- RSS.cs2 / (N - r.cs)) S2.cs2 <- S2.cs2[1,1] (var.alpha.cs2.hat <- inv_tXX_cs2 * S2.cs2) (se.alpha.cs2.hat <- sqrt(diag(var.alpha.cs2.hat))) (test.stats.cs2 <- alpha.cs2.hat / se.alpha.cs2.hat) l <- 0.05 (pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, ncp=1-l/2, lower.tail=FALSE)) (pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, lower.tail=FALSE))
The emmeans
package aims at simplifying all this:
(emm <- emmeans(object=fit.cs, specs="geno")) contrast(object=emm, method="dunnett") # same as contr.treatment contrast(object=emm, method="eff") # same as contr.sum
ANOVA table (aov
) and types of sum of squares
other contrasts: contr.helmert
, etc
two-way ANOVA with interactions
vignette of the codingMatrices
package from Bill Venables
notes of a course on applied linear models devoted to one-way ANOVA from Bo Li (Purdue University)
tutorial on coding systems from the Institute for Digital Reasearch and Education at UCLA
print(sessionInfo(), locale=FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.