Preamble

This document was generated from a file in the Rmd format. Both the source and target files are under the CC BY-SA license. The source file is versioned online.

Overview

This tutorial aims to explain contrasts in R using a one-way ANOVA in a completely balanced setting. It draws heavily on the references listed at the end of the document.

In the example, yield (the response variable) is regressed on an unordered factor, genotype. A small data set will be simulated, then a first version of the statistical model will be described, then a second one, leading to the motivation behind the usage of contrasts. Finally, two different coding schemes, and the contrasts they induce, will be described in detail.

Computations are made using R base functions as well as "by hand". But it is also shown how to obtain the required information (incidence matrix, contrast estimates, standard errors, test statistics and p values) using an external package, emmeans:

suppressPackageStartupMessages(library(emmeans))

Simulate some data

Choose the data dimensions

G <- 3 # number of genotypes
R <- 5 # number of replicates per genotype

Set up the data structure

(lev.genos <- paste0("var", 1:G))
(lev.reps <- LETTERS[1:R])
dat <- data.frame(geno=rep(lev.genos, each=R),
                  rep=rep(lev.reps, G),
                  yield=NA,
                  stringsAsFactors=FALSE)
dat

Generate the data

set.seed(12345) # to make the simulation reproducible
N <- nrow(dat) # total number of observations
stopifnot(G == 3)
(true.geno.means <- setNames(c(2, 4, 6), lev.genos))
(true.mean <- mean(true.geno.means))
for(i in 1:N){
  true.geno.mean <- true.geno.means[dat$geno[i]]
  dat$yield[i] <- true.geno.mean
}
sd.error <- 0.8
epsilons <- rnorm(n=N, mean=0, sd=sd.error)
dat$yield <- dat$yield + epsilons
dat$geno <- as.factor(dat$geno)
dat$rep <- as.factor(dat$rep)
dat

Explore the data

Data means

str(dat)
(grand.mean <- mean(dat$yield))
(geno.means <- tapply(dat$yield, dat$geno, mean))

Plots

hist(dat$yield, col="grey", border="white", las=1)
abline(v=grand.mean, lty=2)
legend("topright", legend="grand mean", lty=2, bty="n")
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=grand.mean, lty=2)
legend("topleft", legend="grand mean", lty=2, bty="n")

Build the models

The goal of the statistical models here will be to explain the mean of the response variable using the explanatory factor.

Version 1

As a first model, one can choose to explain the mean of the responses directly by the genotype means.

Notations

$G$ is the number of genotypes (indexed by $g$), $R$ the number of replicates per genotype (indexed by $r$), $y_{gr}$ the yield of replicate $r$ of genotype $g$, $m_g$ the mean of genotype $g$, and $\epsilon_{gr}$ the error term.

Likelihood

Scalar way

\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]

Matrix way

A few other notations: $\boldsymbol{y}$ is the vector of the $N = G \times R$ responses, $X_1$ is the $N \times G$ incidence matrix indicating the genotype of each observation, and $\boldsymbol{m}$ is the vector of the $G$ genotype means.

\[ \boldsymbol{y} = X_1 \boldsymbol{m} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \, \text{I}_N) \]

which is equivalent to:

\[ \boldsymbol{y} \, | \, X_1, \boldsymbol{m}, \sigma^2 \; \sim \; \mathcal{N}_N(X_1 \boldsymbol{m}, \sigma^2 \, \text{I}_N) \]

Note that this model has $G$ parameters for the expectation of $\boldsymbol{y}$.

Inference

Effect estimation

Under our assumptions, by minimizing the squared errors or by maximizing the likelihood, one ends up with the same estimator of $\boldsymbol{m}$, noted $\hat{\boldsymbol{m}}$ (see the Gauss-Markov theorem):

\[ X_1^T \, X_1 \; \hat{\boldsymbol{m}} \; = \; X_1^T \, \boldsymbol{y} \]

Without going into too many mathematical details, this system of equations (the so-called normal equations) can easily be solved (by inverting $X_1^T X_1$) if the columns of $X_1$ are linearly independent:

\[ \hat{\boldsymbol{m}} \; = \; (X_1^T X_1)^{-1} \, X_1^T \, \boldsymbol{y} \]

Importantly, to obtain the estimates of the parameters for the expectation, one first needs to specify a coding scheme for $X_1$. At this stage, it helps to write down the contents of the model vectors and matrices. Here it is with $R=2$:

\[
\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & \cdots \\
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\times
\begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix}
+
\begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix}
\]

Using the rule of matrix multiplication, one can easily retrieve the following equations: $y_{11} = m_1 + \epsilon_{11}$, $y_{12} = m_1 + \epsilon_{12}$, $y_{21} = m_2 + \epsilon_{21}$, and so on.

Here is an example in R:

## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_1 <- matrix(c(1,0,0,
                 1,0,0,
                 0,1,0,
                 0,1,0,
                 0,0,1,
                 0,0,1),
               nrow=6, ncol=3, byrow=TRUE))
tXX_1 <- t(X_1) %*% X_1
inv_tXX_1 <- solve(tXX_1)
(m.hat <- inv_tXX_1 %*% t(X_1) %*% y)
(epsilons.hat <- y - X_1 %*% m.hat)
mean(epsilons.hat)

Uncertainty quantification

Obtaining an estimate of the parameters is important, but quantifying our uncertainty about it is better.

Here is the variance of the estimator $\hat{\boldsymbol{m}}$:

\[ \mathbb{V}(\hat{\boldsymbol{m}}) = (X_1^T X_1)^{-1} \sigma^2 \]

To compute it, one hence needs to estimate the error variance. The estimate of $\sigma^2$ will be noted $S^2$:

\[ S^2 = \frac{RSS}{N - r} \]

where $RSS$ is the residual sum of squares ($\sum_i (y_i - \hat{y_i})^2$) and $r$ is the number of linearly independent columns of $X_1$ (i.e., the rank of $X_1$). This estimator is unbiased: $\mathbb{E}(S^2) = \sigma^2$.
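
As a rough illustration, here is a minimal sketch in R continuing the small example above (re-using y, X_1, epsilons.hat and inv_tXX_1 from the previous code block):

## sketch: estimate the error variance and the variance of m.hat
RSS_1 <- sum(epsilons.hat^2)       # residual sum of squares
r_1 <- qr(X_1)$rank                # rank of X_1, here equal to G
S2_1 <- RSS_1 / (length(y) - r_1)  # estimate of the error variance
(var.m.hat <- inv_tXX_1 * S2_1)    # V(m.hat) = (X_1^T X_1)^{-1} S^2
sqrt(diag(var.m.hat))              # standard errors of the m.hat's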

Hypothesis testing

Say our null hypothesis is "$m_g = a$". One can reject it at level $l$ if:

\[ \frac{|\hat{m}_g - a|}{\sqrt{\mathbb{V}(\hat{m}_g)}} \ge t_{1-l/2, \, N-r} \]

where $t_{1-l/2, \, N-r}$ is the quantile of order $1-l/2$ of Student's $t$ distribution with $N-r$ degrees of freedom.
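
For instance, still with the small example above (and var.m.hat and r_1 from the previous sketch), testing the null hypothesis "$m_1 = 0$" at level $l = 0.05$ could look like this:

## sketch: test "m_1 = 0" by comparing the test statistic to a Student quantile
a <- 0
l <- 0.05
(test.stat <- abs(m.hat[1] - a) / sqrt(var.m.hat[1, 1]))
test.stat >= qt(p=1 - l/2, df=length(y) - r_1) # TRUE => reject the null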

Version 2

The "cell-mean" model (v1 above) is intuitive, but it can be re-parametrize into another, equivalent model with an intercept.

Likelihood

Simple way

A few other notations: $\mu$ is an intercept common to all genotypes, and $m_g$ now represents the effect of genotype $g$ around it.

\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = \mu + m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]

Matrix way

A few other notations: $\boldsymbol{\theta} = (\mu, m_1, \ldots, m_G)^T$ is the vector of the $G+1$ parameters for the expectation, and $X_2$ the corresponding $N \times (G+1)$ incidence matrix.

\[ \boldsymbol{y} = X_2 \boldsymbol{\theta} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \, \text{I}_N) \]

Note that this model has $G+1$ parameters for the expectation of $\boldsymbol{y}$.

Inference

Effect estimation

Let us do the same exercise as before, that is, write down the contents of the model vectors and matrices:

\[
\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0 & \cdots \\
1 & 1 & 0 & 0 & \cdots \\
1 & 0 & 1 & 0 & \cdots \\
1 & 0 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\times
\begin{bmatrix} \mu \\ m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix}
+
\begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix}
\]

Again as above, using the rule of matrix multiplication, one can easily retrieve the following equations: $y_{11} = \mu + m_1 + \epsilon_{11}$, $y_{12} = \mu + m_1 + \epsilon_{12}$, $y_{21} = \mu + m_2 + \epsilon_{21}$, and so on.

However, note that the first column of $X_2$ (corresponding to $\mu$) is equal to the sum of all its other columns.

Here is an example in R:

## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_2 <- matrix(c(1,1,0,0,
                 1,1,0,0,
                 1,0,1,0,
                 1,0,1,0,
                 1,0,0,1,
                 1,0,0,1),
               nrow=6, ncol=3+1, byrow=TRUE))
kappa(X_2) # the higher, the more singular X is
tXX_2 <- t(X_2) %*% X_2
eigen(tXX_2)$values # the matrix is singular if at least one eigenvalue is zero
try(inv_tXX_2 <- solve(tXX_2))
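
One can also check numerically the claim made above about the columns of $X_2$:

## check: the first column of X_2 equals the sum of all its other columns
all.equal(X_2[, 1], rowSums(X_2[, -1]))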

This lack of linear independence between the columns of $X_2$ (the cross-product matrix $X_2^T X_2$ is singular, hence not invertible) means that one needs to add constraints (only one here) to estimate the parameters for the expectation of $\boldsymbol{y}$.
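
As a minimal illustration (this is not how lm proceeds internally; the proper treatment via contrasts follows), imposing for instance the constraint "$m_1 = 0$", i.e., dropping the corresponding column of $X_2$, makes the system solvable again:

## sketch: with the constraint "m_1 = 0" (column of the first genotype dropped),
## the cross-product matrix becomes invertible
X_2c <- X_2[, -2]
tXX_2c <- t(X_2c) %*% X_2c
solve(tXX_2c) %*% t(X_2c) %*% y # intercept = m1.hat; the others are differences with it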

Contrasts

Still a few other notations: the $G$ parameters for the expectation will now be noted $\alpha_0 \, (= \mu), \alpha_1, \ldots, \alpha_{G-1}$, and gathered in the vector $\boldsymbol{\alpha}$.

As said above, to deal with the non-independence between columns of the incidence matrix $X$, one needs to add constraint(s). Typically, a constraint takes the form of a linear combination of the regression coefficients:

\begin{equation} \label{eq:contrast_lin_comb} \sum_{j=0}^{G-1} c_j \, \alpha_j \; = \; c_0 \times \alpha_0 + c_1 \times \alpha_1 + \ldots + c_{G-1} \times \alpha_{G-1} \end{equation}

But it has a specificity: the sum of the coefficients of this linear combination equals zero, i.e., $\sum_{j=0}^{G-1} c_j = 0$. Such a linear combination is called a contrast, and this is what will be estimated.

Obviously, one wants to be able to go from $\boldsymbol{m}$ to $\boldsymbol{\alpha}$, and vice versa. Such transformations can easily be written using matrix calculus:

\[ \boldsymbol{\alpha} = C \, \boldsymbol{m} \; \Leftrightarrow \; \boldsymbol{m} = B \, \boldsymbol{\alpha} \]

where $B = C^{-1}$ and both are $G \times G$ matrices.

One typically chooses $B$ so that its first column is made of 1's: $B = [ \boldsymbol{1}_G \; B_\star ]$.

$\Rightarrow$ The $B_\star$ matrix is $G \times (G-1)$ and is called a coding matrix.

This leads to:

\[ \boldsymbol{m} = B \boldsymbol{\alpha} = [ \boldsymbol{1}_G \; B_\star ] \, [\alpha_0 \; \alpha_1 \; \ldots \; \alpha_{G-1}]^T = \boldsymbol{1}_G \, \mu + B_\star \, \boldsymbol{\alpha}_\star \]

Similarly, $C$ will be partitioned by separating its first row: $C = \begin{bmatrix} \boldsymbol{c}_0^T \\ C_\star \end{bmatrix}$.

$\Rightarrow$ The $C_\star$ matrix is $(G-1) \times G$ and is called a contrast matrix.

Its rows are the contrasts we want to estimate: they sum to zero.


CAUTION: in R, the contr.treatment, contr.sum, contr.helmert, contr.poly, contr.diff, etc. functions take factor levels as input and return a coding matrix ($B_\star$), not a contrast matrix ($C_\star$)!

Several possible coding schemes exist, each leading to different contrasts, and the interpretation of the contrasts depends on the coding used to obtain them.
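
For information, here are two other coding matrices available in base R (they are not used further in this tutorial):

contr.helmert(levels(dat$geno))
contr.poly(nlevels(dat$geno))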

Estimation

As a contrast is a linear combination of parameters, say $C_g \, \boldsymbol{m}$ (where $C_g$ is the $g^\text{th}$ row of $C$), one can easily compute its estimate:

\[ C_g \, \hat{\boldsymbol{m}} \]

Uncertainty quantification

The variance of the estimate of a given contrast will be:

\[ C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T \]

Hypothesis testing

The null hypothesis "$C_g \, \boldsymbol{m} = a$" is rejected at level $l$ if:

\[ \frac{|C_g \, \hat{\boldsymbol{m}} - a|}{\sqrt{C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T}} \ge t_{1-l/2, \, N-r} \]
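
As a rough sketch, these three formulas can be applied in R to the small example above (re-using m.hat, var.m.hat, r_1, y and l from the previous sketches), e.g., for the contrast $m_1 - m_2$:

## sketch: estimate, variance and test of the contrast m_1 - m_2
C_g <- matrix(c(1, -1, 0), nrow=1)     # contrast coefficients sum to zero
(contrast.hat <- C_g %*% m.hat)        # estimate of the contrast
(var.contrast.hat <- C_g %*% var.m.hat %*% t(C_g))
test.stat <- abs(contrast.hat - 0) / sqrt(var.contrast.hat)
test.stat >= qt(p=1 - l/2, df=length(y) - r_1) # TRUE => reject "m_1 - m_2 = 0"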

Dummy coding (contr.treatment)

Look at the contrasts

In R, this coding is obtained with contr.treatment, which also happens to be the default coding for unordered factors.

getOption("contrasts")
contrasts(dat$geno)
(Bstar.ct <- contr.treatment(levels(dat$geno)))

As explained above, the columns of Bstar ($B_\star$) are clearly not contrasts as they don't sum to zero. But obtaining the contrast matrix is easy:

  1. add a column of 1's to obtain the $B$ matrix,

  2. then invert it to obtain the $C$ matrix,

  3. and remove the first row to obtain the $C_\star$ matrix.

(B.ct <- cbind(intercept=1, Bstar.ct))
(C.ct <- solve(B.ct))
(Cstar.ct <- C.ct[-1,])

As explained above, the rows of Cstar ($C_\star$) are contrasts and they sum to zero:

apply(Cstar.ct, 1, sum)

Given that $\boldsymbol{\alpha} = C \boldsymbol{m}$, these rows show how to interpret the regression coefficients (the $\alpha$'s) in terms of the factor level means (the $m$'s): here, $\alpha_0 = m_1$, $\alpha_1 = m_2 - m_1$ and $\alpha_2 = m_3 - m_1$.

The intercept is the mean of a factor level chosen arbitrarily (the first level by default). It usually makes sense to choose this coding when one level is a natural reference for the others, such as a control.

Fit the model

fit.ct <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.treatment"))
summary(fit.ct)
(alpha.ct.hat <- coef(fit.ct))

Note that all three estimates are deemed significantly different from 0.

Residual sum of squares and estimate of error variance:

(tmp <- anova(fit.ct))
(RSS.ct <- tmp["Residuals", "Sum Sq"])
summary(fit.ct)$sigma
summary(fit.ct)$sigma^2
(r.ct <- fit.ct$rank)
(S2.ct <- RSS.ct / (N - r.ct))

Retrieve the contrast estimates from the data means

Reference level

$\alpha_0 = \mu = m_1$

alpha.ct.hat[1+0]
geno.means[1]
C.ct[1,,drop=FALSE] %*% geno.means

Estimate of the first contrast

$\alpha_1 = m_2 - m_1$

alpha.ct.hat[1+1]
geno.means[2] - geno.means[1]
C.ct[2,,drop=FALSE] %*% geno.means

Estimate of the second contrast

$\alpha_2 = m_3 - m_1$

alpha.ct.hat[1+2]
geno.means[3] - geno.means[1]
C.ct[3,,drop=FALSE] %*% geno.means

As a matrix multiplication

alpha.ct.hat
C.ct %*% geno.means

Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all three regression coefficients are significantly different from zero:

boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=alpha.ct.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")

Retrieve the data means from the contrast estimates

As a matrix multiplication:

geno.means
B.ct %*% alpha.ct.hat

Deviation coding (contr.sum)

Look at the contrasts

In R, this coding is obtained with contr.sum.

options(contrasts=c("contr.sum", "contr.poly"))
getOption("contrasts")
contrasts(dat$geno)
(Bstar.cs <- contr.sum(levels(dat$geno)))

As explained above, the columns of Bstar ($B_\star$) are not contrasts (even though here they sum to zero!). But obtaining the contrast matrix is easy:

  1. add a column of 1's to obtain the $B$ matrix,

  2. then invert it to obtain the $C$ matrix,

  3. and remove the first row to obtain the $C_\star$ matrix.

(B.cs <- cbind(intercept=1, Bstar.cs))
(C.cs <- solve(B.cs))
(Cstar.cs <- C.cs[-1,])

As explained above, the rows of Cstar ($C_\star$) are contrasts and they sum to zero:

apply(Cstar.cs, 1, sum)

Given that $\boldsymbol{\alpha} = C \boldsymbol{m}$, these rows show how to interpret the regression coefficients (the $\alpha$'s) in terms of the factor level means (the $m$'s): here, $\alpha_0 = \mu = \frac{1}{3}(m_1 + m_2 + m_3)$, $\alpha_1 = m_1 - \mu$ and $\alpha_2 = m_2 - \mu$.

The intercept is the mean of all the factor level means. It usually makes sense to choose this coding when no level is a natural reference for the others.

Even though it is not returned by the lm function below, the contrast of the last factor level is also of interest; it can be deduced from the others since the deviations from $\mu$ sum to zero: $m_G - \mu = -\sum_{g=1}^{G-1} \alpha_g$, i.e., here $m_3 - \mu = -(\alpha_1 + \alpha_2)$.

Fit the model

fit.cs <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.sum"))
summary(fit.cs)
(alpha.cs.hat <- coef(fit.cs))

Note that only the estimates of the intercept and the first contrast are deemed significantly different from 0.

Residual sum of squares and estimate of error variance:

(tmp <- anova(fit.cs))
(RSS.cs <- tmp["Residuals", "Sum Sq"])
summary(fit.cs)$sigma
summary(fit.cs)$sigma^2
(r.cs <- fit.cs$rank)
(S2.cs <- RSS.cs / (N - r.cs))

Retrieve the contrast estimates from the data means

Intercept

$\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g$

alpha.cs.hat[1+0]
mean(geno.means)

Estimate of the first contrast

$\alpha_1 = m_1 - \mu$

alpha.cs.hat[1+1]
geno.means[1] - mean(geno.means)

Estimate of the second contrast

$\alpha_2 = m_2 - \mu$

alpha.cs.hat[1+2]
geno.means[2] - mean(geno.means)

Estimate of the last contrast

$m_3 - \mu = -(\alpha_1 + \alpha_2)$

geno.means[3] - mean(geno.means)
-(alpha.cs.hat[1+1] + alpha.cs.hat[1+2]) # deduced from the two other contrasts

As a matrix multiplication

alpha.cs.hat
C.cs %*% geno.means

Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all the regression coefficients are significantly different from zero except the one corresponding to var2:

boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=alpha.cs.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")

Retrieve the data means from the contrast estimates

As a matrix multiplication:

geno.means
B.cs %*% alpha.cs.hat

By hand

Same as with lm

In this section, we show how to obtain, using only low-level functions, all the information we want (incidence matrix, contrast estimates, standard errors, test statistics and p values).

The model.matrix function returns the design/incidence matrix:

X.cs <- model.matrix(yield ~ geno, data=dat,
                     contrasts.arg=list(geno="contr.sum"))
print(X.cs)

But we can also compute it as $X_1 B$:

X.1 <- model.matrix(yield ~ 0 + geno, data=dat,
                    contrasts.arg=list(geno="contr.sum"))
## X.1 <- matrix(data=0, nrow=nrow(dat), ncol=nlevels(dat$geno))
## colnames(X.1) <- levels(dat$geno)
## for(j in 1:ncol(X.1))
##   X.1[,j] <- as.numeric(dat$geno == colnames(X.1)[j])
X.1
B.cs
(X.cs <- X.1 %*% B.cs)

Estimates of regression coefficients:

tXX_cs <- t(X.cs) %*% X.cs
inv_tXX_cs <- solve(tXX_cs)
(alpha.cs.hat <- inv_tXX_cs %*% t(X.cs) %*% dat$yield)

Residual sum of squares and estimate of error variance:

epsilon.cs.hat <- dat$yield - X.cs %*% alpha.cs.hat
(RSS.cs <- t(epsilon.cs.hat) %*% epsilon.cs.hat)
(S2.cs <- RSS.cs / (N - r.cs))
S2.cs <- S2.cs[1,1]

Variance of estimates of regression coefficients:

(var.alpha.cs.hat <- inv_tXX_cs * S2.cs)
(se.alpha.cs.hat <- sqrt(diag(var.alpha.cs.hat)))

Test statistics:

(test.stats.cs <- alpha.cs.hat / se.alpha.cs.hat)

p values:

(pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, lower.tail=FALSE))
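
As a check, these by-hand results should match the coefficient table reported by lm:

## check: compare the by-hand results with the output of lm()
summary(fit.cs)$coefficients
cbind(alpha.cs.hat, se.alpha.cs.hat, test.stats.cs, pvals.cs)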

Custom set of contrasts

Now that we know how to compute the same contrast estimates by hand as with lm, we can decide to compute the estimates for the contrasts corresponding to, say, the second and third factor levels, omitting the first one. This is done by carefully making the $B_\star$ matrix, noted here Bstar.cs2 (second version of the coding matrix corresponding to the sum coding), and then all the same equations apply:

Bstar.cs2 <- matrix(data=c(-1, -1,
                           1, 0,
                           0, 1),
                    byrow=TRUE,
                    nrow=G, ncol=G-1)
rownames(Bstar.cs2) <- lev.genos
(B.cs2 <- cbind(intercept=1, Bstar.cs2))
X.cs2 <- X.1 %*% B.cs2

As expected, we obtain the same estimate for the intercept and the contrast corresponding to the second factor level, but we also now obtain the estimate for the last factor level which we didn't obtain with the usual call of lm above:

tXX_cs2 <- t(X.cs2) %*% X.cs2
inv_tXX_cs2 <- solve(tXX_cs2)
(alpha.cs2.hat <- inv_tXX_cs2 %*% t(X.cs2) %*% dat$yield)

The rest of the information (standard errors, test statistics and p values) are also obtained with the same equations as above:

epsilon.cs2.hat <- dat$yield - X.cs2 %*% alpha.cs2.hat
(RSS.cs2 <- t(epsilon.cs2.hat) %*% epsilon.cs2.hat)
(S2.cs2 <- RSS.cs2 / (N - r.cs))
S2.cs2 <- S2.cs2[1,1]
(var.alpha.cs2.hat <- inv_tXX_cs2 * S2.cs2)
(se.alpha.cs2.hat <- sqrt(diag(var.alpha.cs2.hat)))
(test.stats.cs2 <- alpha.cs2.hat / se.alpha.cs2.hat)
(pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, lower.tail=FALSE))
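
Note that the same estimates can also be obtained directly with lm, since a custom coding matrix such as Bstar.cs2 can be supplied via its contrasts argument (a sketch):

## sketch: pass the custom coding matrix directly to lm
fit.cs2 <- lm(yield ~ geno, data=dat, contrasts=list(geno=Bstar.cs2))
summary(fit.cs2)$coefficients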

Marginal means

The emmeans package aims at simplifying all this:

(emm <- emmeans(object=fit.cs, specs="geno"))
contrast(object=emm, method="dunnett") # same as contr.treatment
contrast(object=emm, method="eff")     # same as contr.sum

Perspectives

References

Appendix

print(sessionInfo(), locale=FALSE)

