malte.thodberg@bio.ku.dk
The plan:
model.matrix
for building model matricesExample datasets:
iris
dataset from datasets
(default R-package). oliveoil
dataset from the pdfCluster
package.Real datasets:
zebrafish
dataset fission
dataset pasilla
datasetAdvanced dataset for playing around:
tissues
datasetR has a powerful built-in language for linear models, making it simple to form a model matrix (0-1) representation of a design matrix.
This is done by using the model.matrix
function with the build in formula language (preceded by ~
).
Let's make a simple dummy design we can explore
Remember, this is how the design matrix for the zebrafish
dataset looks:
dummy <- data.frame(factor1=c("A", "A", "B", "B", "A", "A", "B", "B"), factor2=c("C", "C", "C", "C", "D", "D", "D", "D")) dummy
Simple 2-factor design build using ~
:
model.matrix(~factor1, data=dummy)
3-factor design using +
:
model.matrix(~factor1+factor2, data=dummy)
An an interaction term using :
:
model.matrix(~factor1+factor2+factor1:factor2, data=dummy)
*
is shorthand for both base levels and interactions:
model.matrix(~factor1*factor2, data=dummy)
The R formula interface is almost a programming language in itself!
Due to it's expressiveness, it's used by huge number of packages, making it well worth the effort to learn.
See this cheatsheet for an overview:
https://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html
Just as before, we started by loading, trimming and normalizing the data:
# Load the data library(Shenzen2016) library(ggplot2) library(edgeR) data("zebrafish") # Trim EM_zebra <- subset(zebrafish$Expression, rowSums(zebrafish$Expression >= 5) >= 3) # Calculate normalization factors dge_zebra <- DGEList(EM_zebra) dge_zebra <- calcNormFactors(dge_zebra, method="TMM")
Before we can estimate dispersions, we must decide on our model matrix. Recall the simpel design of the zebrafish
dataset:
zebrafish$Design
We use model.matrix
to make a simple model matrix:
mod <- model.matrix(~gallein, data=zebrafish$Design)
Now we move on to estimating dispersion or BCV:
disp_zebra <- estimateDisp(dge_zebra, design=mod, robust=TRUE)
The 3-step estimation is all done by a single function in the newest version of edgeR: estimateDisp
, which replaces the three previous functions of
estimateCommonDisp
estimateTrendedDisp
estimateTagwiseDisp
We can then plot the estimates:
plotBCV(y=disp_zebra)
Now we can fit gene-wise GLMs:
fit_zebra <- glmFit(disp_zebra, design=mod)
Finally, we can test a coefficient in the model for DE:
# Using the name of the coefficient gallein <- glmLRT(fit_zebra, coef="galleintreated") # Using the index of the coefficient gallein <- glmLRT(fit_zebra, coef=2)
edgeR has several useful functions for inspecting and plotting the results:
Inspecting the top hits with topTags
:
topTags(gallein)
Summarize the number of up- and down-regulated genes with decideTestsDGE
:
is_de <- decideTestsDGE(gallein, p.value=0.1) head(is_de) summary(is_de)
MA-plot, showing relation between mean expression and logFC, :
plotSmear(gallein, de.tags=rownames(gallein)[is_de != 0])
There is no built-in function for making a volcano plot, but it's easy to do yourself:
res <- as.data.frame(topTags(gallein, n=Inf, sort.by="none")) ggplot(res, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()
Recently, edgeR introduced an alternative DE pipeline using Quasi-likelihood. It should be slightly more conservative than the standard edgeR pipeline. Code-wise, though, they are almost identical:
fitQL <- glmQLFit(y=disp_zebra, design=mod, robust=TRUE) galleinQL <- glmQLFTest(fitQL, coef="galleintreated") summary(decideTestsDGE(galleinQL, p.value=0.1))
The QL-pipeline estimates slightly different dispersions:
plotQLDisp(fitQL)
Use the above code as a template for conducting your own analysis. For the yeast
and pasilla
datasets:
If you have time:
glmFit
with glmQLFit
and glmLRT
with glmQLFTest
c(0,0,-1,1)
Once again - the next couple of slides contain cheat sheets.
# Load the data data("yeast") data("pasilla") # Set a datasets dataset <- pasilla # Experiment with model matrices mod <- model.matrix(~condition, data=dataset$Design) # Look how the experiment is designed dataset$Design
# Trim EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3) # Calculate normalization factors dge <- calcNormFactors(DGEList(EM), method="TMM") # Calculate dispersion disp <- estimateDisp(dge, design=mod, robust=TRUE) # Fit models fit <- glmFit(disp, design=mod) # Perform the test of the given coefficient res <- glmLRT(fit, coef="conditionuntreated")
# Dispersion plotBCV(disp) # topTages topTags(res) # Up and down regulated genes summary(decideTestsDGE(res)) # Smear plot plotSmear(res, de.tags=rownames(res)[decideTestsDGE(res) != 0]) # Volcano all_genes <- as.data.frame(topTags(res, n=Inf, sort.by="none")) ggplot(all_genes, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.