This presentation will show the basic usage of edgeR:
# Load the data library(ABC2017) library(ggplot2) theme_set(theme_minimal()) library(edgeR) data("zebrafish")
Just as before, we started by loading, trimming and normalizing the data:
# Trim EM_zebra <- subset(zebrafish$Expression, rowSums(zebrafish$Expression >= 5) >= 3) # Calculate normalization factors dge_zebra <- DGEList(EM_zebra) dge_zebra <- calcNormFactors(dge_zebra, method="TMM")
We use model.matrix
to make a simple model matrix:
mod <- model.matrix(~gallein, data=zebrafish$Design) mod
Now we move on to estimating dispersion or BCV:
disp_zebra <- estimateDisp(dge_zebra, design=mod, robust=TRUE)
The 3-step estimation is all done by a single function in the newest version of edgeR: estimateDisp
, which replaces the three previous functions of
estimateCommonDisp
estimateTrendedDisp
estimateTagwiseDisp
We can then plot the estimates:
plotBCV(y=disp_zebra)
Now we can fit gene-wise GLMs:
fit_zebra <- glmFit(disp_zebra, design=mod)
Finally, we can test a coefficient in the model for DE:
# Using the name of the coefficient gallein <- glmLRT(fit_zebra, coef="galleintreated") # Using the index of the coefficient gallein <- glmLRT(fit_zebra, coef=2)
edgeR has several useful functions for inspecting and plotting the results:
Inspecting the top hits with topTags
:
topTags(gallein)
Summarize the number of up- and down-regulated genes with decideTestsDGE
:
is_de <- decideTestsDGE(gallein, p.value=0.1) head(is_de) summary(is_de)
MA-plot, showing relation between mean expression and logFC, :
plotSmear(gallein, de.tags=rownames(gallein)[is_de != 0])
There is no built-in function for making a volcano plot, but it's easy to do yourself:
plot(-log10(PValue)~logFC, data=topTags(gallein, n=Inf, sort.by = "none"), col=factor(decideTestsDGE(gallein) != 0), pch=16)
Recently, edgeR introduced an alternative DE pipeline using Quasi-likelihood. It should be slightly more conservative than the standard edgeR pipeline. Code-wise, though, they are almost identical:
fitQL <- glmQLFit(y=disp_zebra, design=mod, robust=TRUE) galleinQL <- glmQLFTest(fitQL, coef="galleintreated") summary(decideTestsDGE(galleinQL, p.value=0.1))
The QL-pipeline estimates slightly different dispersions:
plotQLDisp(fitQL)
Use the above code as a template for conducting your own analysis. For the pasilla
dataset:
If you have time:
glmFit
with glmQLFit
and glmLRT
with glmQLFTest
Once again - the next couple of slides contain cheat sheets...
# Load the data data("pasilla") # This will makes "treated the intercept!" mod <- model.matrix(~condition, data=pasilla$Design) mod
# Relevel to make "untreated" intercept pasilla$Design$condition <- relevel(pasilla$Design$condition, "untreated") mod <- model.matrix(~condition, data=pasilla$Design) mod
# Relevel to make "untreated" intercept pasilla$Design$condition <- relevel(pasilla$Design$condition, "untreated") mod_batch <- model.matrix(~condition+type, data=pasilla$Design) mod_batch
Fit and test model without batch correction:
# Trim EM <- subset(pasilla$Expression, rowSums(pasilla$Expression >= 5) >= 3) # Calculate normalization factors dge <- calcNormFactors(DGEList(EM), method="TMM") # Calculate dispersion disp <- estimateDisp(dge, design=mod, robust=TRUE) # Fit models fit <- glmFit(disp, design=mod) # Perform the test of the given coefficient res <- glmLRT(fit, coef="conditiontreated")
Fit and test model without batch correction:
# Calculate dispersion disp_batch <- estimateDisp(dge, design=mod_batch, robust=TRUE) # Fit models fit_batch <- glmFit(disp_batch, design=mod_batch) # Perform the test of the given coefficient res_batch <- glmLRT(fit_batch, coef="conditiontreated")
Plot dispersions
par(mfrow=c(1,2)) plotBCV(disp) plotBCV(disp_batch)
Number of DE genes
summary(decideTestsDGE(res)) summary(decideTestsDGE(res_batch))
Top genes
topTags(res) topTags(res_batch)
MA-plots:
par(mfrow=c(1,2)) plotSmear(res_batch, de.tags=rownames(res)[decideTestsDGE(res) != 0]) plotSmear(res_batch, de.tags=rownames(res_batch)[decideTestsDGE(res_batch) != 0])
Volcano plots:
par(mfrow=c(1,2)) plot(-log10(PValue)~logFC, data=topTags(res, n=Inf, sort.by = "none"), col=factor(decideTestsDGE(res) != 0), pch=16) plot(-log10(PValue)~logFC, data=topTags(res_batch, n=Inf, sort.by = "none"), col=factor(decideTestsDGE(res_batch) != 0), pch=16)
Trying the more conservative QL-pipeline
fitQL_batch <- glmQLFit(y=disp_batch, design=mod_batch, robust=TRUE) testQL_batch <- glmQLFTest(fitQL_batch, coef=2) summary(decideTestsDGE(testQL_batch, p.value=0.05))
The QL-pipeline estimates slightly different dispersions:
plotQLDisp(fitQL)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.