Teachers and slides

Practical Outline

The plan:

Datasets

Example datasets:

Real datasets:

Advanced dataset for playing around:

Design and model matrices

R is a statistical programming language

R has a powerful built-in language for linear models, making it simple to form a model matrix (0-1) representation of a design matrix.

This is done by using the model.matrix function with the build in formula language (preceded by ~).

Let's make a simple dummy design we can explore

Remember, this is how the design matrix for the zebrafish dataset looks:

dummy <- data.frame(factor1=c("A", "A", "B", "B", "A", "A", "B", "B"),
                                         factor2=c("C", "C", "C", "C", "D", "D", "D", "D"))

dummy

Basic model

Simple 2-factor design build using ~:

model.matrix(~factor1, data=dummy)

Adding more groups

3-factor design using +:

model.matrix(~factor1+factor2, data=dummy)

Interaction terms

An an interaction term using ::

model.matrix(~factor1+factor2+factor1:factor2, data=dummy)

Interaction terms

* is shorthand for both base levels and interactions:

model.matrix(~factor1*factor2, data=dummy)

Formula cheat sheet.

The R formula interface is almost a programming language in itself!

Due to it's expressiveness, it's used by huge number of packages, making it well worth the effort to learn.

See this cheatsheet for an overview:

https://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html

First steps with edgeR

Trimming and normalizing the data.

Just as before, we started by loading, trimming and normalizing the data:

# Load the data
library(Shenzen2016)
library(ggplot2)
library(edgeR)
data("zebrafish")

# Trim 
EM_zebra <- subset(zebrafish$Expression, rowSums(zebrafish$Expression >= 5) >= 3)

# Calculate normalization factors
dge_zebra <- DGEList(EM_zebra)
dge_zebra <- calcNormFactors(dge_zebra, method="TMM")

Building the model matrix

Before we can estimate dispersions, we must decide on our model matrix. Recall the simpel design of the zebrafish dataset:

zebrafish$Design

We use model.matrix to make a simple model matrix:

mod <- model.matrix(~gallein, data=zebrafish$Design)

Estimating the dispersion

Now we move on to estimating dispersion or BCV:

disp_zebra <- estimateDisp(dge_zebra, design=mod, robust=TRUE)

The 3-step estimation is all done by a single function in the newest version of edgeR: estimateDisp, which replaces the three previous functions of

Plotting the dispersion

We can then plot the estimates:

plotBCV(y=disp_zebra)

Fiting gene-wise GLMs and testing a coefficient.

Now we can fit gene-wise GLMs:

fit_zebra <- glmFit(disp_zebra, design=mod)

Finally, we can test a coefficient in the model for DE:

# Using the name of the coefficient
gallein <- glmLRT(fit_zebra, coef="galleintreated")

# Using the index of the coefficient
gallein <- glmLRT(fit_zebra, coef=2)

Inspecting the results

edgeR has several useful functions for inspecting and plotting the results:

Inspecting the top hits with topTags:

topTags(gallein)

Inspecting the results

Summarize the number of up- and down-regulated genes with decideTestsDGE:

is_de <- decideTestsDGE(gallein, p.value=0.1)
head(is_de)
summary(is_de)

Inspecting the results

MA-plot, showing relation between mean expression and logFC, :

plotSmear(gallein, de.tags=rownames(gallein)[is_de != 0])

Inspecting the results

There is no built-in function for making a volcano plot, but it's easy to do yourself:

res <- as.data.frame(topTags(gallein, n=Inf, sort.by="none"))
ggplot(res, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()

Quasi-likelihood: Fitting models

Recently, edgeR introduced an alternative DE pipeline using Quasi-likelihood. It should be slightly more conservative than the standard edgeR pipeline. Code-wise, though, they are almost identical:

fitQL <- glmQLFit(y=disp_zebra, design=mod, robust=TRUE)
galleinQL <- glmQLFTest(fitQL, coef="galleintreated")
summary(decideTestsDGE(galleinQL, p.value=0.1))

Quasi-likehood: Plotting dispersions

The QL-pipeline estimates slightly different dispersions:

plotQLDisp(fitQL)

Final exercises:

Use the above code as a template for conducting your own analysis. For the yeast and pasilla datasets:

If you have time:

Once again - the next couple of slides contain cheat sheets.

Cheatsheet: Experimenting with model matrices

# Load the data
data("yeast")
data("pasilla")

# Set a datasets
dataset <- pasilla

# Experiment with model matrices
mod <- model.matrix(~condition, data=dataset$Design)

# Look how the experiment is designed
dataset$Design

Cheatsheet: Fitting the models

# Trim 
EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3)

# Calculate normalization factors
dge <- calcNormFactors(DGEList(EM), method="TMM")

# Calculate dispersion
disp <- estimateDisp(dge, design=mod, robust=TRUE)

# Fit models
fit <- glmFit(disp, design=mod)

# Perform the test of the given coefficient
res <- glmLRT(fit, coef="conditionuntreated")

Cheatsheet: Inspecting the results

# Dispersion
plotBCV(disp)

# topTages
topTags(res)

# Up and down regulated genes
summary(decideTestsDGE(res))

# Smear plot
plotSmear(res, de.tags=rownames(res)[decideTestsDGE(res) != 0])

# Volcano
all_genes <- as.data.frame(topTags(res, n=Inf, sort.by="none"))
ggplot(all_genes, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()


MalteThodberg/Shenzen2016 documentation built on May 7, 2019, 2:09 p.m.