In MalteThodberg/Shenzen2016: Datasets for Introduction to Advanced Topics in Bioinformatics 2016

Teachers and slides

Malte Thodberg
PhD Student ~ 1 year
Focus: Clinical Bioinformatics
Mail: malte.thodberg@bio.ku.dk
Background: BSc Molecular Biomedicine & MSc Bioinformatics

Practical Outline

The plan:

Introduction to model.matrix for building model matrices
Example analysis on a single dataset
Play around with the remaining datasets.

Datasets

Example datasets:

iris dataset from datasets (default R-package).
oliveoil dataset from the pdfCluster package.

Real datasets:

zebrafish dataset
fission dataset
pasilla dataset

Advanced dataset for playing around:

tissues dataset

Design and model matrices

R is a statistical programming language

R has a powerful built-in language for linear models, making it simple to form a model matrix (0-1) representation of a design matrix.

This is done by using the model.matrix function with the build in formula language (preceded by ~).

Let's make a simple dummy design we can explore

Remember, this is how the design matrix for the zebrafish dataset looks:

dummy <- data.frame(factor1=c("A", "A", "B", "B", "A", "A", "B", "B"),
                                         factor2=c("C", "C", "C", "C", "D", "D", "D", "D"))

dummy

Basic model

Simple 2-factor design build using ~:

model.matrix(~factor1, data=dummy)

Adding more groups

3-factor design using +:

model.matrix(~factor1+factor2, data=dummy)

Interaction terms

An an interaction term using ::

model.matrix(~factor1+factor2+factor1:factor2, data=dummy)

Interaction terms

* is shorthand for both base levels and interactions:

model.matrix(~factor1*factor2, data=dummy)

Formula cheat sheet.

The R formula interface is almost a programming language in itself!

Due to it's expressiveness, it's used by huge number of packages, making it well worth the effort to learn.

See this cheatsheet for an overview:

https://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html

First steps with edgeR

Trimming and normalizing the data.

Just as before, we started by loading, trimming and normalizing the data:

# Load the data
library(Shenzen2016)
library(ggplot2)
library(edgeR)
data("zebrafish")

# Trim 
EM_zebra <- subset(zebrafish$Expression, rowSums(zebrafish$Expression >= 5) >= 3)

# Calculate normalization factors
dge_zebra <- DGEList(EM_zebra)
dge_zebra <- calcNormFactors(dge_zebra, method="TMM")

Building the model matrix

Before we can estimate dispersions, we must decide on our model matrix. Recall the simpel design of the zebrafish dataset:

zebrafish$Design

We use model.matrix to make a simple model matrix:

mod <- model.matrix(~gallein, data=zebrafish$Design)

Estimating the dispersion

Now we move on to estimating dispersion or BCV:

disp_zebra <- estimateDisp(dge_zebra, design=mod, robust=TRUE)

The 3-step estimation is all done by a single function in the newest version of edgeR: estimateDisp, which replaces the three previous functions of

estimateCommonDisp
estimateTrendedDisp
estimateTagwiseDisp

Plotting the dispersion

We can then plot the estimates:

plotBCV(y=disp_zebra)

Fiting gene-wise GLMs and testing a coefficient.

Now we can fit gene-wise GLMs:

fit_zebra <- glmFit(disp_zebra, design=mod)

Finally, we can test a coefficient in the model for DE:

# Using the name of the coefficient
gallein <- glmLRT(fit_zebra, coef="galleintreated")

# Using the index of the coefficient
gallein <- glmLRT(fit_zebra, coef=2)

Inspecting the results

edgeR has several useful functions for inspecting and plotting the results:

Inspecting the top hits with topTags:

topTags(gallein)

Inspecting the results

Summarize the number of up- and down-regulated genes with decideTestsDGE:

is_de <- decideTestsDGE(gallein, p.value=0.1)
head(is_de)
summary(is_de)

Inspecting the results

MA-plot, showing relation between mean expression and logFC, :

plotSmear(gallein, de.tags=rownames(gallein)[is_de != 0])

Inspecting the results

There is no built-in function for making a volcano plot, but it's easy to do yourself:

res <- as.data.frame(topTags(gallein, n=Inf, sort.by="none"))
ggplot(res, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()

Quasi-likelihood: Fitting models

Recently, edgeR introduced an alternative DE pipeline using Quasi-likelihood. It should be slightly more conservative than the standard edgeR pipeline. Code-wise, though, they are almost identical:

fitQL <- glmQLFit(y=disp_zebra, design=mod, robust=TRUE)
galleinQL <- glmQLFTest(fitQL, coef="galleintreated")
summary(decideTestsDGE(galleinQL, p.value=0.1))

Quasi-likehood: Plotting dispersions

The QL-pipeline estimates slightly different dispersions:

plotQLDisp(fitQL)

Final exercises:

Use the above code as a template for conducting your own analysis. For the yeast and pasilla datasets:

Load, trim and normalize the data
Inspect the study design, and decide on a model matrix to use with edgeR
Use edgeR to estimate dispersions and fit gene-wise GLMs
Report the number of up- and down-regulated genes for the tests of interests
Generate smear and volcano plots summarizing the analysis.

If you have time:

Investigate the effect of different normalization methods.
Try out the QL-pipeline by replacing glmFit with glmQLFit and glmLRT with glmQLFTest
Refering the slides, see if you can also test a contrast by using a contrast-vector of the form: c(0,0,-1,1)

Once again - the next couple of slides contain cheat sheets.

Cheatsheet: Experimenting with model matrices

# Load the data
data("yeast")
data("pasilla")

# Set a datasets
dataset <- pasilla

# Experiment with model matrices
mod <- model.matrix(~condition, data=dataset$Design)

# Look how the experiment is designed
dataset$Design

Cheatsheet: Fitting the models

# Trim 
EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3)

# Calculate normalization factors
dge <- calcNormFactors(DGEList(EM), method="TMM")

# Calculate dispersion
disp <- estimateDisp(dge, design=mod, robust=TRUE)

# Fit models
fit <- glmFit(disp, design=mod)

# Perform the test of the given coefficient
res <- glmLRT(fit, coef="conditionuntreated")

Cheatsheet: Inspecting the results

# Dispersion
plotBCV(disp)

# topTages
topTags(res)

# Up and down regulated genes
summary(decideTestsDGE(res))

# Smear plot
plotSmear(res, de.tags=rownames(res)[decideTestsDGE(res) != 0])

# Volcano
all_genes <- as.data.frame(topTags(res, n=Inf, sort.by="none"))
ggplot(all_genes, aes(x=logFC, y=-log10(PValue), color=FDR < 0.05)) + geom_point()

MalteThodberg/Shenzen2016 documentation built on May 7, 2019, 2:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MalteThodberg/Shenzen2016
Datasets for Introduction to Advanced Topics in Bioinformatics 2016

In MalteThodberg/Shenzen2016: Datasets for Introduction to Advanced Topics in Bioinformatics 2016

Teachers and slides

Practical Outline

Datasets

Design and model matrices

R is a statistical programming language

Basic model

Adding more groups

Interaction terms

Interaction terms

Formula cheat sheet.

First steps with edgeR

Trimming and normalizing the data.

Building the model matrix

Estimating the dispersion

Plotting the dispersion

Fiting gene-wise GLMs and testing a coefficient.

Inspecting the results

Inspecting the results

Inspecting the results

Inspecting the results

Quasi-likelihood: Fitting models

Quasi-likehood: Plotting dispersions

Final exercises:

Cheatsheet: Experimenting with model matrices

Cheatsheet: Fitting the models

Cheatsheet: Inspecting the results

R Package Documentation

Browse R Packages

We want your feedback!

MalteThodberg/Shenzen2016 Datasets for Introduction to Advanced Topics in Bioinformatics 2016

In MalteThodberg/Shenzen2016: Datasets for Introduction to Advanced Topics in Bioinformatics 2016

Teachers and slides

Practical Outline

Datasets

Design and model matrices

R is a statistical programming language

Basic model

Adding more groups

Interaction terms

Interaction terms

Formula cheat sheet.

First steps with edgeR

Trimming and normalizing the data.

Building the model matrix

Estimating the dispersion

Plotting the dispersion

Fiting gene-wise GLMs and testing a coefficient.

Inspecting the results

Inspecting the results

Inspecting the results

Inspecting the results

Quasi-likelihood: Fitting models

Quasi-likehood: Plotting dispersions

Final exercises:

Cheatsheet: Experimenting with model matrices

Cheatsheet: Fitting the models

Cheatsheet: Inspecting the results

R Package Documentation

Browse R Packages

We want your feedback!

MalteThodberg/Shenzen2016
Datasets for Introduction to Advanced Topics in Bioinformatics 2016