This presentations presents some background on EM-normalization for library size and RNA-composition,
as wells as some examples on how this is applied in R using the package `edgeR`

.

Density curves and log-log plots will be used to explore the effects of different normalization methods.

Setup simple EM:

sample1 = c(10, 20, 30, 10, 10, 10) # Library size of 100 counts sample2 = 2 + sample1 * 2 # Double library size sample3 = 1 + sample1 * 3 # Triple library size EM = data.frame(sample1, sample2, sample3) EM

Note the different library sizes:

```
colSums(EM)
```

TPM scaling:

scale(EM, center=FALSE, scale=colSums(EM)) # Lets forget the M-part for now...

Samples can now be compared directly for analysis!

Introduce DE for some TCs

EM.DE = EM EM.DE[4:6,2] = EM.DE[4:6,2] * 5 EM.DE[4:6,3] = EM.DE[4:6,3] * 4 EM.DE

The total RNA content of sample2+3 has increased!

TPM scaling

scale(EM.DE, center=FALSE, scale=colSums(EM.DE))

Non-DE genes are now under-sampled!

This can affect downstream analysis i.e. distance matrix calculations.

dist(t(scale(EM, center=FALSE, scale=colSums(EM)))) dist(t(scale(EM.DE, center=FALSE, scale=colSums(EM.DE))))

Packages needed for the analysis:

library(ABC2017) library(edgeR) library(ggplot2) theme_set(theme_minimal()) # Make ggplots prettier

We will use the small `zebrafish`

dataset:

data(zebrafish)

The dataset is a list which contains:

- Expression: EM
- Annotation: Annotation
- Design: Information about the samples.

The same format is used for the remaining datasets in the `ABC2017`

package

EM:

head(zebrafish$Expression)

Annotation:

head(zebrafish$Design)

edgeR (via limma) provides the `plotDensities`

function for exploring the effect of normalization

plotDensities(zebrafish$Expression, legend="topright")

That did not look to good! Since the data spans multiple orders of magnitude, we can try with a log-scale instead.

This however brings up the problem of 0 counts - for which `log`

is not defined.

The get around this probelm a small pseudo-count can be added to all counts in the EM. This does not necesarily have to be an integer, and is usually chosen to be between 0.1 and 1.0.

# Pseoducount and log plotDensities(log(zebrafish$Expression+1), legend="topright")

Notice how the lower quartile is zero - this means that we have a large number of genes with very low counts.

Counts with 1-3 counts are not very interesting, since they are likely to be either noise or expressed at biologically irrelevant levels. It's customary to perform som ad-hoc trimming or filtering to remove these prior to analysis.

Here we only keep genes with at least 2 counts in at least 4 samples:

# Trim above_one <- rowSums(zebrafish$Expression > 1) trimmed_em <- subset(zebrafish$Expression, above_one > 3) # Pseoducount and log log_trimmed_em <- log(trimmed_em + 1)

plotDensities(log_trimmed_em, legend="topright")

Now we have a clearer picture of the distribution of counts within each sample. The large difference in distributions shows the need for normalization, before the samples can be compared.

As with everything in R, we do not have to recode everything from scratch. The edgeR package has a function `cpm`

which has implented a large number of normalization methods and log-transformation.

edgeR does this by implementing the use of normalization factors, which is use to rescale the actual library sizes to take into account differences in RNA-composition.

Using edgeR is simple, but first we must save the EM as a `DGEList`

:

# Create DGEList-object from the trimmed em dge <- DGEList(trimmed_em) # Use edgeR to calculate normalization factors dge <- calcNormFactors(object=dge, method="TMM") # calculate log cpm values TMM_em <- cpm(x=dge, log=TRUE, prior.count=1.0) head(TMM_em)

The resulting plot shows a nicer alignment of the main peak:

plotDensities(TMM_em)

Another way of visualizing normalization is via a log-log plot. This is simply a scatterplot with paired expression values for two samples.

Although it only allows for pairwise comparison, it is a nice way to see the effect of normalization and the variance of expression at different levels.

First we consider the (trimmed) log(counts+1)

qplot(data=log_trimmed_em, x=Ctl3, y=Trt13, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

Compare this with edgeR's TMM normalization:

qplot(data=as.data.frame(TMM_em), x=Ctl3, y=Trt13, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

Team up and do the following exercises:

- Make a plot of density curves showing the effect of all the different normalization methods implemented by
`edgeR`

- Make a plot that compares the estimated normalization factors for the
`TMM`

and`RLE`

normalization methods. - Investigate the effect of different pseudocount values using log-log plots.

*HINT for 1:* Use `apply`

-family and facets to compare multiple datasets. Remember to include `method="none"`

.

*HINT for 2:* Read the `calcNormFactors`

help file to see where the normalization factors are stored

# Convert to a DGElist dge <- DGEList(trimmed_em) # Normalize using each of four methods edgeR_methods <- c("none", "TMM", "RLE", "upperquartile") dges <- lapply(edgeR_methods, calcNormFactors, object=dge) # Calculate CPMs norms <- lapply(dges, cpm, log=TRUE)

par(mfrow=c(2,2)) mapply(plotDensities, norms, edgeR_methods, MoreArgs=list(group=NULL, col=NULL, legend=FALSE))

# Extract the normalization factors norm_factors <- sapply(dges, function(x) x$samples$norm.factors) colnames(norm_factors) <- edgeR_methods

```
plot(as.data.frame(norm_factors))
```

# Create DGEList-object from the trimmed em dge <- DGEList(trimmed_em) dge <- calcNormFactors(object=dge, method="TMM") # calculate log cpm values TMM_v <- cpm(x=dge, log=TRUE, prior.count=0.1) TMM_w <- cpm(x=dge, log=TRUE, prior.count=1.0) TMM_x <- cpm(x=dge, log=TRUE, prior.count=5.0) TMM_y <- cpm(x=dge, log=TRUE, prior.count=10.0) TMM_z <- cpm(x=dge, log=TRUE, prior.count=20.0)

qplot(data=as.data.frame(TMM_v), x=Ctl3, y=Ctl5, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

qplot(data=as.data.frame(TMM_w), x=Ctl3, y=Ctl5, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

qplot(data=as.data.frame(TMM_y), x=Ctl3, y=Ctl5, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

qplot(data=as.data.frame(TMM_x), x=Ctl3, y=Ctl5, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

qplot(data=as.data.frame(TMM_z), x=Ctl3, y=Ctl5, alpha=I(0.1)) + geom_smooth(method="gam") + geom_abline(color="red")

MalteThodberg/ABC2017 documentation built on Nov. 18, 2017, 7:51 a.m.

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.