---
title: "Normalization and RNA-composition"
author: "Malte Thodberg"
date: "2018-09-06"
output:
  ioslides_presentation:
    smaller: true
    highlight: tango
    transition: faster
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{EDA}
  %\usepackage[UTF-8]{inputenc}
---
This presentation gives some background on normalizing an expression matrix (EM) for library size and RNA composition,
as well as some examples of how this is done in R using the edgeR package.
Density curves and log-log plots will be used to explore the effects of different normalization methods.
Set up a simple EM:
sample1 = c(10, 20, 30, 10, 10, 10) # Library size of 90 counts
sample2 = 2 + sample1 * 2 # Roughly double the library size
sample3 = 1 + sample1 * 3 # Roughly triple the library size
EM = data.frame(sample1, sample2, sample3)
EM
## sample1 sample2 sample3
## 1 10 22 31
## 2 20 42 61
## 3 30 62 91
## 4 10 22 31
## 5 10 22 31
## 6 10 22 31
Note the different library sizes:
colSums(EM)
## sample1 sample2 sample3
## 90 192 276
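The difference in library size can be made explicit by dividing each total by the total of sample1. The following is a minimal sketch that rebuilds the EM from above; `lib.ratio` is a name introduced here for illustration only.

```r
sample1 = c(10, 20, 30, 10, 10, 10)
sample2 = 2 + sample1 * 2
sample3 = 1 + sample1 * 3
EM = data.frame(sample1, sample2, sample3)

# Library size of each sample relative to sample1
lib.ratio = colSums(EM) / colSums(EM)["sample1"]
round(lib.ratio, 2)
```

Because of the small offsets added to sample2 and sample3, the ratios are slightly above 2 and 3 rather than exactly 2 and 3.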
TPM scaling:
scale(EM, center=FALSE, scale=colSums(EM)) # Let's forget the "per million" (M) part for now...
## sample1 sample2 sample3
## [1,] 0.1111111 0.1145833 0.1123188
## [2,] 0.2222222 0.2187500 0.2210145
## [3,] 0.3333333 0.3229167 0.3297101
## [4,] 0.1111111 0.1145833 0.1123188
## [5,] 0.1111111 0.1145833 0.1123188
## [6,] 0.1111111 0.1145833 0.1123188
## attr(,"scaled:scale")
## sample1 sample2 sample3
## 90 192 276
Corresponding rows now have nearly identical values across samples, so the samples can be compared directly!
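For completeness, the "per million" part that was skipped above is just a multiplication by one million. The following is a minimal sketch that rebuilds the EM from above; the name `TPM` is introduced here for illustration only.

```r
sample1 = c(10, 20, 30, 10, 10, 10)
sample2 = 2 + sample1 * 2
sample3 = 1 + sample1 * 3
EM = data.frame(sample1, sample2, sample3)

# Proportion of each library, multiplied by one million
TPM = scale(EM, center=FALSE, scale=colSums(EM)) * 1e6

# Every column now sums to one million, regardless of the original library size
colSums(TPM)
```

The multiplication is the same constant for every sample, so it changes the units but not the comparison between samples.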
Introduce differential expression (DE) for some TCs (rows 4-6):
EM.DE = EM
EM.DE[4:6,2] = EM.DE[4:6,2] * 5
EM.DE[4:6,3] = EM.DE[4:6,3] * 4
EM.DE
## sample1 sample2 sample3
## 1 10 22 31
## 2 20 42 61
## 3 30 62 91
## 4 10 110 124
## 5 10 110 124
## 6 10 110 124
The total RNA content of sample2 and sample3 has increased!
TPM scaling:
scale(EM.DE, center=FALSE, scale=colSums(EM.DE))
## sample1 sample2 sample3
## [1,] 0.1111111 0.04824561 0.05585586
## [2,] 0.2222222 0.09210526 0.10990991
## [3,] 0.3333333 0.13596491 0.16396396
## [4,] 0.1111111 0.24122807 0.22342342
## [5,] 0.1111111 0.24122807 0.22342342
## [6,] 0.1111111 0.24122807 0.22342342
## attr(,"scaled:scale")
## sample1 sample2 sample3
## 90 456 555
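Non-DE genes are now under-sampled: their counts in sample2 and sample3 are unchanged, yet their scaled values are much lower than in sample1!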
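To put a number on the under-sampling, the scaled values of the unchanged rows can be compared before and after introducing DE. The following is a minimal sketch that rebuilds both matrices from above; `before`, `after` and `ratio` are names introduced here for illustration only.

```r
sample1 = c(10, 20, 30, 10, 10, 10)
sample2 = 2 + sample1 * 2
sample3 = 1 + sample1 * 3
EM = data.frame(sample1, sample2, sample3)
EM.DE = EM
EM.DE[4:6,2] = EM.DE[4:6,2] * 5
EM.DE[4:6,3] = EM.DE[4:6,3] * 4

# Library-size scaling without and with DE
before = scale(EM, center=FALSE, scale=colSums(EM))
after = scale(EM.DE, center=FALSE, scale=colSums(EM.DE))

# Ratio of scaled values for the unchanged rows 1-3
ratio = after[1:3,] / before[1:3,]
round(ratio, 2)
```

The ratio is 1 for sample1, but only about 0.42 for sample2 and 0.50 for sample3. This is simply the old library size divided by the new one (192/456 and 276/555), since the counts in these rows did not change.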
This can affect downstream analysis, e.g. distance matrix calculations:
dist(t(scale(EM, center=FALSE, scale=colSums(EM))))
## sample1 sample2
## sample2 0.012991866
## sample3 0.004518910 0.008472956
dist(t(scale(EM.DE, center=FALSE, scale=colSums(EM.DE))))
## sample1 sample2
## sample2 0.33260796
## sample3 0.28669731 0.04593348
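One way to see why the distances grow is to scale by the total of the non-DE rows only, instead of the full library size. The following is a simplified illustration, not the method implemented in edgeR. It assumes that rows 1-3 are known in advance to be non-DE, which is not the case in a real analysis. `ref.size` and `EM.ref` are names introduced here for illustration only.

```r
sample1 = c(10, 20, 30, 10, 10, 10)
sample2 = 2 + sample1 * 2
sample3 = 1 + sample1 * 3
EM = data.frame(sample1, sample2, sample3)
EM.DE = EM
EM.DE[4:6,2] = EM.DE[4:6,2] * 5
EM.DE[4:6,3] = EM.DE[4:6,3] * 4

# Assumption: rows 1-3 are known to be non-DE and are used as the reference
ref.size = colSums(EM.DE[1:3,])

# Scale every row by the size of the reference set instead of the library size
EM.ref = scale(EM.DE, center=FALSE, scale=ref.size)

# The non-DE rows are comparable across samples again
round(EM.ref[1:3,], 3)
dist(t(EM.ref[1:3,]))
```

With this scaling the non-DE rows line up across samples again, while rows 4-6 keep their higher values in sample2 and sample3, so the remaining differences between samples reflect the DE itself.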