vignettes/Normalization.md

title: "Normalization and RNA-composition"
author: "Malte Thodberg"
date: "2018-09-06"
output:
  ioslides_presentation:
    smaller: true
    highlight: tango
    transition: faster
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{EDA}
  %\usepackage[UTF-8]{inputenc}

Introduction

This presentation gives some background on normalizing an expression matrix (EM) for library size and RNA composition, as well as some examples of how this is applied in R using the edgeR package.

Density curves and log-log plots will be used to explore the effects of different normalization methods.

RNA-composition and DE

TPM normalization

Set up a simple EM:

sample1 = c(10, 20, 30, 10, 10, 10) # Library size of 90 counts
sample2 = 2 + sample1 * 2 # Roughly double the library size (192 counts)
sample3 = 1 + sample1 * 3 # Roughly triple the library size (276 counts)
EM = data.frame(sample1, sample2, sample3)

EM
##   sample1 sample2 sample3
## 1      10      22      31
## 2      20      42      61
## 3      30      62      91
## 4      10      22      31
## 5      10      22      31
## 6      10      22      31

Note the different library sizes:

colSums(EM)
## sample1 sample2 sample3 
##      90     192     276

TPM normalization

TPM scaling:

scale(EM, center=FALSE, scale=colSums(EM)) # Let's forget the M-part (per million) for now...
##        sample1   sample2   sample3
## [1,] 0.1111111 0.1145833 0.1123188
## [2,] 0.2222222 0.2187500 0.2210145
## [3,] 0.3333333 0.3229167 0.3297101
## [4,] 0.1111111 0.1145833 0.1123188
## [5,] 0.1111111 0.1145833 0.1123188
## [6,] 0.1111111 0.1145833 0.1123188
## attr(,"scaled:scale")
## sample1 sample2 sample3 
##      90     192     276

Samples can now be compared directly for analysis!
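The per-million part that the scaling above leaves out can be added back by multiplying the fractions by one million. A minimal sketch on the same toy matrix; the name `per_million` is illustrative and not part of the original slides:

```r
# Rebuild the toy expression matrix from the slides
sample1 <- c(10, 20, 30, 10, 10, 10)
sample2 <- 2 + sample1 * 2
sample3 <- 1 + sample1 * 3
EM <- data.frame(sample1, sample2, sample3)

# Divide each column by its library size, then multiply by one million
per_million <- sweep(as.matrix(EM), 2, colSums(EM), "/") * 1e6

# Every column now sums to one million, whatever the original library size
print(colSums(per_million))
```

Multiplying by a constant does not change how samples compare to each other, which is why the slides can skip it when only relative values matter.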

CPM normalization

Introduce differential expression (DE) for some TCs (rows 4 to 6):

EM.DE = EM
EM.DE[4:6,2] = EM.DE[4:6,2] * 5
EM.DE[4:6,3] = EM.DE[4:6,3] * 4

EM.DE
##   sample1 sample2 sample3
## 1      10      22      31
## 2      20      42      61
## 3      30      62      91
## 4      10     110     124
## 5      10     110     124
## 6      10     110     124

The total RNA content of sample2 and sample3 has increased!
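How much the totals grew can be checked by comparing column sums before and after the change. A small sketch, where `growth` is an illustrative name:

```r
# Rebuild the toy matrix and the DE version from the slides
sample1 <- c(10, 20, 30, 10, 10, 10)
sample2 <- 2 + sample1 * 2
sample3 <- 1 + sample1 * 3
EM <- data.frame(sample1, sample2, sample3)

EM.DE <- EM
EM.DE[4:6, 2] <- EM.DE[4:6, 2] * 5
EM.DE[4:6, 3] <- EM.DE[4:6, 3] * 4

# Ratio of total counts after vs. before introducing DE
growth <- colSums(EM.DE) / colSums(EM)
print(round(growth, 2))
```

sample1 is unchanged, while the totals of sample2 and sample3 more than double even though only rows 4 to 6 were altered.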

CPM normalization

TPM scaling:

scale(EM.DE, center=FALSE, scale=colSums(EM.DE))
##        sample1    sample2    sample3
## [1,] 0.1111111 0.04824561 0.05585586
## [2,] 0.2222222 0.09210526 0.10990991
## [3,] 0.3333333 0.13596491 0.16396396
## [4,] 0.1111111 0.24122807 0.22342342
## [5,] 0.1111111 0.24122807 0.22342342
## [6,] 0.1111111 0.24122807 0.22342342
## attr(,"scaled:scale")
## sample1 sample2 sample3 
##      90     456     555

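Non-DE genes (rows 1 to 3) are now under-sampled: their raw counts did not change, yet their scaled values in sample2 and sample3 drop to roughly half of those in sample1!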
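The size of this effect can be made explicit by dividing the scaled values of the non-DE rows by their value in sample1. The sketch below also scales by the total of the non-DE rows only. That second step is an oracle used purely for illustration: it assumes we already know which rows are non-DE, which is not the case in a real analysis. The names `scaled`, `ratio`, `non_de` and `oracle` are illustrative.

```r
# Rebuild the DE version of the toy matrix from the slides
sample1 <- c(10, 20, 30, 10, 10, 10)
sample2 <- 2 + sample1 * 2
sample3 <- 1 + sample1 * 3
EM.DE <- data.frame(sample1, sample2, sample3)
EM.DE[4:6, 2] <- EM.DE[4:6, 2] * 5
EM.DE[4:6, 3] <- EM.DE[4:6, 3] * 4

# Library-size scaling, as in the slides
scaled <- sweep(as.matrix(EM.DE), 2, colSums(EM.DE), "/")

# Non-DE rows relative to sample1: values well below 1 mean under-sampling
non_de <- 1:3
ratio <- scaled[non_de, ] / scaled[non_de, "sample1"]
print(round(ratio, 2))

# Oracle scaling (assumption: the non-DE rows are known in advance)
oracle <- sweep(as.matrix(EM.DE), 2, colSums(EM.DE[non_de, ]), "/")
print(round(oracle[non_de, ], 3))
```

With the oracle scaling, the non-DE rows line up across samples again, which is the behaviour a composition-aware normalization method tries to approximate without knowing the non-DE rows beforehand.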

CPM normalization

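This can affect downstream analyses, e.g. distance matrix calculations: the distances between sample1 and the other two samples grow from about 0.01 or less to about 0.3.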

dist(t(scale(EM, center=FALSE, scale=colSums(EM))))
##             sample1     sample2
## sample2 0.012991866            
## sample3 0.004518910 0.008472956
dist(t(scale(EM.DE, center=FALSE, scale=colSums(EM.DE))))
##            sample1    sample2
## sample2 0.33260796           
## sample3 0.28669731 0.04593348
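The introduction mentions edgeR as the tool for handling RNA composition. A hedged sketch of what that might look like on the toy data follows. It assumes the Bioconductor package edgeR is installed (the code is skipped otherwise) and assumes TMM, the default method of `calcNormFactors`, is the approach meant; the names `y`, `nf` and `tmm_cpm` are illustrative. With only six rows, half of them DE, the trimmed mean has very little to work with, so read this as a demonstration of the calls rather than of how well the method performs.

```r
# Rebuild the DE version of the toy matrix from the slides
sample1 <- c(10, 20, 30, 10, 10, 10)
sample2 <- 2 + sample1 * 2
sample3 <- 1 + sample1 * 3
EM.DE <- data.frame(sample1, sample2, sample3)
EM.DE[4:6, 2] <- EM.DE[4:6, 2] * 5
EM.DE[4:6, 3] <- EM.DE[4:6, 3] * 4

if (requireNamespace("edgeR", quietly = TRUE)) {
  # Wrap the counts in a DGEList and estimate TMM normalization factors
  y <- edgeR::DGEList(counts = as.matrix(EM.DE))
  y <- edgeR::calcNormFactors(y, method = "TMM")
  nf <- y$samples$norm.factors
  print(nf)

  # cpm() uses the effective library size (lib.size * norm.factors)
  tmm_cpm <- edgeR::cpm(y)
  print(dist(t(tmm_cpm)))
}
```

The normalization factors rescale each library size so that counts are compared against an effective library size rather than the raw total.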


MalteThodberg/ABC2018 documentation built on May 27, 2019, 11:42 a.m.