In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    startQuest("Biological QA - Variance Stabilising Transformation")

    h1("Sexual dimorphism study Biological QA")

    library(RnaSeqTutorial)

    h2("(Re)Introduction")

As a running example, we use a dataset derived from a study performed in Populus tremula (Robinson, Delhomme et al., BMC Plant Biology, 2014}. The authors test the evolutionary theory suggesting that males and females may evolve sexually dimorphic phenotypic and biochemical traits concordant with each sex having different optimal strategies of resource investment to maximise reproductive success and fitness. Such sexual dimorphism should result in sex biased gene expression patterns in non-floral organs for autosomal genes associated with the control and development of such phenotypic traits. Hence, the authors, among other approaches have looked for gene expression differences between sex. This was achieved by an RNA-Seq differential expression analysis between samples grouped by sex. The samples have been sequenced on an Illumina HiSeq 2000, using TruSeq adapters, through a 101 cycles paired-end protocol, which yielded between 11 million and 34 million reads per sample.

    h2("Processing the data")

    h3("Reading in the data")

First, we read in the expression quantification values we obtained running the pseudo-alignment software kallisto. The so-called "abundance" files are
loaded using the tximport package

    library(tximport)
    files <- list.files("~/results/kallisto", 
                    pattern = ".*_abundance.tsv",
                    full.names = TRUE)
    tx <- suppressMessages(tximport(files = files,
                                     type = "kallisto", 
                                     txOut = TRUE))

The tx object contains the transcript expression extimates of our 17 samples.

    tx.counts <- round(tx$counts)
    colnames(tx.counts) <- sub("_.*","",basename(files))

Next we can summarise them at the gene level. The P. tremula genes are easy to extract from the transcript names, they simply have an extra dot followed by numbers following the gene identifier.

tx2gene <- data.frame(
    TX=rownames(tx.counts),
    GENEID=sub("\\.\\d+$","",rownames(tx.counts)))

Using this transcript to gene mapping we can now summarise the expression estimate at the gene level.

    count.table <- round(summarizeToGene(tx,tx2gene)$counts)
    colnames(count.table) <- sub("_.*","",basename(files))

We can estimate how many of the genes are not expressed

sel <- rowSums(count.table) == 0
sprintf("%s percent of %s genes are not expressed",round(sum(sel) * 100/ nrow(count.table),digits=1),nrow(count.table))

And similarly for transcripts

sel <- rowSums(tx.counts) == 0
sprintf("%s percent of %s transcripts are not expressed",round(sum(sel) * 100/ nrow(tx.counts),digits=1),nrow(tx.counts))

     h2("Biological QA")

To assess the validity of the replicates, we can look at different metrics.

     h3("Raw count distribution")

We start by looking at the per-gene mean expression, i.e. the mean raw count of every gene across samples is calculated and displayed on a log10 scale

library(RColorBrewer)
pal <- brewer.pal(8,"Dark2")
mar <- par("mar")
plot(density(log10(rowMeans(count.table))),col=pal[1],
     main="gene mean raw counts distribution",
     xlab="mean raw counts (log10)")

Then the same is done for the individual samples colored by sample type

plot.multidensity(log10(count.table),col=rep(pal,each=3),
                  legend.x="topright",legend.cex=0.5,
                  main="sample raw counts distribution",
                  xlab="per gene raw counts (log10)")

The samples show the same trend (since they are a subset of the original data)

plot.multidensity(log10(tx.counts),col=rep(pal,each=3),
                  legend.x="topright",legend.cex=0.5,
                  main="sample raw counts distribution",
                  xlab="per gene raw counts (log10)")

And so do the transcripts' counts. The counts are overall shifted towards the left, as expected since multiple isoforms on average contribute to a gene expression.

     h2("Variance stabilisation for better visualisation")

For visualization, the data is submitted to a variance stabilization transformation using DESeq2. The dispersion is going to be estimated independently of the sample type but we nonetheless define the metadata fo importance, i.e. every sample sex and year of sampling

samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
sex <- samples$sex[match(colnames(count.table),samples$sample)]
date <- factor(samples$date[match(colnames(count.table),samples$sample)])

And then we create the DESeq2 object, using the sample name as condition (which is hence irrelevant to any comparison)

dds <- DESeqDataSetFromMatrix(
  countData = count.table,
  colData = data.frame(sex=sex, date=date),
  design = ~ date + sex)

Once this is done, we first check the size factors and as there is no variation, we decide to go for a variance stabilisation transformation over a regularised log transformation. Check the DESeq2 vignette for more details about this.

dds <- estimateSizeFactors(dds)
sizes <- sizeFactors(dds)
sizes
boxplot(sizes,main="relative library sizes",ylab="scaling factor")

Let us do the vst. Here we do not want to give any prior to the normalisation, so we perform a BLIND transformation. If we were planning to work with the vst data for non-QC analysis, then we WOULD set blind=FALSE to ensure a more adequate dispersion estimation.

vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vst <- assay(vsd)
colnames(vst) <- colnames(count.table)

The vst introduces an offset, i.e. non expressed gene will have a value set as the minimal value of the vst. To avoid lengthy unnecessary explanations, we simply shift all values so that non expressed genes have a vst value of 0

vst <- vst - min(vst)

It is then essential to validate the vst effect and ensure its validity. To that extend, we visualize the corrected mean - sd relationship. As it is fairly linear, meaning we can assume homoscedasticity. the slight initial trend / bump is due to genes having few counts in a few subset of the samples and hence having a higher variability. This is expected.

meanSdPlot(assay(vsd)[rowSums(count.table)>0,])

Here is what would be observed by a log2 transformation of the raw count data

meanSdPlot(log2(as.matrix(count.table[rowSums(count.table)>0,])))

or for the count adjusted for the library size factor

meanSdPlot(log2(counts(dds,normalized=TRUE)[rowSums(count.table)>0,]))

    quest(2)
    endQuest()
    startQuest("Biological QA - PCA and heatmap")

     h3("PCA analysis")

We perform a Principal Component Analysis (PCA) of the data to do a quick quality assessment, i.e. replicate should cluster and the first dimensions should be explainable by biological means.

pc <- prcomp(t(vst))
percent <- round(summary(pc)$importance[2,]*100)
smpls <- conditions

We color the samples by date and sex

sex.cols<-c("pink","lightblue")
sex.names<-c(F="Female",M="Male")

And then check wheter the observed separation is due to the sample sex

scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=sex.cols[as.integer(factor(sex))],
              pch=19)
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
par(mar=mar)

This does not seem to be the case at all. There are 2 clusters of points, but both equally contain pink and blue dots. So, we next color by sampling date

dat <- date
scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=pal[as.integer(factor(dat))],
              pch=19)
legend("topleft",pch=rep(c(19,23),each=10),col=rep(pal,2),legend=levels(factor(dat)),bty="n")
par(mar=mar)

And here we are... The sampling data is a STRONG CONFOUNDING FACTOR in our analysis. So let's do a final plot with the color = sex and shape = date

scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=sex.cols[as.integer(factor(sex))],
              pch=c(19,17)[as.integer(factor(dat))])
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topleft",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)
par(mar=mar)

Sometimes the 3D PCA are not so easy to read and we may be just as well off looking at 2 dimensions at a time. First the 2 first dims

plot(pc$x[,1],
     pc$x[,2],
     xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
     ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
     col=sex.cols[as.integer(factor(sex))],
     pch=c(19,17)[as.integer(factor(dat))],
     main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)
text(pc$x[,1],  
     pc$x[,2],
     labels=colnames(count.table),cex=.5,adj=-1)

Clearly the sampling date separate most samples. It is interesting to see that the 2 samples sequenced deeper are in between the 2 clusters. We could hypothesize that the environmental change between sampling years are affecting less important genes, hence genes which expression levels are lower; i.e. somewhat looking at transcriptional "noise". Since the 2 parent samples have been sequenced much deeper, the fact that they group with neither cluster might be due to that fact that they too have an increased proportion of transcriptional noise. Looking at the 2nd and 3rd dims does not reveal any obvious effects

plot(pc$x[,2],
     pc$x[,3],
     xlab=paste("Comp. 2 (",percent[2],"%)",sep=""),
     ylab=paste("Comp. 3 (",percent[3],"%)",sep=""),
     col=sex.cols[as.integer(factor(sex))],
     pch=c(19,17)[as.integer(factor(dat))],     
     main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)