In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    # TODO keep in mind to save (as an rda in data) a copy of the count table

    startQuest("Differential Expression Analysis - DESeq and DESeq2")

    h1("Sexual dimorphism study Differential Expression")

    h2("Doing the differential expression analysis.")

    library(RnaSeqTutorial)
    samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
    res <- mclapply(dir(file.path(extdata(),"htseq"),
                    pattern="^[2,3].*_STAR\\.txt",
                    full.names=TRUE),function(fil){
      read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
    },mc.cores=2)
    names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
                                           pattern="^[2,3].*_STAR\\.txt"))
    addInfo <- c("__no_feature","__ambiguous",
             "__too_low_aQual","__not_aligned",
             "__alignment_not_unique")
    sel <- match(addInfo,res[[1]][,1])
    count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
    colnames(count.table) <- names(res)
    rownames(count.table) <- res[[1]][,1][-sel]
    conditions <- names(res)
    sex <- samples$sex[match(names(res),samples$sample)]
    date <- factor(samples$date[match(names(res),samples$sample)])
    pal=brewer.pal(8,"Dark2")

The QA revealed that we have a confounding factor. Luckily, enough the grouping of sex is balanced between sampling years and hence we will be able to block the year effect to investigate the sex effect.

yearSexDesign <- samples[grep("[2,3].*",samples$sample),c("sex","date")]

    h3("DESeq")

The primary reason to use DESeq over DESeq2 or edgeR or any other is that DESeq is very conservative. The second reason is that DESeq give less weight to outliers than DESeq2 does and based on the study by Pakull et al., which identify a candidate gene for sexual determination - the same we simultaneously, independantly originally identified in our analyses - it seems that our sample 226.1 may be mis-classified. However, the sex phenotyping has been done thoroughly, so it may also be that the sex determination is a more complex trait and that Potri.019G047300 is only one factor influencing it As introduced we want to look at the sex by blocking the year effect. We start by estimating the size factors and the dispersion

cdsFull <- newCountDataSet(count.table,yearSexDesign)
cdsFull = estimateSizeFactors( cdsFull )
cdsFull = estimateDispersions( cdsFull )

We can now check the dispersion estimated by DESeq, which is, obviously, conservative. I.e. most dispersion values lie below the actual dispersion fit by about 1 fold.

plotDispLSD(cdsFull)

Next we create both models (one considering the date only and on the date and sex)

fit1 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date + sex ))
fit0 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date))

For the rest of the analysis, we ignore the genes that did not converge in the previous step

sel <- !is.na(fit1$converged) & !is.na(fit0$converged) & fit1$converged & fit0$converged
fit1 <- fit1[sel,]
fit0 <- fit0[sel,]

We next calculate the GLM p-value and adjust them for multiple testing

pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )

We finally visualize the obtained results. A first insight shows that a number of genes have very high fold changes, which is due the fact that these genes have very few read in very few samples. For clarity, in the following plots, we filter those genes with a too high FC.

boxplot(rowSums(count.table[rownames(fit1[fit1$sexM > 20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme high log2FC"
)
boxplot(rowSums(count.table[rownames(fit1[fit1$sexM < -20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme low log2FC"
)

The vast majority of these genes are anyway not significant.

plot(density(padjGLM[fit1$sexM > 20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")
plot(density(padjGLM[fit1$sexM > -20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")

So we filter them out

sel <- fit1$sexM > -20 & fit1$sexM < 20
fit1 <- fit1[sel,]
padjGLM <- padjGLM[sel]
pvalsGLM <- pvalsGLM[sel]

A further look into the data reveals that one p-value is equal to 0. As this is inadequate for plotting log odds (-log of the p-value), we change these 0s to be 1 log10 fold change smaller than the otherwise smallest p-value (for plotting purposes only!)

padjGLM[padjGLM==0] <- min(padjGLM[padjGLM!=0])/10
pvalsGLM[pvalsGLM==0] <- min(pvalsGLM[pvalsGLM!=0])/10

    h3("Visualize DESeq results")

As a volcano plot with the non-adjusted p-values The red dots circles are the 4 most significantly differentially expressed genes, and the 8 with the absolute highest log2 fold changes (4 positive and 4 negative)

heatscatter(fit1$sexM,-log10(pvalsGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) p-value")
legend("topleft",bty="n","1% p-value cutoff",lty=2,col="gray")
points(fit1[pvalsGLM<.01,"sexM"],
       -log10(pvalsGLM[pvalsGLM<.01]),
       col="lightblue",pch=19)
pos <- c(order(pvalsGLM)[1:4],
         order(fit1$sexM,decreasing=TRUE)[1:4],
         order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(pvalsGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As a volcano plot with adjusted p-values

heatscatter(fit1$sexM,-log10(padjGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) adj. p-value")
legend("topleft",bty="n","1% FDR cutoff",lty=2,col="gray")
points(fit1[padjGLM<.01,"sexM"],-log10(padjGLM[padjGLM<.01]),col="lightblue",pch=19)
pos <- c(order(padjGLM)[1:4],order(fit1$sexM,decreasing=TRUE)[1:4],order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(padjGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As DESeq is an "older" implementation of this type of analysis, we will redo the analysis using DESeq2. To ultimately be able to compare the obtained set of differential expression candidate genes, we save the list of genes significant at a 1% adjusted p-value cutoff.

UmAspD <- rownames(fit1[padjGLM<.01,])

which are not so many (one)

UmAspD

    quest(18)
    endQuest()
    startQuest("DESeq vs. DESeq2")

    h3("DESeq2 differential expression analyses")

We redo the analyses performed above. As you will see, it has been made more intuitive in DESeq2 and the same can be achieved in fewer commands.

dds <- DESeqDataSetFromMatrix(
    countData = count.table,
    colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
    design = ~date+sex)
dds <- DESeq(dds)
plotDispEsts(dds)
res <- results(dds,contrast=c("sex","M","F"))

In 4 commands we have reproduced the analysis and we have observed that the dispersion estimation looked different. DESeq2 introduces a so called "shrinkage", which you can learn more about in the DESeq2 vignettes. We, next, extract the results using the same 1% cutoff and plot similar validation plots

alpha=0.01
plotMA(res,alpha=alpha)
volcanoPlot(res,alpha=alpha)

The volcano plot look similar to what we observed previously (except that we have by mistake inverted the ratio calculation, we now did F-M), although slightly less dead that in DESeq. We also do not get any genes with a p-value of 0

hist(res$padj,breaks=seq(0,1,.01))
sum(res$padj<alpha,na.rm=TRUE)

4 genes are candidates for differential expression at a 1% adjusted p-value cutoff, which we also save for later comparison

UmAspD2 <- rownames(res[order(res$padj,na.last=TRUE),][1:sum(res$padj<alpha,na.rm=TRUE),])
UmAspD2

We can already observe that one gene is common with the previous list, but that the chromosome 19 candidate has disappeared. This is surprising, given the Pakull et al. publication, that Potri.019G047300.0 (the homolog of Potra000503g03273) does not come out anymore as a differentially expressed candidate gene at a 1% FDR cutoff with DESeq2. But it's adjusted p-value is 2.1% - see below. Moreover it's cook distance - a measure of the homogeneity of the gene expression within a condition - is relatively high (2x the average) and might indicate the presence of an outlier. Looking into more details at the count values and sex assignment shows that the sample 226.1 behaves like a male sample with regards to that gene, so either the sample has been mis-sexed or the Pakull et al. results in P. tremuloides and P.tremula are only a partial view of the true biology.

as.data.frame(res)["Potra000503g03273.0",]
data.frame(
  counts=counts(dds,normalized=TRUE)["Potra000503g03273.0",],
  sex,
  date,
  conditions)
mcols(dds)[names(rowRanges(dds))=="Potra000503g03273.0",]

    quest(19)
    endQuest()
    startQuest("DESeq and DESeq2 analyses comparison")

    h3("DESeq2 with the sample re-sexed")

Nevertheless, while waiting for the next time the tree flowers and we can confirm its sex with certainty, we can assume that the sample was mis-sexed and see what effect a sex correction would have. Given the data from Pakull et al. about that gene; assuming that the sample 226.1 was mis-sexed, we redo the DESeq2 analysis after swaping the sex of that sample

sex[conditions=="226.1"] <- "M"
dds <- DESeqDataSetFromMatrix(
  countData = count.table,
  colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
  design = ~ date+sex)
dds <- DESeq(dds)
res.swap <- results(dds,contrast=c("sex","M","F"))
plotMA(res.swap,alpha=alpha)
volcanoPlot(res.swap,alpha=alpha)
hist(res.swap$padj,breaks=seq(0,1,.01))
sum(res.swap$padj<alpha,na.rm=TRUE)
UmAspD2.swap <- rownames(res.swap[order(res.swap$padj,na.last=TRUE),][1:sum(res.swap$padj<alpha,na.rm=TRUE),])
UmAspD2.swap

Fair enough, the gene made it back in the list.

    h3("DESeq2 analysis - sample removed")

But, still, to assess the importance of the 226.1 sample, we redo the DESeq2 DE analysis without it (since we have enough replicate for that removal not to affect the power of our analysis)

sel <- conditions != "226.1"
dds <- DESeqDataSetFromMatrix(
  countData = count.table[,sel],
  colData = data.frame(condition=conditions[sel],
                       sex=sex[sel],
                       date=date[sel]),
  design = ~ date+sex)
dds <- DESeq(dds)
plotDispEsts(dds)
res.sel <- results(dds,contrast=c("sex","M","F"))
plotMA(res.sel,alpha=alpha)
volcanoPlot(res.sel,alpha=alpha)
hist(res.sel$padj,breaks=seq(0,1,.01))
sum(res.sel$padj<alpha,na.rm=TRUE)
UmAspD2.sel <- rownames(res.sel[order(res.sel$padj,na.last=TRUE),][1:sum(res.sel$padj<alpha,na.rm=TRUE),])
UmAspD2.sel

    h2("Results comparison")

We compare the results from the 4 approaches, the easiest way being to plot a Vann Diagram

plot.new()
grid.draw(venn.diagram(list(
  UmAsp=UmAspD2,
  UmAsp.swap=UmAspD2.swap,
  UmAsp.sel=UmAspD2.sel,
  UmAspD = UmAspD),
                       filename=NULL,
                       col=pal[1:4],
                       category.names=c("UmAsp - DESeq2",
                                        "UmAsp - corrected",
                                        "UmAsp - removed",
                                        "UmAsp - DESeq")))

Unlike in the original analysis conducted in the study, where the Populus trichocarpa genome was used as a reference (the P. tremula genome assembly is still a draft assembly), where the gene Potri.014G155300.0 was constantly found by all 4 approaches and the gene Potri.019G047300.0 (Potra000503g03273.0 homolog) was found by DESeq and DESeq2, there is no overlap between both native methods. This is probably because the result in P. tremula are obscured by an assembly artefact. Indeed we know that the two genes in P. trichocarpa have a very high sequence similarity (being putatively very recent paralogs), which appears to have been collapsed into a single locus in P. tremula. We can also note that "messing" with the sample sex attribute or removing that sample leads to the identification of several more genes! It is unclear if these are the results of a technical error or if they would be worth investigating further.

sort(intersect(UmAspD2,UmAspD))
sort(intersect(UmAspD2.swap,UmAspD))
sort(intersect(UmAspD2.sel,UmAspD))

    quest(20)
    endQuest()
    startQuest("Appendix")

    h2("Appendix")

Analyses performed when asking Mike Love (DESeq2 developer) why DESeq2 seem so sensitive to misclassification The short answer was to use the Cook distance to further investigate that and that the 1% FDR cutoff was conservative. These are commented out on purpose. Note that the rowData function has been replaced by rowRanges in Bioconductor version 3.1 and above.

## mis-classified
# sex <- samples$sex[match(colnames(count.table),samples$sample)]
# date <- factor(samples$date[match(colnames(count.table),samples$sample)])
# ddsM <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sex,
#                        date=date),
#   design = ~ date+sex)
# ddsM <- DESeq(ddsM)
# resM <- results(ddsM,contrast=c("sex","F","M"))
# 
# counts(ddsM,normalized=TRUE)["Potri.019G047300.0",]
# resM["Potri.019G047300.0",]
# mcols(ddsM)[names(rowData(ddsM)) == "Potri.019G047300.0",]
# 
# ## reclassified
# sexR <- sex
# sexR[5] <- "M"
# ddsR <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sexR,
#                        date=date),
#   design = ~ date+sex)
# ddsR <- DESeq(ddsR)
# resR <- results(ddsR,contrast=c("sex","F","M"))
# 
# counts(ddsR,normalized=TRUE)["Potri.019G047300.0",]
# resR["Potri.019G047300.0",]
# mcols(ddsR)[names(rowData(ddsR)) == "Potri.019G047300.0",]
# 
# ## samples details
# sex
# date
# 
# ## the model is not be affected by the reclassification
# lapply(split(sex,date),table)
# lapply(split(sexR,date),table)
#