Michael Love, Wolfgang Huber, Simon Anders
opts_chunk$set(dev="pdf")
Document compiled on:
Sys.time()
This document and associated files provide the code and plots for the simulation section of the DESeq2 paper, so that these results can be easily updated over time. For more details, read the paper at the following URL:
http://dx.doi.org/10.1101/002832
We assessed the sensitivity and specificity of various algorithms
using simulation to complement an analysis on real data. The following
Negative Binomial simulation samples (mean, dispersion) pairs from the
joint distribution of estimated means and dispersions from the Pickrell
et al dataset. The true differences between two groups are drawn from
either z, 0 or -z, where the 0 component represents 80% of the
genes. The absolute value of the effect size z for the 20% of genes
with differential expression is varied, as is the total sample size m
(such that each group has m/2 samples). 10,000 genes were simulated,
and each combination of parameters was repeated 6 times. The code to
generate these results is in simulateDE.R
and the code to run each
algorithm is in runScripts.R
.
Note: DSS denotes running DSS and then Benjamini-Hochberg adjustment on p-values. DSS-FDR denotes the native FDR estimation of the DSS software. SAMseq denotes running SAMseq with p-value estimation and Benjamini-Hochberg adjustment for FDR. SAMseq-FDR denotes the native FDR estimation and no p-value estimation. EBSeq likewise only produces FDR.
load("results_simulateDE.RData") res$m <- factor(res$m) levels(res$m) <- paste0("m=",levels(res$m)) res$effSize <- factor(res$effSize) levels(res$effSize) <- c("fold change 2","fold change 3","fold change 4") res$algorithm <- factor(res$algorithm) resClean <- res[!is.na(res$oneminusspecpvals),]
library("ggplot2") p <- ggplot(resClean, aes(y=sensitivity, x=oneminusspecpvals, color=algorithm, shape=algorithm)) p + geom_point() + theme_bw() + facet_grid(effSize ~ m) + scale_shape_manual(values=1:9) + xlab("1 - specificity (false positive rate)") + coord_cartesian(xlim=c(-.003,.035)) + geom_vline(xintercept=.01) + scale_x_continuous(breaks=c(0,.02))
Use of simulation to assess the sensitivity and specificity of algorithms across combinations of sample size and effect size. The sensitivity was calculated as the fraction of genes with adjusted p-value less than 0.1 among the genes with true differences between group means. The specificity was calculated as the fraction of genes with p-value greater than 0.01 among the genes with no true differences between group means. The p-value was chosen instead of the adjusted p-value, as this allows for comparison against the expected fraction of p-values less than a critical value given the uniformity of p-values under the null hypothesis.
library("ggplot2") p <- ggplot(res, aes(y=sensitivity, x=oneminusprec, color=algorithm, shape=algorithm)) p + geom_point() + theme_bw() + facet_grid(effSize ~ m) + scale_shape_manual(values=1:9) + xlab("1 - precision (false discovery rate)") + coord_cartesian(xlim=c(-.03, .3)) + geom_vline(xintercept=.1)
Sensitivity and precision of algorithms across combinations of sample size and effect size. The sensitivity was calculated as the fraction of genes with adjusted p-value less than 0.1 among the genes with true differences between group means. The precision was calculated as the fraction of genes with true differences between group means among those with adjusted p-value less than 0.1.
library("reshape") id.vars <- c("algorithm","effSize","m") measure.vars <- c("sens0to20","sens20to100","sens100to300","sensmore300") melted <- melt(res[,c(id.vars,measure.vars)], id.vars=id.vars, measure.vars=measure.vars) names(melted) <- c(id.vars, "aveexp", "sensitivity") levels(melted$aveexp) <- c("<20","20-100","100-300",">300") p <- ggplot(melted, aes(y=sensitivity, x=aveexp, group=algorithm, color=algorithm, shape=algorithm)) p + stat_summary(fun.y="mean", geom="line") + stat_summary(fun.y="mean", geom="point") + theme_bw() + facet_grid(effSize ~ m) + scale_shape_manual(values=1:9) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("mean counts")
The sensitivity of algorithms across combinations of sample size and effect size, and further stratified by the mean of counts of the differentially expressed genes in the simulation data. Points indicate the average over 6 replicates. Algorithms all show a similar dependence of sensitivity on the mean of counts. The height of the sensitivity curve should be compared with the previous plot indicating the total sensitivity and specificity of each algorithm.
The following plots examine the affect of outliers on differential
calls by the two Negative-Binomial-based methods DESeq2 and edgeR.
DESeq2 was run with default settings, after turning off
gene filtering, and after turning off outlier replacement. edgeR
was run with default settings, and after using the robust option.
The code to generate these results is in simulateOutliers.R
.
load("results_simulateOutliers.RData") # when < 7 replicates DESeq does not replace res <- res[!(res$algorithm == "DESeq2-noRepl" & res$m < 14),] # when >= 7 replicates DESeq does not filter res <- res[!(res$algorithm == "DESeq2-noFilt" & res$m >= 14),] res$m <- factor(res$m) levels(res$m) <- paste0("m=",levels(res$m)) res$percentOutlier <- 100 * res$percentOutlier res$percentOutlier <- factor(res$percentOutlier) levels(res$percentOutlier) <- paste0(levels(res$percentOutlier),"% outlier")
Because the sensitivity-specificity curve is evaluated using the p value, we use fhe following code to pick out the point on the sensitivity-specificity curve with largest p value such that the nominal adjusted p-value is less than 0.1.
resSensPadj <- res[res$senspadj < .1,] resSensPadj <- resSensPadj[nrow(resSensPadj):1,] resSensPadj <- resSensPadj[!duplicated(with(resSensPadj, paste(algorithm, m, percentOutlier))),] summary(resSensPadj$senspadj)
library("ggplot2") p <- ggplot(res, aes(x=oneminusspec, y=sensitivity, color=algorithm)) p + scale_x_continuous(breaks=c(0,.1,.2)) + scale_y_continuous(breaks=c(0,.2,.4,.6,.8)) + geom_line() + theme_bw() + facet_grid(m ~ percentOutlier) + xlab("1 - specificity") + coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .9)) + geom_point(aes(x=oneminusspec, y=sensitivity, shape=algorithm), data=res[res$precpadj == .1,])
Sensitivity-specificity curves for detecting true differences in the presence of outliers. Negative Binomial counts were simulated for 4000 genes and total sample sizes (m) of 10 and 20, for a two-group comparison. 80% of the simulated genes had no true differential expression, while for 20% of the genes true logarithmic (base 2) fold changes of -1 or 1. The number of genes with simulated outliers was increased from 0% to 15%. The outliers were constructed for a gene by multiplying the count of a single sample by 100. Sensitivity and specificity were calculated by thresholding on p-values. Points indicate an adjusted p-value of 0.1. DESeq2 filters genes with potential outliers for samples with 3 to 6 replicates, while replacing outliers for samples with 7 or more replicates, hence the filtering can be turned off for the top row (m=10) and the replacement can be turned off for the bottom row (m=20).
p <- ggplot(res, aes(x=precpadj, y=oneminusprec, color=algorithm)) p + scale_x_continuous(breaks=c(0,.1,.2)) + scale_y_continuous(breaks=c(0,.1,.2)) + geom_line() + theme_bw() + facet_grid(m ~ percentOutlier) + geom_abline(intercept=0,slope=1) + xlab("adjusted *p*-value") + ylab("1 - precision (FDR)") + coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .25)) + geom_point(aes(x=precpadj, y=oneminusprec, shape=algorithm), data=res[res$precpadj == .1,])
Outlier handling: One minus the precision (false discovery rate) plotted over various thresholds of adjusted p-value. Shown is the results for the same simulation with outliers described in the previous figure. Points indicate an adjusted p-value of 0.1.
The following simulations used Negative Binomial random variables
with mean and dispersion pairs samples from the joint
distribution of mean-dispersion estimates from the Pickrell data. In
addition, true differences between two groups were randomly generated,
according to the following models, diagrammed below. The
accuracy of four methods for estimating the log fold change between
groups were compared by the root mean squared error (RMSE) and the
mean absolute error (MAE). The four methods were chosen for their
particular focus on the logs fold change estimate.
The code to generate these results is in simulateLFCAccuracy.R
.
par(mfrow=c(2,2),mar=c(3,3,3,1)) n <- 1000 brks <- seq(from=-4,to=4,length.out=20) trimit <- function(x) x[x > -4 & x < 4] # for visualization only myhist <- function(x, ...) hist(x, breaks=brks, border="white", col="blue", xlab="", ylab="", ...) myhist(trimit(rnorm(n)), main="bell") myhist(trimit(c(rnorm(n * 8/10), runif(n * 2/10, -4, 4))), main="slab bell") myhist(c(rep(0, n * 8/10), runif(n * 2/10, -4, 4)), main="slab spike") myhist(c(rep(0, n * 8/10), sample(c(-2, 2), n * 2/10, TRUE)), main="spike spike")
Benchmarking LFC estimation: Models for simulating logarithmic (base 2) fold changes. For the bell model, true logarithmic fold changes were drawn from a Normal with mean 0 and variance 1. For the slab bell model, true logarithmic fold changes were drawn for 80% of genes from a Normal with mean 0 and variance 1 and for 20% of genes from a Uniform distribution with range from -4 to 4. For the slab spike model, true logarithmic fold changes were drawn similarly to the slab bell model except the Normal is replaced with a spike of logarithmic fold changes at 0. For the spike spike model, true logarithmic fold changes were drawn according to a spike of logarithmic fold changes at 0 (80%) and a spike randomly sampled from -2 or 2 (20%). These spikes represent fold changes of 1/4 and 4, respectively.
load("results_simulateLFCAccuracy.RData") library("ggplot2") library("Hmisc") p <- ggplot(data=res, aes(x=m, y=RMSE, color=method, shape=method)) p + stat_summary(fun.y=mean, geom="point") + stat_summary(fun.y=mean, geom="line") + stat_summary(fun.data=mean_cl_normal, geom="errorbar") + theme_bw() + xlab("total sample size") + facet_wrap(~ type) + scale_x_continuous(breaks=unique(res$m))
Root mean squared error (RMSE) for estimating logarithmic fold changes under the four models of logarithmic fold changes and varying total sample size m. Simulated Negative Binomial counts were generated for two groups and for 1000 genes. Points and error bars are drawn for the mean and 95% confidence interval over 10 replications.
p <- ggplot(data=res[grepl("spike",res$type),], aes(x=m, y=DiffRMSE, color=method, shape=method)) p + stat_summary(fun.y=mean, geom="point") + stat_summary(fun.y=mean, geom="line") + stat_summary(fun.data=mean_cl_normal, geom="errorbar") + theme_bw() + xlab("total sample size") + ylab("RMSE only of DE genes") + facet_wrap(~ type) + scale_x_continuous(breaks=unique(res$m))
Root mean squared error (RMSE) of logarithmic fold change estimates, only considering genes with non-zero true logarithmic fold change. For the same simulation, shown here is the error only for the 20% of genes with non-zero true logarithmic fold changes (for bell and slab bell all genes have non-zero logarithmic fold change).
p <- ggplot(data=res, aes(x=m, y=MAE, color=method, shape=method)) p + stat_summary(fun.y=mean, geom="point") + stat_summary(fun.y=mean, geom="line") + stat_summary(fun.data=mean_cl_normal, geom="errorbar") + theme_bw() + xlab("total sample size") + ylab("MAE") + facet_wrap(~ type) + scale_x_continuous(breaks=unique(res$m))
Mean absolute error (MAE) of logarithmic fold change estimates. Results for the same simulation, however here using median absolute error in place of root mean squared error. Mean absolute error places less weight on the largest errors.
p <- ggplot(data=res[grepl("spike",res$type),], aes(x=m, y=DiffMAE, color=method, shape=method)) p + stat_summary(fun.y=mean, geom="point") + stat_summary(fun.y=mean, geom="line") + stat_summary(fun.data=mean_cl_normal, geom="errorbar") + theme_bw() + xlab("total sample size") + ylab("MAE only of DE genes") + facet_wrap(~ type) + scale_x_continuous(breaks=unique(res$m))
Mean absolute error (MAE) of logarithmic fold change estimates, only considering those genes with non-zero true logarithmic fold change.
The following simulation evaluated a set of methods for
transformation, and for calculating distances betweeen vectors of
counts, for their performance in recapturing true clusters in
simulated data. Negative Binomial counts were generated in
four groups, each with four samples. These groups were generated with
20% of genes given Normally-distributed log fold changes from a centroid.
The standard deviation of the Normal for the non-null genes
was varied to make the clustering easier or more difficult.
The mean of the centroid and
the dispersion of the counts were drawn as pairs from the joint
distribution of estimates from the Pickrell et al
dataset. As the Pickrell dataset has high dispersion (RNA-Seq counts
of lymphoblast cells across a population of individuals), simulations
were also considered wherein the dispersion was 0.1 and 0.25 times the
Pickrell dispersions. Hierarchical clustering with complete linkage was
used to separate the samples into four predicted clusters, using a
variety of combinations of transformation and distance. These
predicted clusters were then compared to the true clusters according to
the simulation using the adjusted Rand Index. Furthermore, two
variations were considered, one in which the size factors between
conditions were equal and one in which the size factors within each
group were [1, 1, 1/3, 3].
The code to generate these results is in simulateCluster.R
.
load("results_simulateCluster.RData") library("ggplot2") library("Hmisc") res$sizeFactor <- factor(res$sizeFactor) levels(res$sizeFactor) <- paste("size factors", levels(res$sizeFactor)) res$dispScale <- factor(res$dispScale) levels(res$dispScale) <- paste(levels(res$dispScale),"x dispersion") p <- ggplot(res, aes(x=rnormsd, y=ARI, color=method, shape=method)) p + stat_summary(fun.y=mean, geom="point", aes(shape=method)) + stat_summary(fun.y=mean, geom="line") + stat_summary(fun.data=mean_cl_normal, geom="errorbar") + facet_grid(sizeFactor ~ dispScale, scale="free") + theme_bw() + ylab("adjusted Rand Index") + xlab("SD of group differences")
Adjusted Rand Index from clusters using various transformation and distances compared to the true clusters from simulation. The methods assessed were Euclidean distance on counts normalized by size factor, log2 of normalized counts plus a pseudocount of 1, and after applying the rlog and variance stabilizing transformation. Additionally, the Poisson Distance from the PoiClaClu package was used for hierarchical clustering. The points indicate the mean from 20 simulations and the bars are 95 percent confidence intervals.
As discussed by Robinson and Smyth and by Rapaport et al., it is desirable that, in the situation in which a gene is only expressed in one condition, the statistical significance increase with the signal to noise ratio of expression in the expressed condition. For example, Rapaport et al. plot the $-\log_{10}(p)$ for p values over the signal to noise ratio. In the following code chunk we demontrate that DESeq2 has increasing $-\log_{10}(p)$ in a comparison of two conditions in which one group has all zero counts, e.g.: ${0,0,0}$ vs ${10,10,10}$.
m <- 6 disp <- 0.5 ii <- seq(from=1,to=4,length=7) coldata <- DataFrame(x=factor(rep(1:2,each=m/2))) pvals <- sapply(ii, function(i) { mat <- matrix(as.integer(c(rep(0,m/2),rep(10^i,m/2))),nrow=1) dds <- DESeqDataSetFromMatrix(mat, coldata, ~ x) sizeFactors(dds) <- rep(1,m) dispersions(dds) <- disp results(nbinomWaldTest(dds))$pvalue }) plot(10^ii, -log10(pvals), log="x", type="b", xaxt="n", xlab="group 2 counts", ylab=expression(-log[10](p~value))) axis(1,10^(1:4),10^(1:4))
The session information saved in the simulateDE
script:
sessInfo
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.