Effect size estimation with apeglm

knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
                      dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)

Typical RNA-seq call from DESeq2

Note: the typical RNA-seq workflow for users would be to call apeglm estimation from within the lfcShrink function from the DESeq2 package. The unevaluated code chunk shows how to obtain apeglm shrinkage estimates after running DESeq. See the DESeq2 vignette for more details. The lfcShrink wrapper function takes care of many details below, and unifies the interface for multiple shrinkage estimators. The coefficient to shrink can be specified either by name or by number (following the order in resultsNames(dds)). Be aware that DESeq2's lfcShrink interface provides LFCs on the log2 scale, while apeglm provides coefficients on the natural log scale.

res <- lfcShrink(dds, coef=2, type="apeglm")

Acknowledgments

Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette.

We have benefited in the development of apeglm from feedback or contributions from the following individuals:

Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson

Example RNA-seq analysis

Here we show example code which mimics what will happen inside the lfcShrink function when using the apeglm method [@Zhu2018].

Load a prepared SummarizedExperiment:

library(airway)
data(airway)
head(assay(airway))

For demonstration, we will use 5000 genes of the airway dataset, the first from those genes with at least 10 counts across all samples.

keep <- head(which(rowSums(assay(airway)) > 0), 5000)
airway <- airway[keep,]

First run a DESeq2 differential expression analysis [@Love2014] (size factors, and dispersion estimates could similarly be estimated using edgeR):

library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)

Defining data and parameter objects necessary for apeglm. We must multiply the coefficients from DESeq2 by a factor, because apeglm provides natural log coefficients. Again, this would be handled inside of lfcShrink in DESeq2 for a typical RNA-seq analysis.

x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
                 ncol=ncol(dds),
                 nrow=nrow(dds),byrow=TRUE)

Running apeglm

Here apeglm on 5000 genes takes less than a minute on a laptop. It scales with number of genes, the number of samples and the number of variables in the design formula, where here we have 5 coefficients (one for the four cell cultures and one for the difference due to dexamethasone treatment).

We provide apeglm with the SummarizedExperiment although the function can also run on a matrix of counts or other observed data. We specify a coef as well as a threshold which we discuss below. Note that we multiple the threshold by log(2) to convert from log2 scale to natural log scale.

library(apeglm)
system.time({
  fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x),
                threshold=log(2) * 1, mle=mle, offset=offset)
})
str(fit$prior.control)

There are faster implementations of apeglm specifically for negative binomial likelihoods. The version nbinomR is ~5 times faster than the default general.

system.time({
  fitR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR")
})

The version nbinomCR is ~10 times faster than the default general.

system.time({
  fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR")
})

The version nbinomC returns only the MAP coefficients and can be ~50-100 times faster than the default general. The MAP coefficients are the same as returned by nbinomCR above, we just skip the calculation of posterior SD. A variant of nbinomC is nbinomC* which includes random starts.

system.time({
  fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC")
})

Among other output, we have the estimated coefficients attached to the ranges of the SummarizedExperiment used as input:

class(fit$ranges)
mcols(fit$ranges, use.names=TRUE)

We can compare the coefficients from apeglm with the "normal" shrinkage type from the original DESeq2 paper (2014). This method, which makes use of a Normal-based prior, is no longer the default shrinkage estimator for lfcShrink. apeglm provides coefficients on the natural log scale, so we must convert to log2 scale by multiplying by log2(exp(1)). Note that DESeq2's lfcShrink function converts apeglm coefficients to the log2 scale internally.

system.time({
  res.shr <- lfcShrink(dds, coef=5, type="normal")
})
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]

Here we plot apeglm estimators against DESeq2:

plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)

Here we plot MLE, DESeq2 and apeglm estimators against the mean of normalized counts:

par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
     ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
     ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
     ylim=lims, main="apeglm", xlab=xlab)
hline()

Specific coefficients

Note that p-values and FSR define different events, and are not on the same scale. An FSR of 0.5 means that the estimated sign is as bad as random guess.

par(mfrow=c(1,2),mar=c(5,5,1,1))
plot(res$pvalue, fit$fsr, col="blue",
     xlab="DESeq2 pvalue", ylab="apeglm local FSR",
     xlim=c(0,1), ylim=c(0,.5))
abline(0,1)
plot(-log10(res$pvalue), -log10(fit$fsr),
     xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR",
     col="blue")
abline(0,1)

The s-value was proposed by @Stephens2016, as a statistic giving the aggregate false sign rate for tests with equal or lower s-value than the one considered. We recommend using a lower threshold on s-values than typically used for adjusted p-values, for example one might be interested in sets with 0.01 or 0.005 aggregate FSR.

plot(res$padj, fit$svalue, col="blue",
     xlab="DESeq2 padj", ylab="apeglm svalue",
     xlim=c(0,.2), ylim=c(0,.02))

More scrutiny can be applied by using an LFC threshold greater than zero, and asking for the probability of a "false-sign-or-small" (FSOS) event: that the effect size is not further from zero in distance than the threshold amount. We can run the svalue function on these per-gene probabilities to produce s-values that bound the FSOS rate for sets of genes. By specifying threshold=log(2) * 1 above, apeglm will then output a vector thresh in the results list that gives the per-gene probabilities of false-sign-or-small events.

s.val <- svalue(fit$thresh)
cols <- ifelse(s.val < .01, "red", "black")
keep <- res$baseMean > 10 # filter for plot
plot(res$baseMean[keep],
     log2(exp(1)) * fit$map[keep,5],
     log="x", col=cols[keep],
     xlab=xlab, ylab="LFC")
abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2)

Modeling zero-inflated counts

We have created a separate GitHub repository giving an example of how the apeglm estimator can be used for Zero-Inflated Negative Binomial data. This approach uses the zinbwave method and Bioconductor package to estimate the probability of each zero belonging to the excess zero component. We compare using a Negative Binomial likelihood with the excess zeros down-weighted and using a Zero-Inflated Negative Binomial likelihood. These two approaches with apeglm perform similarly but we note that the first approach involves less additional code and is faster to compute.

https://github.com/mikelove/zinbwave-apeglm

Modeling ratios of counts

We also show an short example using an alternative likelihood to the negative binomial. Suppose we have allele-specific counts for n=20 vs 20 samples across 5000 genes. We can define a binomial model and test for the allelic balance across groups of samples.

Here we will simulate allele counts from our existing dataset for demonstration. We spike in 10 genes with strong allelic imbalance (instead of an allelic ratio close to 0.5, these will have a ratio of 0.75).

library(emdbook)
n <- 20
f <- factor(rep(1:2,each=n))
mu <- ifelse(res$baseMean > 50, res$baseMean, 50)
set.seed(1)
cts <- matrix(rnbinom(nrow(dds)*2*n, 
                      mu=mu,
                      size=1/dispersions(dds)),
              ncol=2*n)
theta <- runif(nrow(cts),1,1000)
prob <- rnorm(nrow(cts),.5,.05) # close to 0.5
ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob,
                             size=cts, theta=rep(theta,ncol(cts))),
                  nrow=nrow(cts))
idx <- 1:10
idx2 <- which(f == 2)
theta[idx] <- 1000
prob[idx] <- 0.75
# the spiked in genes have an allelic ratio of 0.75
ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=prob[idx],
                                       size=cts[idx,idx2], theta=theta[idx]),
                            nrow=length(idx))

We first need to estimate MLE coefficients and standard errors.

One option to run apeglm would be to define a beta-binomial likelihood function which uses the total counts as a parameter, and the logit function as a link. And then this function could be provided to the log.lik argument.

betabinom.log.lik <- function(y, x, beta, param, offset) {
  xbeta <- x %*% beta
  p.hat <- (1+exp(-xbeta))^-1
  dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE)
}

However, apeglm has faster C++ implementations for the beta-binomial, which were implemented and tested by Joshua Zitovsky. Here, using method="betabinCR" is 3 times faster than using the R-defined log.lik, and the "betabinCR" method also scales significantly better with more samples and more coefficients. As with the negative binomial, method="betabinC" can be used if the standard errors are not needed, and this will be 5 times faster than the R-defined log.lik approach on this dataset.

The following code performs two iterations of estimating the MLE coefficients, and then estimating the beta-binomial dispersion.

theta.hat <- 100 # rough initial estimate of dispersion
x <- model.matrix(~f)
niter <- 3
system.time({
  for (i in 1:niter) {
    param <- cbind(theta.hat, cts)
    fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                      no.shrink=TRUE, log.link=FALSE, method="betabinCR")
    theta.hat <- bbEstDisp(success=ase.cts, size=cts,
                           x=x, beta=fit.mle$map,
                           minDisp=.01, maxDisp=5000)
  }
})

We can then plot the MLE estimates over the mean:

coef <- 2
xlab <- "mean of normalized counts"
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="log odds")
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)

Now we run the posterior estimation, including a prior on the second coefficient:

mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef])
param <- cbind(theta.hat, cts)
system.time({
  fit2 <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                 coef=coef, mle=mle, threshold=0.5,
                 log.link=FALSE, method="betabinCR")
})

In the apeglm plot, we color in red the genes with a low aggregate probability of false-sign-or-small (FSOS) events (s-value < .01), where we've again defined "small" on the log odds scale using the threshold argument above.

par(mfrow=c(1,2))
ylim <- c(-1,1.5)
s.val <- svalue(fit2$thresh) # small-or-false-sign value
plot(res$baseMean, fit.mle$map[,coef], main="MLE",
     log="x", xlab=xlab, ylab="log odds", ylim=ylim)
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
abline(h=0,col=rgb(1,0,0,.5))
cols <- ifelse(s.val < .01, "red", "black")
plot(res$baseMean, fit2$map[,coef], main="apeglm",
     log="x", xlab=xlab, ylab="log odds", col=cols, ylim=ylim)
points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3)
abline(h=0,col=rgb(1,0,0,.5))
logit <- function(x) log(x/(1-x))
logit(.75)
table(abs(logit(prob[s.val < .01])) > .5)

Session Info

sessionInfo()

References



Try the apeglm package in your browser

Any scripts or data that you put into this service are public.

apeglm documentation built on Nov. 8, 2020, 5:55 p.m.