knitr::opts_chunk$set(tidy=FALSE, cache=FALSE, dev="png", message=FALSE, error=FALSE, warning=TRUE)
Note: the typical RNA-seq workflow for users would be to call
apeglm estimation from within the lfcShrink
function from the
DESeq2 package. The unevaluated code chunk shows how to obtain
apeglm shrinkage estimates after running DESeq
. See the DESeq2 vignette for
more details. The lfcShrink
wrapper function takes care of many
details below, and unifies the interface for multiple shrinkage
estimators. The coefficient to shrink can be specified either by name
or by number (following the order in resultsNames(dds)
). Be aware
that DESeq2's lfcShrink
interface provides LFCs on the log2 scale,
while apeglm provides coefficients on the natural log scale.
res <- lfcShrink(dds, coef=2, type="apeglm")
Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette.
We have benefited in the development of apeglm
from feedback or
contributions from the following individuals:
Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson
Here we show example code which mimics what will happen inside the
lfcShrink
function when using the apeglm method [@Zhu2018].
Load a prepared SummarizedExperiment
:
library(airway) data(airway) head(assay(airway))
For demonstration, we will use 3000 genes of the airway
dataset,
the first from those genes with at least 10 counts across all samples.
keep <- head(which(rowSums(assay(airway)) >= 10), 3000) airway <- airway[keep,]
First run a DESeq2 differential expression analysis [@Love2014] (size factors, and dispersion estimates could similarly be estimated using edgeR):
library(DESeq2) dds <- DESeqDataSet(airway, ~cell + dex) dds$dex <- relevel(dds$dex, "untrt") dds <- DESeq(dds) res <- results(dds)
Defining data and parameter objects necessary for apeglm
.
We must multiply the coefficients from DESeq2 by a factor,
because apeglm provides natural log coefficients. Again,
this would be handled inside of lfcShrink
in DESeq2 for a typical
RNA-seq analysis.
x <- model.matrix(design(dds), colData(dds)) param <- dispersions(dds) mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE) offset <- matrix(log(sizeFactors(dds)), ncol=ncol(dds), nrow=nrow(dds),byrow=TRUE)
Here apeglm
on 3000 genes takes a few seconds on a laptop.
It scales with number of genes, the number of samples and the number
of variables in the design formula, where here we have 5 coefficients
(one for the four cell cultures and one for the difference due to
dexamethasone treatment).
We provide apeglm
with the SummarizedExperiment although the
function can also run on a matrix of counts or other observed data.
We specify a coef
as well as a threshold
which we discuss
below. Note that we multiple the threshold
by log(2)
to convert
from log2 scale to natural log scale.
The original interface to apeglm looked as follows, but was slow, we therefore recommend using the faster interfaces shown below.
# original interface, also may give Lapack error fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset)
There are better, faster implementations of apeglm specifically for
negative binomial likelihoods. The version nbinomR
is ~5 times
faster than the default method="general"
.
We will use this run for downstream analysis. Here we pass the
SummarizedExperiment, which will pass back the ranges. For faster
apeglm
runs, provide only the counts matrix (as seen in subsequent
chunks).
library(apeglm) system.time({ fitR <- apeglm(Y=airway, x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR") }) fit <- fitR names(fit) str(fit$prior.control)
The version nbinomCR
is ~10 times faster than the default
general
.
system.time({ fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR") })
The version nbinomC
returns only the MAP coefficients and
can be ~50-100 times faster than the default general
. The MAP
coefficients are the same as returned by nbinomCR
above, we just
skip the calculation of posterior SD. A variant of nbinomC
is
nbinomC*
which includes random starts.
system.time({ fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC") })
Among other output, we have the estimated coefficients attached to the ranges of the SummarizedExperiment used as input:
class(fit$ranges) mcols(fit$ranges, use.names=TRUE)
We can compare the coefficients from apeglm with the "normal"
shrinkage type from the original DESeq2 paper (2014). This method,
which makes use of a Normal-based prior, is no longer the default
shrinkage estimator for lfcShrink
.
apeglm provides coefficients on the natural log scale, so we must
convert to log2 scale by multiplying by log2(exp(1))
. Note that
DESeq2's lfcShrink
function converts apeglm coefficients to the log2
scale internally.
system.time({ res.shr <- lfcShrink(dds, coef=5, type="normal") }) DESeq2.lfc <- res.shr$log2FoldChange apeglm.lfc <- log2(exp(1)) * fit$map[,5]
Here we plot apeglm estimators against DESeq2:
plot(DESeq2.lfc, apeglm.lfc) abline(0,1)
Here we plot MLE, DESeq2 and apeglm estimators against the mean of normalized counts:
par(mfrow=c(1,3)) lims <- c(-8,8) hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2)) xlab <- "mean of normalized counts" plot(res$baseMean, res$log2FoldChange, log="x", ylim=lims, main="MLE", xlab=xlab) hline() plot(res$baseMean, DESeq2.lfc, log="x", ylim=lims, main="DESeq2", xlab=xlab) hline() plot(res$baseMean, apeglm.lfc, log="x", ylim=lims, main="apeglm", xlab=xlab) hline()
Note that p-values and FSR define different events, and are not on the same scale. An FSR of 0.5 means that the estimated sign is as bad as random guess.
par(mfrow=c(1,2),mar=c(5,5,1,1)) plot(res$pvalue, fit$fsr, col="blue", xlab="DESeq2 pvalue", ylab="apeglm local FSR", xlim=c(0,1), ylim=c(0,.5)) abline(0,1) plot(-log10(res$pvalue), -log10(fit$fsr), xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR", col="blue") abline(0,1)
The s-value was proposed by @Stephens2016, as a statistic giving the aggregate false sign rate for tests with equal or lower s-value than the one considered. We recommend using a lower threshold on s-values than typically used for adjusted p-values, for example one might be interested in sets with 0.01 or 0.005 aggregate FSR.
plot(res$padj, fit$svalue, col="blue", xlab="DESeq2 padj", ylab="apeglm svalue", xlim=c(0,.2), ylim=c(0,.02))
More scrutiny can be applied by using an LFC threshold greater
than zero, and asking for the probability of a "false-sign-or-small"
(FSOS) event:
that the effect size is not further from zero in distance than
the threshold amount. We can run the svalue
function on these
per-gene probabilities to produce s-values that bound the FSOS rate
for sets of genes.
By specifying threshold=log(2) * 1
above, apeglm
will then output
a vector thresh
in the results list that gives the per-gene
probabilities of false-sign-or-small events.
s.val <- svalue(fit$thresh) cols <- ifelse(s.val < .01, "red", "black") keep <- res$baseMean > 10 # filter for plot plot(res$baseMean[keep], log2(exp(1)) * fit$map[keep,5], log="x", col=cols[keep], xlab=xlab, ylab="LFC") abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2)
We have created a separate GitHub repository giving an example of how the apeglm estimator can be used for Zero-Inflated Negative Binomial data. This approach uses the zinbwave method and Bioconductor package to estimate the probability of each zero belonging to the excess zero component. We compare using a Negative Binomial likelihood with the excess zeros down-weighted and using a Zero-Inflated Negative Binomial likelihood. These two approaches with apeglm perform similarly but we note that the first approach involves less additional code and is faster to compute.
https://github.com/mikelove/zinbwave-apeglm
We also show an short example using an alternative likelihood to the negative binomial. Suppose we have allele-specific counts for n=20 vs 20 samples across 5000 genes. We can define a binomial model and test for the allelic balance across groups of samples.
Here we will simulate allele counts from our existing dataset for demonstration. We spike in 10 genes with strong allelic imbalance (instead of an allelic ratio close to 0.5, these will have a ratio of 0.75).
library(emdbook) n <- 20 f <- factor(rep(1:2,each=n)) mu <- ifelse(res$baseMean > 50, res$baseMean, 50) set.seed(1) cts <- matrix(rnbinom(nrow(dds)*2*n, mu=mu, size=1/dispersions(dds)), ncol=2*n) theta <- runif(nrow(cts),1,1000) prob <- rnorm(nrow(cts),.5,.05) # close to 0.5 ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob, size=cts, theta=rep(theta,ncol(cts))), nrow=nrow(cts)) idx <- 1:10 idx2 <- which(f == 2) theta[idx] <- 1000 prob[idx] <- 0.75 # the spiked in genes have an allelic ratio of 0.75 ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=prob[idx], size=cts[idx,idx2], theta=theta[idx]), nrow=length(idx))
We first need to estimate MLE coefficients and standard errors.
One option to run apeglm
would be to define a beta-binomial
likelihood function which uses the total counts as a parameter, and
the logit function as a link. And then this function could be provided
to the log.lik
argument.
betabinom.log.lik <- function(y, x, beta, param, offset) { xbeta <- x %*% beta p.hat <- (1+exp(-xbeta))^-1 dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE) }
However, apeglm
has faster C++ implementations for the beta-binomial,
which were implemented and tested by Joshua Zitovsky. Here, using
method="betabinCR"
is 3 times faster than using the R-defined
log.lik
, and the "betabinCR"
method also scales significantly
better with more samples and more coefficients. As with the negative
binomial, method="betabinC"
can be used if the standard errors are
not needed, and this will be 5 times faster than the R-defined
log.lik
approach on this dataset.
The following code performs two iterations of estimating the MLE coefficients, and then estimating the beta-binomial dispersion.
theta.hat <- 100 # rough initial estimate of dispersion x <- model.matrix(~f) niter <- 3 system.time({ for (i in 1:niter) { param <- cbind(theta.hat, cts) fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param, no.shrink=TRUE, log.link=FALSE, method="betabinCR") theta.hat <- bbEstDisp(success=ase.cts, size=cts, x=x, beta=fit.mle$map, minDisp=.01, maxDisp=5000) } })
We can then plot the MLE estimates over the mean:
coef <- 2 xlab <- "mean of normalized counts" plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="log odds") points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
Now we run the posterior estimation, including a prior on the second coefficient:
mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef]) param <- cbind(theta.hat, cts) system.time({ fit2 <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param, coef=coef, mle=mle, threshold=0.5, log.link=FALSE, method="betabinCR") })
In the apeglm
plot, we color in red the genes with a low aggregate
probability of false-sign-or-small (FSOS) events (s-value < .01), where
we've again defined "small" on the log odds scale using the
threshold
argument above.
par(mfrow=c(1,2)) ylim <- c(-1,1.5) s.val <- svalue(fit2$thresh) # small-or-false-sign value plot(res$baseMean, fit.mle$map[,coef], main="MLE", log="x", xlab=xlab, ylab="log odds", ylim=ylim) points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5)) cols <- ifelse(s.val < .01, "red", "black") plot(res$baseMean, fit2$map[,coef], main="apeglm", log="x", xlab=xlab, ylab="log odds", col=cols, ylim=ylim) points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5))
logit <- function(x) log(x/(1-x)) logit(.75) table(abs(logit(prob[s.val < .01])) > .5)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.