```
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE, dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)
```

**Note:** the typical RNA-seq workflow for users would be to call
*apeglm* estimation from within the `lfcShrink` function from the
*DESeq2* package. The unevaluated code chunk below shows how to obtain
*apeglm* shrinkage estimates after running `DESeq`. See the *DESeq2*
vignette for more details. The `lfcShrink` wrapper function takes care
of many of the details below, and unifies the interface for multiple
shrinkage estimators. The coefficient to shrink can be specified either
by name or by number (following the order in `resultsNames(dds)`). Be
aware that *DESeq2*'s `lfcShrink` interface provides LFCs on the log2
scale, while *apeglm* provides coefficients on the natural log scale.

```
res <- lfcShrink(dds, coef=2, type="apeglm")
```
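The coefficient can equivalently be specified by name; a minimal
sketch (the exact name is hypothetical here, and depends on your
design and factor levels, as listed by `resultsNames(dds)`):

```
## hypothetical coefficient name: check resultsNames(dds) first
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
```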

Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette.

We have benefited in the development of `apeglm` from feedback or
contributions from the following individuals:

Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson

Here we show example code which mimics what will happen inside the
`lfcShrink` function when using the *apeglm* method [@Zhu2018].

Load a prepared `SummarizedExperiment`:

```
library(airway)
data(airway)
head(assay(airway))
```

For demonstration, we will use 5000 genes of the `airway` dataset:
the first 5000 of the genes with a nonzero count sum across all
samples.

```
keep <- head(which(rowSums(assay(airway)) > 0), 5000)
airway <- airway[keep,]
```

First run a *DESeq2* differential expression analysis [@Love2014]
(size factors and dispersion estimates could similarly be estimated
using *edgeR*):

```
library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)
```
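Because the code below refers to the dexamethasone term by position
(the last coefficient), it can help to confirm the coefficient order;
a minimal check:

```
## the dexamethasone effect should be the last of the coefficients
resultsNames(dds)
```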

Next we define the data and parameter objects necessary for `apeglm`.
We must multiply the coefficients and standard errors from *DESeq2* by
`log(2)`, because *apeglm* provides natural log coefficients. Again,
this would be handled inside of `lfcShrink` in *DESeq2* for a typical
RNA-seq analysis.

```
x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
                 ncol=ncol(dds), nrow=nrow(dds), byrow=TRUE)
```

Here `apeglm` on 5000 genes takes less than a minute on a laptop. Its
running time scales with the number of genes, the number of samples,
and the number of variables in the design formula. Here we have 5
coefficients: an intercept, three coefficients fitting differences
among the four cell cultures, and one for the difference due to
dexamethasone treatment.

We provide `apeglm` with the *SummarizedExperiment*, although the
function can also run on a *matrix* of counts or other observed data.
We specify a `coef` as well as a `threshold`, which we discuss below.
Note that we multiply the `threshold` by `log(2)` to convert from the
log2 scale to the natural log scale.

```
library(apeglm)
system.time({
  fit <- apeglm(Y=airway, x=x,
                log.lik=logLikNB,
                param=param,
                coef=ncol(x),
                threshold=log(2) * 1,
                mle=mle,
                offset=offset)
})
str(fit$prior.control)
```

There are faster implementations of *apeglm* specifically for negative
binomial likelihoods. The version `nbinomR` is ~5 times faster than
the default `general`.

```
system.time({
  fitR <- apeglm(Y=assay(airway), x=x,
                 log.lik=NULL,
                 param=param,
                 coef=ncol(x),
                 threshold=log(2) * 1,
                 mle=mle,
                 offset=offset,
                 method="nbinomR")
})
```

The version `nbinomCR` is ~10 times faster than the default `general`.

```
system.time({
  fitCR <- apeglm(Y=assay(airway), x=x,
                  log.lik=NULL,
                  param=param,
                  coef=ncol(x),
                  threshold=log(2) * 1,
                  mle=mle,
                  offset=offset,
                  method="nbinomCR")
})
```

The version `nbinomC` returns only the MAP coefficients and can be
~50-100 times faster than the default `general`. The MAP coefficients
are the same as those returned by `nbinomCR` above; we just skip the
calculation of the posterior SD. A variant of `nbinomC` is `nbinomC*`,
which includes random starts.

```
system.time({
  fitC <- apeglm(Y=assay(airway), x=x,
                 log.lik=NULL,
                 param=param,
                 coef=ncol(x),
                 threshold=log(2) * 1,
                 mle=mle,
                 offset=offset,
                 method="nbinomC")
})
```
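As a quick sanity check (a sketch, not part of the original workflow),
the MAP estimates from the faster implementations should agree closely
with those from the default method, up to small numerical differences
between the optimizers:

```
## compare MAP estimates for the shrunken coefficient across methods
coef.idx <- ncol(x)
max(abs(fit$map[,coef.idx] - fitCR$map[,coef.idx]))
max(abs(fitCR$map[,coef.idx] - fitC$map[,coef.idx]))
```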

Among other output, we have the estimated coefficients attached to the
ranges of the *SummarizedExperiment* used as input:

```
class(fit$ranges)
mcols(fit$ranges, use.names=TRUE)
```

We can compare the coefficients from *apeglm* with the `"normal"`
shrinkage type from the original *DESeq2* paper [@Love2014]. This
method, which makes use of a Normal-based prior, is no longer the
default shrinkage estimator for `lfcShrink`. *apeglm* provides
coefficients on the natural log scale, so we must convert to the log2
scale by multiplying by `log2(exp(1))`. Note that *DESeq2*'s
`lfcShrink` function converts *apeglm* coefficients to the log2 scale
internally.

```
system.time({
  res.shr <- lfcShrink(dds, coef=5, type="normal")
})
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]
```

Here we plot *apeglm* estimators against *DESeq2*:

```
plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)
```

Here we plot *MLE*, *DESeq2* and *apeglm* estimators against the mean
of normalized counts:

```
par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2), col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
     ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
     ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
     ylim=lims, main="apeglm", xlab=xlab)
hline()
```

Note that p-values and FSRs define different events and are not on the
same scale. An FSR of 0.5 means that the estimated sign is as reliable
as a random guess.

```
par(mfrow=c(1,2), mar=c(5,5,1,1))
plot(res$pvalue, fit$fsr, col="blue",
     xlab="DESeq2 pvalue", ylab="apeglm local FSR",
     xlim=c(0,1), ylim=c(0,.5))
abline(0,1)
plot(-log10(res$pvalue), -log10(fit$fsr),
     xlab="-log10 DESeq2 pvalue",
     ylab="-log10 apeglm local FSR", col="blue")
abline(0,1)
```

The s-value was proposed by @Stephens2016 as a statistic giving the
aggregate false sign rate for the set of tests with s-values equal to
or lower than the one considered. We recommend using a lower threshold
on s-values than is typically used for adjusted p-values; for example,
one might be interested in sets with 0.01 or 0.005 aggregate FSR.

```
plot(res$padj, fit$svalue, col="blue",
     xlab="DESeq2 padj", ylab="apeglm svalue",
     xlim=c(0,.2), ylim=c(0,.02))
```
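Such a set can be tabulated directly; a minimal sketch using the
thresholds suggested above:

```
## size of gene sets with aggregate FSR bounded at 1% and 0.5%
sum(fit$svalue < .01, na.rm=TRUE)
sum(fit$svalue < .005, na.rm=TRUE)
```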

More scrutiny can be applied by using an LFC threshold greater than
zero, and asking for the probability of a "false-sign-or-small" (FSOS)
event: that the effect size is not further from zero in distance than
the threshold amount. We can run the `svalue` function on these
per-gene probabilities to produce s-values that bound the FSOS rate
for sets of genes. By specifying `threshold=log(2) * 1` above,
`apeglm` will then output a vector `thresh` in the results list that
gives the per-gene probabilities of false-sign-or-small events.

```
s.val <- svalue(fit$thresh)
cols <- ifelse(s.val < .01, "red", "black")
keep <- res$baseMean > 10 # filter for plot
plot(res$baseMean[keep], log2(exp(1)) * fit$map[keep,5],
     log="x", col=cols[keep], xlab=xlab, ylab="LFC")
abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2)
```

We have created a separate GitHub repository giving an example of how
the *apeglm* estimator can be used for Zero-Inflated Negative Binomial
data. This approach uses the *zinbwave* method and Bioconductor
package to estimate the probability of each zero belonging to the
excess zero component. We compare using a Negative Binomial likelihood
with the excess zeros down-weighted and using a Zero-Inflated Negative
Binomial likelihood. These two approaches with *apeglm* perform
similarly but we note that the first approach involves less additional
code and is faster to compute.

https://github.com/mikelove/zinbwave-apeglm

We also show a short example using an alternative likelihood to the
negative binomial. Suppose we have allele-specific counts for n=20 vs
20 samples across 5000 genes. We can define a binomial model and test
for allelic balance across the groups of samples.

Here we will *simulate* allele counts from our existing dataset
for demonstration. We spike in 10 genes with strong allelic imbalance
(instead of an allelic ratio close to 0.5, these will have a ratio of
0.75).

```
library(emdbook)
n <- 20
f <- factor(rep(1:2, each=n))
mu <- ifelse(res$baseMean > 50, res$baseMean, 50)
set.seed(1)
cts <- matrix(rnbinom(nrow(dds)*2*n, mu=mu, size=1/dispersions(dds)),
              ncol=2*n)
theta <- runif(nrow(cts), 1, 1000)
prob <- rnorm(nrow(cts), .5, .05) # allelic ratios close to 0.5
ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob,
                             size=cts, theta=rep(theta, ncol(cts))),
                  nrow=nrow(cts))
idx <- 1:10
idx2 <- which(f == 2)
theta[idx] <- 1000
prob[idx] <- 0.75 # the spiked-in genes have an allelic ratio of 0.75
ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2),
                                       prob=prob[idx],
                                       size=cts[idx,idx2],
                                       theta=theta[idx]),
                            nrow=length(idx))
```

We first need to estimate MLE coefficients and standard errors.

One option to run `apeglm` would be to define a beta-binomial
likelihood function which uses the total counts as a parameter and the
logit function as a link. This function could then be provided to the
`log.lik` argument.

```
betabinom.log.lik <- function(y, x, beta, param, offset) {
  xbeta <- x %*% beta
  p.hat <- (1+exp(-xbeta))^-1
  dbetabinom(y, prob=p.hat, size=param[-1],
             theta=param[1], log=TRUE)
}
```
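For illustration, a call using this R-defined likelihood might look
like the following (an unevaluated sketch: it assumes the design
matrix `x <- model.matrix(~f)` and the dispersion estimate `theta.hat`
constructed in the next code chunk):

```
## hypothetical call: each row of param bundles the dispersion
## (param[1] in the likelihood) with that gene's total counts (param[-1])
param <- cbind(theta.hat, cts)
fit.ll <- apeglm(Y=ase.cts, x=x, log.lik=betabinom.log.lik,
                 param=param, no.shrink=TRUE, log.link=FALSE)
```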

However, `apeglm` has faster C++ implementations for the
beta-binomial, which were implemented and tested by Joshua Zitovsky.
Here, using `method="betabinCR"` is 3 times faster than using the
R-defined `log.lik`, and the `"betabinCR"` method also scales
significantly better with more samples and more coefficients. As with
the negative binomial, `method="betabinC"` can be used if the standard
errors are not needed, and this will be 5 times faster than the
R-defined `log.lik` approach on this dataset.

The following code alternates a few times (three iterations here)
between estimating the MLE coefficients and estimating the
beta-binomial dispersion.

```
theta.hat <- 100 # rough initial estimate of dispersion
x <- model.matrix(~f)
niter <- 3
system.time({
  for (i in 1:niter) {
    param <- cbind(theta.hat, cts)
    fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                      no.shrink=TRUE, log.link=FALSE,
                      method="betabinCR")
    theta.hat <- bbEstDisp(success=ase.cts, size=cts, x=x,
                           beta=fit.mle$map,
                           minDisp=.01, maxDisp=5000)
  }
})
```

We can then plot the MLE estimates over the mean:

```
coef <- 2
xlab <- "mean of normalized counts"
plot(res$baseMean, fit.mle$map[,coef], log="x",
     xlab=xlab, ylab="log odds")
points(res$baseMean[idx], fit.mle$map[idx,coef],
       col="dodgerblue", cex=3)
```

Now we run the posterior estimation, including a prior on the second coefficient:

```
mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef])
param <- cbind(theta.hat, cts)
system.time({
  fit2 <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                 coef=coef, mle=mle, threshold=0.5,
                 log.link=FALSE, method="betabinCR")
})
```

In the `apeglm` plot, we color in red the genes with a low aggregate
probability of false-sign-or-small (FSOS) events (s-value < .01),
where we've again defined "small" on the log odds scale using the
`threshold` argument above.

```
par(mfrow=c(1,2))
ylim <- c(-1,1.5)
s.val <- svalue(fit2$thresh) # small-or-false-sign value
plot(res$baseMean, fit.mle$map[,coef], main="MLE",
     log="x", xlab=xlab, ylab="log odds", ylim=ylim)
points(res$baseMean[idx], fit.mle$map[idx,coef],
       col="dodgerblue", cex=3)
abline(h=0, col=rgb(1,0,0,.5))
cols <- ifelse(s.val < .01, "red", "black")
plot(res$baseMean, fit2$map[,coef], main="apeglm",
     log="x", xlab=xlab, ylab="log odds",
     col=cols, ylim=ylim)
points(res$baseMean[idx], fit2$map[idx,coef],
       col="dodgerblue", cex=3)
abline(h=0, col=rgb(1,0,0,.5))
```

We can check whether the genes in the s-value < .01 set have true log
odds larger in absolute value than our threshold of 0.5:

```
logit <- function(x) log(x/(1-x))
logit(.75) # log odds of the spiked-in allelic ratio
table(abs(logit(prob[s.val < .01])) > .5)
```

```
sessionInfo()
```
