In mengyin/vashr: Variance Adaptive Shrinkage

Introduction

This package provides flexible estimation of variances using Empirical Bayes models [@lu2016vash]. Suppose we observe some raw variance estimates $\hat{s}1^2,...,\hat{s}_J^2$ that are estimates of underlying "true" variances $s_1^2,...,s_J^2$: [\hat{s}_j^2|s_j^2 \sim s_j^2 \frac{\chi^2{d_j}}{d_j},] where the degree of freedom $d_j$ depends on the sample size.

Given the observed standard errors and degree of freedom. We use an Empirical Bayes approach to estimate the prior distribution $g$ (an unimodal mixture of inverse-gamma distributions) of the variances: [s_j\sim g(\cdot).]

The posterior shrunk estimate for $s_j$: [\tilde{s}_j = E(s_j|\hat{s}_1^2,...,\hat{s}_J^2, \hat{g})] is provided in this package.

A simple simulation example

We simulate $s_1^2,...,s_J^2$ and $\hat{s}_1^2,...,\hat{s}_J^2$ from the above models:

#generate true variances (sd^2) from an inverse-gamma prior
sd <- sqrt(1/rgamma(100,5,5)) 
#observed standard errors are estimates of true sd's
sehat <- sqrt(sd^2*rchisq(100,7)/7)

Then use the main function "vash" to obtain $\tilde{s}_j$, the posterior estimate of $s_j$:

library(vashr)
#run the vash function
fit <- vash(sehat,df=7) 
#vash provides the shrunk estimates
plot(sehat, fit$sd.post, 
     xlab=expression(paste("observed standard errors ",hat(s)[j])), 
     ylab=expression(paste("posterior estimates ",tilde(s)[j]))) 
abline(0,1,lty=2)

The fitted unimodal inverse-gamma mixture prior $g$ can be checked by:

fit$fitted.g

where "pi", "alpha", "beta" and "c" are the mixture proportions, component-wise shape parameters, component-wise rate parameters and uni-mode respectively.

Application example: differential expression analysis (microarray data)

We use the following example to demonstrate how to incoporate our variance estimates into the widely used differential expression analysis pipeline, $limma$ [@smyth2005limma]:

The following microarray data are available from the R package $ecoliLeucine$. The experimental details were reported in [@salmon2003global]:

"The purpose of the work presented here is to identify the network of genes that are differentially regulated by the global E. coli regulatory protein, leucine-responsive regulatory protein (Lrp), during steady state growth in a glucose supplemented minimal salts medium. Lrp is a DNA-binding protein that has been reported to affect the expression of approximately 55 genes."

Gene expression in two E. coli bacteria strains, labelled lrp+ and lrp-, were compared using eight Affymetrix ecoli chips, four chips each for lrp+ and lrp-. Here we perform differential expression analysis between the lrp+ and lrp- strains using the $limma$ package:

library(affy)
library(ecoliLeucine)
library(limma)
data("ecoliLeucine")
eset <- rma(ecoliLeucine)
strain <- c("lrp-","lrp-","lrp-","lrp-","lrp+","lrp+","lrp+","lrp+")
design <- model.matrix(~factor(strain))
colnames(design) <- c("lrp-","lrp+vs-")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit)

Then we moderate the variance estimates of limma by our variance estimates, and compute the moderated gene-specific p-values:

library(vashr)
betahat <- fit$coefficients[,2]
sehat <- fit$stdev.unscaled[,2]*fit$sigma
fit.vash <- vash(sehat=sehat, df=fit$df.residual[1], betahat=betahat)

# compare the gene-specific p-values of limma and vash
plot(fit$p.value[,2], fit.vash$pvalue,
     xlab="limma p-values", ylab="vash p-values")

Application example: differential expression analysis (RNA-seq data)

We use the "pasilla" dataset to demonstrate the usage of this package for RNA-seq differential expression analysis (data available from the R package $pasilla$). This RNA-seq dataset is from an experiment on Drosophila melanogaster cell cultures and investigated the effect of RNAi knock-down of the splicing factor pasilla [@brooks2011pasilla].

We use the pipeline suggested in [@law2014voom], $voom+limma$, to perform differential expression analysis between the two conditions "treated" and "untreated": first use the $voom$ transformation to transform RNA-seq read counts into continuous response (log-cpm), then fit weighted least squares regressions to estimate effects and the de-trended gene-specific variances, and finally use the standard limma pipeline to shrink the de-trended variances.

library("pasilla")
library("Biobase")
library("edgeR")
data("pasillaGenes")
countData <- counts(pasillaGenes)
colData <- pData(pasillaGenes)[,c("condition","type")]
dgecounts <- DGEList(counts=countData, group=factor(colData$condition))
dgecounts <- calcNormFactors(dgecounts)
design <- model.matrix(~colData$condition)
v <- voom(dgecounts,design,plot=FALSE) # voom transformation
fit <- lmFit(v) # fit coefficients
fit <- eBayes(fit) # EB estimation of limma
topTable(fit)

We can moderate the de-trended variance estimates of $voom+limma$ by our variance estimates and obtain the moderated gene-specific p-values by following commands:

library(vashr)
betahat <- fit$coefficients[,2]
sehat <- fit$stdev.unscaled[,2]*fit$sigma
fit.vash <- vash(sehat=sehat, df=fit$df.residual[1], 
                 betahat=betahat, scale=fit$stdev.unscaled[,2])

# compare the gene-specific p-values of limma and vash
plot(fit$p.value[,2], fit.vash$pvalue,
     xlab="voom+limma p-values", ylab="vash p-values")