knitr::opts_chunk$set(echo = TRUE,comment = "#",fig.width = 5, fig.height = 4,fig.align = "center", eval = FALSE)
This vignette introduces the important topic of using data-driven
covariance matries in mashr
. You should read the introductory
vignette before this.
Recall the four major steps in a mash analysis:
Read in the data
Set up the covariance matrices to be used
Fit the model
Extract posterior summaries and other quantities of interest
Here we pay more attention to Step 2, which can be split into two steps:
Set up the "canonical" covariance matrices.
Set up the "data-driven" covariance matrices.
The first of these steps ("canonical" covariance matrices) is
straightforward using cov_canonical
, as illustrated in
the introductory vignette.
Setting up the data-driven matrices is slightly more complex, and there are multiple possible approaches. This vignette illustrates one approach.
First we simulate some data for illustration (see the the introductory vignette for more details):
library(ashr) library(mashr) set.seed(1) simdata = simple_sims(500,5,1)
Although the user is free to set up data driven covariance matrices using any method that they would like, typically we suggest the following three-step strategy:
Locate some strong signals. For example, here we do this within
R by running a "condition-by-condition" using mash_1by1
, but you could
do this as a preprocessing step in other software - for example
in eQTL studies you might instead put together a separate dataset
containing the test results for only the "top" eQTL or eQTLs in each gene.
Use methods such as PCA or factor analysis on these top signals to
compute some initial data-driven covariance matrices (e.g. using
cov_pca
).
Use these data-driven covariance matrices as initializations for
the extreme deconvolution (ED) algorithm (using cov_ed
), to get some
refined data-driven covariance matrix estimates.
The ED step is helpful primarily because the second step will often estimate the covariances of the observed data (Bhat) whereas what is required is the covariances of the actual underlying effects (B), and this is what ED estimates. However, ED can be quite sensitive to initialization, and so the goal of the second step is to provide a good initialization.
If your entire data set (matrix of all tests in all conditions) is
not too big (eg <100k tests say) then you can simply do this by
setting up the entire data set as a mash data object
(using mash_set_data
), and then
running a condition-by-condition (1by1)
analysis on all the data.
For example, here we select the strong signals as those with lfsr<0.05 in
any condition in the 1by1 analysis.
data = mash_set_data(simdata$Bhat, simdata$Shat) m.1by1 = mash_1by1(data) strong = get_significant_results(m.1by1,0.05)
This sets up a vector strong
containing the indices corresponding
to the "significant" rows of the test results in data
. This
vector is used in subsequent functions below to specify which
rows of data
to use.
There are various approaches to this; here we just illustrate the
simplest and quickest (but probably not the best), which is to use
PCA. Specifically here we use the function cov_pca
to produce
covariance matrices based on the top 5 PCs of the strong signals. The
result is a list of 6 covariance matrices: one based on all 5 PCs, and
the others each based on one PC.
U.pca = cov_pca(data,5,subset=strong) print(names(U.pca))
The function cov_ed
is used to apply the ED algorithm from a
specified initialization (here U.pca
) and to a specified subset of
signals.
U.ed = cov_ed(data, U.pca, subset=strong)
After all this we are ready to run mash
using the data driven
matrices. Remember the Crucial Rule that we must fit mash to all
tests - do not use only the strong
subset here!!
m.ed = mash(data, U.ed) print(get_loglik(m.ed),digits = 10)
From the fit we see that the fit using the data-driven covariances is not as good as when we used the canonical covariances (which was -16120.32; from the introductory vignette). This is expected for this simulation because the simulation actually used the canonical covariances!
In general we recommend running mash
with both data-driven and
canonical covariances. You could do this by combining the data-driven
and canonical covariances as in this code:
U.c = cov_canonical(data) m = mash(data, c(U.c,U.ed)) print(get_loglik(m),digits = 10)
For an example with simulations that do not follow the standard canonical matrices see here.
print(sessionInfo())
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.