GUESS is software for Bayesian variable selection and model averaging using a stochastic search algorithm to visit the most likely models via MCMC. There also exists an R package, R2GUESS, which is a wrapper for running GUESS, and provides useful diagnostic and summary functions.
This package, GUESSFM, aims to extend GUESS for fine mapping, defined here as the task of identifying the most likely set of causal variants. The are particular challenges for fine mapping, mostly related to the density of SNP markers. Nonetheless, this vignette will at first focus on an example dataset that comes with R2GUESS.
This vignettes introduces GUESSFM, and tries to do a sample run with simulated data. The other vignettes visit some topics in more depth:
plotting and SNP groups.
Although GUESS uses g-priors, which inhibits visiting models with SNPs in high LD, we have found that running GUESS with SNPs in complete LD can lead to instability. R2GUESS was created purely to allow running GUESS on a tagged set of SNPs to approximate the posterior model space, and then expand the most interesting models to include their tags. It is perhaps useful to give an example of expanding a model.
Suppose we begin tagging, and determine a couple of tag groups as follows:
Tag | SNPs in tag group |
\(A\) | \(A_1\), \(A_2\) |
\(B\) | \(B_1\) |
R2GUESS will remove SNPs A1, A2 and B1, and will save a file
containing an object of class tags
containting the information in
the above table. Then, suppose after GUESS has run on the set of tag
SNPs, model ((A,B)) is one of the most interesting. GUESS FM will
`expand' this model to the set of models ({(A,B), (A_1,B), (A_2,B),
(A,B_1), (A_1,B_1), (A_2,B_1)}).
We assess each model using Bayes Factors. The Bayes Factor for model (M_i) is
$$BF_{i0} = \frac{P(M_i | \text{data} )}{P(M_0 | \text{data})}.$$
In a fully Bayesian analysis, the individual model probabilities are calculated by integrating over prior distributions for the model parameters (regression coefficients, for example). This is what GUESS does. However, it assumes a linear model when often we will want to fine map a disease trait and therefore use a logistic model. Further, we cannot go back and run the expanded models through GUESS. Instead, we use Approximate Bayes Factors, based on the BIC:
$$-2 \log{ABF_{i0}} = BIC(M_i) - BIC(M_0).$$
These are calculated within R via the glm
and BIC
functions.
At what threshold should you tag? Probably, as high as you can go without inducing instability in GUESS. We have found (r^2=0.99) to work in some large datasets, but your mileage may vary. In the examples below we use a lower threshold purely for demonstration.
We start with using some sample data from the snpStats package including 20 SNPs, and simulating a quantitative trait that depends on 3 causal SNPs.
library(snpStats) data(for.exercise, package="snpStats") set.seed(12346) X <- snps.10[,101:120] n <- nrow(X) causal <- c("rs1555897","rs7069505") Y <- rnorm(n,mean=as.numeric(X[,causal[1]]))*sqrt(0.2) + rnorm(n,mean=as.numeric(X[,causal[2]]))*sqrt(0.2) + rnorm(n)*sqrt(0.6)
X
contains some missing genotypes, but no SNPs with such a low call
rate we would worry in a large study. Still, the rest of the analysis
is easier to interpret for the purposes of a vignette if we fill in
the missing values.
library(GUESSFM) summary(col.summary(X)) X <- impute.missing(X) summary(col.summary(X))
Looking at the LD, we see this is a region in which D' (above the diagonal) is typically high, whilst (r^2) can be high between some SNPs, and with moderately strong (r^2 \simeq 0.7) between two of our causal SNPs:
ld <- show.ld(X=X)
First, via GUESSFM's wrapper, tagging at (r^2=0.8). This is considerably lower than suggested above, and is used here purely for demonstration as there is limited strong LD in this dataset.
## THIS DIRECTORY WILL HOLD ALL THE GUESS INPUT AND OUTPUT ## AND WILL BE CREATED IF IT DOESN'T ALREADY EXIST mydir <- tempfile() ## NB, IF GUESS IS NOT ON YOUR PATH, SUPPLY THE FULL PATH IN AN ARGUMENT ## guess.command="/path/to/GUESS" run.bvs(X,Y,gdir=mydir, tag.r2=0.95, # maximum r2 between SNPs to be modelled guess.command="GUESS", # /path/to/GUESS nexp=3, # expected number of causal variants, an overestimate nsave=1000) # number of best models to save
This can take a (long) while. For the purposes of this vignette, we will load the results from an existing run.
mydir <- system.file("extdata",package="GUESSFM") ## what files were created? list.files(mydir) ## read output with GUESSFM d <- read.snpmod(mydir) ## examine the best models and SNPs with greatest marginal support within the tagged data. best.models(d) best.snps(d)
Huh. GUESS has selected one of our causal SNPs, but not both. Why? We have a clue from the ld matrix:
sel=rownames(best.snps(d)) ld[c(sel,causal),c(sel,causal)]
So the selected SNP rs11253451 has (r^2=0.996) and (D'=1) with the causal but unselected SNP rs1555897. That would explain it.
Note that both best.models
and best.snps
allow you to specify thresholds for
how to determine "best". See their help pages for details.
The tags created within the run.bvs
function are saved to a
tags.RData
file under mydir
and can be examined.
(load(file.path(mydir,"tags.RData"))) tags tagsof(tags,causal) taggedby(tags,sel)
Indeed, rs11253451 is a tag for rs1555897.
NB: to see more about how to manipulate tags and groups objects, see the vignette
vignette("groups",package="GUESSFM")
Tagging has allowed us to shrink the model space, by assuming that models with SNPs in very high LD will have very similar likelihoods, but for fine mapping we really do want to evaluate each and every model. So, having chosen our best set of models within the shrunken space, we need to expand each of them to all the possible models they tag:
dx<-expand.tags(d,tags)
The expanded models above have all been assigned the log Bayes Factor for their nearest tag model. This isn't terrible, in practice, but if you care about fine mapping then you should get a more precise answer by refitting the most likely models individually. Note this is especially important if you have a binomial outcome, as GUESS has been run using a linear model.
Here, we take the set of most likely models which collectively capture 50% of the posterior support from GUESS, after expansion. Loading the speedglm
library makes the fitting faster, and we calculate the approximate Bayes Factors using the BIC approximation. To do this, we also need to supply some information about our prior for the number of causal variants in the model.
best <- best.models(dx,cpp.thr=0.9) library(speedglm) abf <- abf.calc(y=Y,x=X,models=best$str,family="gaussian") sm <- abf2snpmod(abf,expected=3)
Now we can explore the best SNPs and models in the tagged data, the expanded data, and the refitted data
best.snps(d) best.snps(dx) best.snps(sm)
We see that on expanding the tags we pick up the true causal variant, together with two more extraneous SNPs, but on refitting the model we see only the two causal SNPs and rs11253451, which is in very high LD with rs1555897.
A formal way to group SNPs in LD with posterior support is to use the snp.picker
function, which can also produce a plot to show how it's working.
We do not expect that we will be able to discriminate, statistically,
between highly correlated variants. Instead, the posterior support is
likely to be diluted across such sets of variants. To group such
SNPs, we used the marginal posterior probabilities of inclusion (MPPI)
for each SNP, and applied the following algorithm:
We summarize the support for any group of SNPs by the grouped marginal posterior probability of inclusion, or gMPPI.
sp <- snp.picker(sm,X) summary(sp) plot(sp)
It is useful to assess the steps of any analysis by looking at the data. With such a large number of models, the best way is to plot aspects of the data. For that reason, GUESSFM contains lots of plotting functions, described in a separate vignette. To see it, do:
vignette("plotting",package="GUESSFM")
Looping over many, many models can be made quicker by parallel
processing. GUESSFM does this by means of calls to the mclapply
function in the parallel
package. By default,
the parallel
package sets itself up to use two cores. You can
change this by setting the option mc.cores
. Eg, if you have 20
cores on your machine, you might set
options(mc.cores=16)
to use 16 of this for R, and leave the remainder free to run other processes.
Functions which make use of this (and over which you might then not to
use mclapply
are:
expand.tags
You can convert a run from R2GUESS into a snpmod object with:
## read output using a convenience wrapper for as.ESS.object() ## this returns an object of class ESS, used by R2GUESS ess <- read.ess(mydir) str(ess) ## GUESSFM maps snp numbers to names via a decode vector decode <- structure(colnames(X),names=as.numeric(1:ncol(X))) ## create a snpmod gfm2 <- ess2snpmod(ess) best.models(gfm2)
Now you can apply all the plotting functions etc in GUESSFM, but without the tagging strategy, you won't be able to do the expansion.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.