GUESSFM Introduction

Chris Wallace // web // email

Introduction

GUESS is software for Bayesian variable selection and model averaging using a stochastic search algorithm to visit the most likely models via MCMC. There also exists an R package, R2GUESS, which is a wrapper for running GUESS, and provides useful diagnostic and summary functions.

This package, GUESSFM, aims to extend GUESS for fine mapping, defined here as the task of identifying the most likely set of causal variants. The are particular challenges for fine mapping, mostly related to the density of SNP markers. Nonetheless, this vignette will at first focus on an example dataset that comes with R2GUESS.

This vignettes introduces GUESSFM, and tries to do a sample run with simulated data. The other vignettes visit some topics in more depth:

plotting and SNP groups.

R2GUESS Strategy

Although GUESS uses g-priors, which inhibits visiting models with SNPs in high LD, we have found that running GUESS with SNPs in complete LD can lead to instability. R2GUESS was created purely to allow running GUESS on a tagged set of SNPs to approximate the posterior model space, and then expand the most interesting models to include their tags. It is perhaps useful to give an example of expanding a model.

Suppose we begin tagging, and determine a couple of tag groups as follows:

Tag SNPs in tag group
\(A\) \(A_1\), \(A_2\)
\(B\) \(B_1\)

R2GUESS will remove SNPs A1, A2 and B1, and will save a file containing an object of class tags containting the information in the above table. Then, suppose after GUESS has run on the set of tag SNPs, model ((A,B)) is one of the most interesting. GUESS FM will `expand' this model to the set of models ({(A,B), (A_1,B), (A_2,B), (A,B_1), (A_1,B_1), (A_2,B_1)}).

We assess each model using Bayes Factors. The Bayes Factor for model (M_i) is

$$BF_{i0} = \frac{P(M_i | \text{data} )}{P(M_0 | \text{data})}.$$

In a fully Bayesian analysis, the individual model probabilities are calculated by integrating over prior distributions for the model parameters (regression coefficients, for example). This is what GUESS does. However, it assumes a linear model when often we will want to fine map a disease trait and therefore use a logistic model. Further, we cannot go back and run the expanded models through GUESS. Instead, we use Approximate Bayes Factors, based on the BIC:

$$-2 \log{ABF_{i0}} = BIC(M_i) - BIC(M_0).$$

These are calculated within R via the glm and BIC functions.

At what threshold should you tag? Probably, as high as you can go without inducing instability in GUESS. We have found (r^2=0.99) to work in some large datasets, but your mileage may vary. In the examples below we use a lower threshold purely for demonstration.

Simulate some data

We start with using some sample data from the snpStats package including 20 SNPs, and simulating a quantitative trait that depends on 3 causal SNPs.

library(snpStats)
data(for.exercise, package="snpStats")
set.seed(12346)
X <- snps.10[,101:120]
n <- nrow(X)
causal <- c("rs1555897","rs7069505")
Y <- rnorm(n,mean=as.numeric(X[,causal[1]]))*sqrt(0.2) +
  rnorm(n,mean=as.numeric(X[,causal[2]]))*sqrt(0.2) +
  rnorm(n)*sqrt(0.6)

X contains some missing genotypes, but no SNPs with such a low call rate we would worry in a large study. Still, the rest of the analysis is easier to interpret for the purposes of a vignette if we fill in the missing values.

library(GUESSFM)
summary(col.summary(X))
X <- impute.missing(X)
summary(col.summary(X))

Looking at the LD, we see this is a region in which D' (above the diagonal) is typically high, whilst (r^2) can be high between some SNPs, and with moderately strong (r^2 \simeq 0.7) between two of our causal SNPs:

ld <- show.ld(X=X)

Running GUESS and reading output

First, via GUESSFM's wrapper, tagging at (r^2=0.8). This is considerably lower than suggested above, and is used here purely for demonstration as there is limited strong LD in this dataset.

## THIS DIRECTORY WILL HOLD ALL THE GUESS INPUT AND OUTPUT 
## AND WILL BE CREATED IF IT DOESN'T ALREADY EXIST
mydir <- tempfile() 
## NB, IF GUESS IS NOT ON YOUR PATH, SUPPLY THE FULL PATH IN AN ARGUMENT
## guess.command="/path/to/GUESS"
run.bvs(X,Y,gdir=mydir,
        tag.r2=0.95,           # maximum r2 between SNPs to be modelled
        guess.command="GUESS", # /path/to/GUESS
        nexp=3,                # expected number of causal variants, an overestimate
        nsave=1000)            # number of best models to save

This can take a (long) while. For the purposes of this vignette, we will load the results from an existing run.

mydir <-  system.file("extdata",package="GUESSFM")

## what files were created?
list.files(mydir)

## read output with GUESSFM
d <- read.snpmod(mydir)

## examine the best models and SNPs with greatest marginal support within the tagged data.
best.models(d)
best.snps(d)

Huh. GUESS has selected one of our causal SNPs, but not both. Why? We have a clue from the ld matrix:

sel=rownames(best.snps(d))
ld[c(sel,causal),c(sel,causal)]

So the selected SNP rs11253451 has (r^2=0.996) and (D'=1) with the causal but unselected SNP rs1555897. That would explain it.

Note that both best.models and best.snps allow you to specify thresholds for how to determine "best". See their help pages for details.

The tags created within the run.bvs function are saved to a tags.RData file under mydir and can be examined.

(load(file.path(mydir,"tags.RData")))
tags
tagsof(tags,causal)
taggedby(tags,sel)

Indeed, rs11253451 is a tag for rs1555897.

NB: to see more about how to manipulate tags and groups objects, see the vignette

vignette("groups",package="GUESSFM")

Expanding

Tagging has allowed us to shrink the model space, by assuming that models with SNPs in very high LD will have very similar likelihoods, but for fine mapping we really do want to evaluate each and every model. So, having chosen our best set of models within the shrunken space, we need to expand each of them to all the possible models they tag:

dx<-expand.tags(d,tags)

Refitting

The expanded models above have all been assigned the log Bayes Factor for their nearest tag model. This isn't terrible, in practice, but if you care about fine mapping then you should get a more precise answer by refitting the most likely models individually. Note this is especially important if you have a binomial outcome, as GUESS has been run using a linear model.

Here, we take the set of most likely models which collectively capture 50% of the posterior support from GUESS, after expansion. Loading the speedglm library makes the fitting faster, and we calculate the approximate Bayes Factors using the BIC approximation. To do this, we also need to supply some information about our prior for the number of causal variants in the model.

best <- best.models(dx,cpp.thr=0.9)
library(speedglm)
abf <- abf.calc(y=Y,x=X,models=best$str,family="gaussian")
sm <- abf2snpmod(abf,expected=3)

Now we can explore the best SNPs and models in the tagged data, the expanded data, and the refitted data

best.snps(d)
best.snps(dx)
best.snps(sm)

We see that on expanding the tags we pick up the true causal variant, together with two more extraneous SNPs, but on refitting the model we see only the two causal SNPs and rs11253451, which is in very high LD with rs1555897.

SNP groups

A formal way to group SNPs in LD with posterior support is to use the snp.picker function, which can also produce a plot to show how it's working.
We do not expect that we will be able to discriminate, statistically, between highly correlated variants. Instead, the posterior support is likely to be diluted across such sets of variants. To group such SNPs, we used the marginal posterior probabilities of inclusion (MPPI) for each SNP, and applied the following algorithm:

  1. Pick the index SNP with maximum MPPI
  2. Order remaining SNPs by (r^2) with index SNP
  3. Exclude SNPs which co-occur in models with the index SNP (joint MPPI (>0.02))
  4. Step away from the index SNP in order of decreasing (r^2), adding SNPs to its group until (\text{MPPI}<0.001) for two SNPs in a row (NB, these SNPs will not be added to the SNP group), or until (r^2<0.5)
  5. Remove this set of SNPs and return to step 1 until no SNP remains with (\operatorname{MPPI}>0.01).

We summarize the support for any group of SNPs by the grouped marginal posterior probability of inclusion, or gMPPI.

sp <- snp.picker(sm,X)
summary(sp)
plot(sp)

Plotting

It is useful to assess the steps of any analysis by looking at the data. With such a large number of models, the best way is to plot aspects of the data. For that reason, GUESSFM contains lots of plotting functions, described in a separate vignette. To see it, do:

vignette("plotting",package="GUESSFM")

Parallelism

Looping over many, many models can be made quicker by parallel processing. GUESSFM does this by means of calls to the mclapply function in the parallel package. By default, the parallel package sets itself up to use two cores. You can change this by setting the option mc.cores. Eg, if you have 20 cores on your machine, you might set

options(mc.cores=16)

to use 16 of this for R, and leave the remainder free to run other processes.

Functions which make use of this (and over which you might then not to use mclapply are:

expand.tags

Using an existing R2GUESS run

You can convert a run from R2GUESS into a snpmod object with:

## read output using a convenience wrapper for as.ESS.object()
## this returns an object of class ESS, used by R2GUESS
ess <- read.ess(mydir)
str(ess)

## GUESSFM maps snp numbers to names via a decode vector
decode <- structure(colnames(X),names=as.numeric(1:ncol(X)))

## create a snpmod
gfm2 <- ess2snpmod(ess)
best.models(gfm2)

Now you can apply all the plotting functions etc in GUESSFM, but without the tagging strategy, you won't be able to do the expansion.



chr1swallace/GUESSFM documentation built on May 13, 2019, 6:17 p.m.