# From GWAS Summary Statistics to Credible Sets in corrcoverage: Correcting the Coverage of Credible Sets from Bayesian Genetic Fine Mapping

### Z scores to PPs

Maller et al. derive a method for calculating posterior probabilities (PPs) of causality from GWAS summary statistics (Supplementary text), on which the following is based. For each of the $k$ SNPs in a genomic region, let $\beta_i$ ($i=1,\dots,k$) be the regression coefficient from a single-SNP logistic regression model, quantifying the evidence of an association between SNP $i$ and the disease. Assume that there is only one causal variant (CV) per region and that it is typed in the study. Then, if SNP $i$ is causal, $\beta_i\neq 0$ and $\beta_j$ (for $j\neq i$) is non-zero only through linkage disequilibrium (LD) between SNPs $i$ and $j$. No parametric assumptions are required for $\beta_i$ yet, so we write that it is sampled from some unspecified distribution, $\beta_i \sim \text{[ ]}$. The likelihood then factorises as follows (Equation 1): $$\begin{split} P(D|\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) & = P(D_i |\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }\beta_i\sim\text{[ ]},\text{ }i\text{ causal})\\ & = P(D_i |\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }i\text{ causal})\,, \end{split}$$

since $D_{-i}$ is independent of $\beta_i$ given $D_i$. Here, $D$ is the genotype data (0, 1 or 2 counts of the minor allele) for the entire genomic region and $i$ is a SNP in the region, such that $D_i$ and $D_{-i}$ are the genotype data at SNP $i$ and at the remaining SNPs in the genomic region, respectively.

Parametric assumptions can now be placed on SNP $i$'s true effect on disease. This effect is typically quantified as a log odds ratio and is assumed to be sampled from a Gaussian distribution, $\beta_i\sim N(0,W)$, where $W$ is chosen to reflect the researcher's prior belief about the variability of the true odds ratio (OR). Conventionally, $W$ is set to $0.2^2$, reflecting a belief that 95% of odds ratios lie between $\exp(-1.96\times 0.2)=0.68$ and $\exp(1.96\times 0.2)=1.48$.
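The arithmetic behind this prior interval can be checked with a short Python sketch. The function name `prior_or_interval` is hypothetical and is not part of the corrcoverage package; it simply exponentiates the central 95% interval of $N(0,W)$.

```python
import math

def prior_or_interval(W, z=1.96):
    """Central interval for the odds ratio implied by a N(0, W) prior
    on the log odds ratio (z = 1.96 gives the 95% interval)."""
    sd = math.sqrt(W)  # prior standard deviation of the log odds ratio
    return math.exp(-z * sd), math.exp(z * sd)

# Conventional choice W = 0.2^2
lower, upper = prior_or_interval(0.2 ** 2)
print(f"{lower:.2f} to {upper:.2f}")  # prints "0.68 to 1.48"
```

A larger $W$ widens this interval, encoding a prior belief that larger effect sizes are plausible.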

The posterior probability of causality for each SNP $i$ in an associated genomic region with $k$ SNPs is defined as $$PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\,, \quad i \in \{1,\dots,k\}.$$

Under the assumption that each SNP is *a priori* equally likely to be causal, $$P(\beta_i \sim N(0,W),\text{ }i\text{ causal})=\dfrac{1}{k}\,, \quad i \in \{1,\dots,k\}$$ and Bayes' theorem can be used to write $$\begin{aligned} PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\propto P(D|\beta_i\sim N(0,W),\text{ }i\text{ causal}). \end{aligned}$$

Dividing through by the probability of the genotype data given the null model of no genetic effect, $H_0$, yields a likelihood ratio, $$PP_i\propto \dfrac{P(D|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D|H_0)},$$

from which Equation (1) can be used to derive, $$PP_i\propto \frac{P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D_i|H_0)}= BF_i\,,$$ where $BF_i$ is the Bayes factor for SNP $i$, measuring the ratio of the probabilities of the data at SNP $i$ given the alternative (SNP $i$ is causal) and the null (no genetic effect) models.

In genetic association studies, where sample sizes are usually large, these BFs can be approximated using Wakefield's asymptotic Bayes factors (ABFs). Given that $\hat\beta_i\sim N(\beta_i,V_i)$ and *a priori* $\beta_i\sim N(0,W)$,

$$PP_i\propto BF_i \approx ABF_i=\sqrt{\frac{V_i}{V_i+W}}\exp\left(\frac{Z_i^2}{2}\frac{W}{(V_i+W)}\right)\,,$$ where $Z_i^2=\dfrac{\hat\beta_i^2}{V_i}$ is the squared marginal $Z$ score for SNP $i$.
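As a minimal sketch of this step, the Python below computes the ABFs from marginal $Z$ scores and variances and then normalises them into PPs. It is an illustration, not the corrcoverage implementation: the function names `log_abf` and `posterior_probs` and the example inputs are hypothetical. It assumes exactly one causal variant among the SNPs supplied, so the PPs are normalised to sum to one with no separate weight on the null model. The ABFs are handled on the log scale because $\exp(Z_i^2/2)$ can overflow for strongly associated SNPs.

```python
import math

def log_abf(z, V, W=0.2 ** 2):
    """Log of Wakefield's asymptotic Bayes factor for one SNP.

    z: marginal Z score, V: variance of the estimated log odds ratio,
    W: prior variance of the true log odds ratio.
    """
    r = W / (V + W)
    # sqrt(V / (V + W)) equals sqrt(1 - r); take logs for numerical stability
    return 0.5 * math.log(1.0 - r) + 0.5 * z * z * r

def posterior_probs(z_scores, variances, W=0.2 ** 2):
    """Normalise ABFs into PPs, assuming exactly one causal SNP in the region."""
    log_abfs = [log_abf(z, v, W) for z, v in zip(z_scores, variances)]
    m = max(log_abfs)  # subtract the maximum before exponentiating
    weights = [math.exp(l - m) for l in log_abfs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs for three SNPs
pps = posterior_probs([1.0, 4.5, 2.0], [0.01, 0.01, 0.01])
print(pps)
```

With equal variances, the SNP with the largest $|Z_i|$ receives the largest PP, since the ABF is increasing in $Z_i^2$.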

In Bayesian fine-mapping, PPs are calculated for all SNPs in the genomic region and the variants are sorted into descending order of their PP. The PPs are then cumulatively summed until some threshold, $\alpha$, is exceeded. The variants required to exceed this threshold form the $\alpha$-level credible set.
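The sorting and cumulative-sum procedure can be sketched in a few lines of Python. The function name `credible_set` and the example PPs are hypothetical, and the sketch assumes the PPs have already been normalised to sum to one.

```python
def credible_set(pps, threshold=0.95):
    """Return the indices of the SNPs in the credible set and its total PP.

    SNPs are added in descending order of PP until the cumulative
    sum exceeds the threshold.
    """
    order = sorted(range(len(pps)), key=lambda i: pps[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pps[i]
        if cumulative > threshold:
            break
    return chosen, cumulative

# Hypothetical PPs for four SNPs, with a 90% threshold
snps, size = credible_set([0.05, 0.6, 0.25, 0.1], threshold=0.9)
print(snps)  # prints [1, 2, 3]
```

Here the first two SNPs carry a cumulative PP of 0.85, which does not exceed 0.9, so a third SNP is added and the set's total PP is about 0.95. The sum of the PPs in the set is generally larger than the threshold itself.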


corrcoverage documentation built on Dec. 7, 2019, 1:07 a.m.