# output: rmarkdown::html_vignette
library(knitr)
opts_chunk$set(comment = "#>", collapse = TRUE)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
  if (!is.null(n <- options$out.lines)) {
    x = knitr:::split_lines(x)
    if (length(x) > n) {
      # truncate the output
      h = c(head(x, n), '....\n','....\n')
      t = c(tail(x, n))
      x = c(h, t)
    }
    x = paste(x, collapse = '\n') # paste first n lines together
  }
  hook_output(x, options)
})

#opts_chunk$set(eval=FALSE)
#Rscript -e "rmarkdown::render('mamba_example/example_mamba_analysis.Rmd')"

MAMBA

A Meta-Analysis Model Based Assessment of Replicability for genome-wide association studies.

The main goal of the MAMBA model is to identify loci which have non-zero replicable associations with a phenotype based on summary statistics from genome-wide association meta-analysis (GWAMA).

Installation

library(devtools)
install_github("dan11mcguire/mamba")

Quick Tutorial

The MAMBA model is fit with the mamba() function, and takes as input a matrix of effect size estimates and a matrix of their variances from GWA meta-analysis.

For a quick start, one could use

d<-generate_data_mamba()

to generate summary statistics according to the MAMBA model (SNPs are simulated independently). The returned data d$betajk is a MxK matrix of effect size estimates, where M is the total number of SNPs and K is the total number of studies, with the jth row and kth column correspond to the effect size estimate of the jth SNP in the kth study. Similarly, d$sjk2 is a Mxk matrix of the effect size estimate variances.

The MAMBA model could then be fit by:

mod<-mamba(betajk=d$beta, sjk2=d$sjk2)

The mamba() function then fits a two-level mixture model to the data, and calculates a posterior-probability-of-replicability (PPR) for each SNP. This is given in mod$ppr.

For SNPs with low PPR, the model also estimates a posterior probability that a particular summary statistic is an "outlier". We use the term "outlier" here to indicate a summary statistic which may be extreme or significant due only to artifacts. This is given in the mod$outliermat matrix.

Detailed example

Step 0: Initial GWAMA

To start, we conduct an initial fixed-effects meta-analysis to identify loci of interest which we want to validate with the MAMBA model. This could be done using METAL, rareGWAMA, or some other standard-type meta-analysis software for GWAS. As an example, we have the helper function make_mamba_summary_files() which takes as input a set of summary statistic files, and conducts inverse-variance-weighted fixed effects meta-analysis. We can use some example summary statistic files available as external package datasets for the example analysis.

library(mamba)
library(data.table)
library(parallel)

datadir<-system.file("extdata", package="mamba")
datafiles<-paste0(datadir, "/", grep("study_.*assoc", list.files(datadir),value=T))

fread(datafiles[1])[1:5] ### format of example summary statistic files 

## holdout the first dataset to assess replicability
discovery_datasets<-setdiff(datafiles, grep("study_1.assoc", datafiles, value=T))
exmpl<-make_mamba_summary_files(summary_files=discovery_datasets, 
                                standardize_beta_and_s2_datasets=TRUE,
                                beta_name="betaj", s2_name="sj2")

assoc<-exmpl$assoc
beta<-exmpl$beta
s2<-exmpl$s2
rm(exmpl)

assoc contains the meta-analysis results (in plink .assoc file format, with some additional columns) which can be passed directly to plink in step2.

https://www.cog-genomics.org/plink/1.9/formats

assoc[1:3] 

The beta and s2 objects will later be used as input to the mamba() function.
Each row represents the effect size estimates (or their variances) of a SNP, and each column represents a particular study. Not every SNP is analyzed in every study, missing values are represented by NA.

beta[1:2]
s2[1:2]

We can subset the analysis to SNPs which were analzed in at least 4 studies. We write out the association file to filepath: ivw_met_chr14.tsv.

assoc<-assoc[k >=4]
beta<-beta[snp %in% assoc[,SNP]]
s2<-s2[snp %in% assoc[,SNP]]


fwrite(assoc, file="ivw_met_chr14.tsv", sep="\t")

Step 1a: Clumping GWAMA results

https://www.cog-genomics.org/plink/1.9/postproc#clump

https://www.cog-genomics.org/plink/1.9/formats

An example command for clumping meta-analysis results in plink is below:

```{bash,eval=FALSE}

plink \ --bed "$refpanel".bed \ --bim "$refpanel".bim \ --fam "$refpanel".fam \ --clump ivw_met_chr14.tsv \ --clump-p1 1e-5 \ --clump-kb 500 \ --clump-p2 1 \ --clump-r2 0.1 \ --out example_chr14

outputted file with clumped association statistics: example_chr14.clumped

An output file from clumping using the above command is available as an external package dataset.

```r

datadir<-system.file("extdata", package="mamba")
clump_file<-paste0(datadir, "/", "example_chr14.clumped")
tail(fread(clump_file))

Step 1b: Prune variants from a reference panel, combine with clumped SNPs.

https://www.cog-genomics.org/plink/1.9/ld

An example command for pruning from a reference panel using plink is shown below.

```{bash,eval=FALSE}

plink \ --bed "$refpanel".bed \ --bim "$refpanel".bim \ --fam "$refpanel".fam \ --indep-pairwise 500kb 1 0.1 \ --maf 0.01 \ --out example_chr14

outputted file with pruned snps: example_chr14.prune.in

A set of SNPs pruned from HRC reference panel using the below commands (in *CHR:BP_REF_ALT* format) are available as a default in the MAMBA package.

```r
data(prunedsnps_HRC, package="mamba")
prunevars[1:5]

We can use the helper function select_mamba_loci() to combine the pruned and clumped variants, removing any pruned SNPs within 500kb of a clumped SNP.

mamba_snps<-select_mamba_loci(meta_file="ivw_met_chr14.tsv", clump_file=clump_file)

betajk<-beta[snp %in% mamba_snps[,SNP]]
sjk2<-s2[snp %in% mamba_snps[,SNP]]

Step 2: Fit the MAMBA model.

mod<-mamba(betajk=betajk[,-c("chr", "bp", "snp", "ref", "alt"),with=F], 
           sjk2=sjk2[,-c("chr", "bp", "snp", "ref", "alt"),with=F],
           snpids=betajk[,snp])

Assessing the output

The estimated posterior-probability-of-replicability (PPR) and effect sizes are contained in the mod$ppr and mod$mu.hat objects, which we reattach to our initial fixed effects meta-analysis table for comparison.

mamba_snps[,ppr:=mod$ppr]
mamba_snps[,mu_hat:=mod$mu.hat]

We can use the mod$outliermat object to identify the "most probable outlier study" for non-replicable SNPs. The columns in mod$outliermat assess posterior probability of an outlier given that true SNP association is zero and non-replicable. The number is only interpretable for SNPs with low PPR, where higher values indicate a summary statistic which dominated the meta-analysis result.

mod$outliermat[ppr < 0.5][1:3]
top_outlier<-melt(mod$outliermat, 
                  id.vars=c("ppr", "snp"), 
                  variable.factor=F,
                  value.name="outlier_prob",
                  variable.name="outlier_study")[order(-outlier_prob)][,head(.SD,1),by=.(snp)]

mamba_snps<-merge(mamba_snps, top_outlier[,-c("ppr"),with=F], by.x="SNP", by.y="snp",sort=FALSE)

Estimate FDR from PPR

Dr. Brad Efron has written extensively on the connection between empirical bayes posterior probability and false discovery rates, for example here: Empirical Bayes Analysis of a Microarray Experiment

Below we calculate an estimated FDR based on posterior probability (PPR) from the MAMBA model.

mamba_snps[,ppr_rank:=rank(-ppr,ties.method="max")]
mamba_snps[order(-ppr),Fdr:=cumsum(1-ppr)/ppr_rank]
mamba_snps[,Fdr:=max(Fdr),by=.(ppr_rank)]
mamba_snps[,ppr_rank:=NULL]

mamba_snps[order(-ppr),.(SNP, p_fe=P, ppr, Fdr)][1:5]

Calculate p-values

We use parametric bootstrap procedure to calculate p-values based on estimated PPR.
Basically, this means

  1. Simulating data based on the parameters we estimated in mod.

    • The function get_null_scores() using the call below outputs PPR for non-replicable SNPs into the file mamba_pval/chr14.nullscores.txt.

    • Note: The argument total_null_scores will affects the minimum p-value you can possibly obtain. For assessing a SNP at genome-wide significance, you need at least $20\times 10^6$ null scores, $\frac{1}{(20\times 10^6+1)} \approx 5\times 10^{-8}$

    • Note: clean_any_existing_scores=FALSE. When set to FALSE, simultaneous calls to get_null_scores() can dump output to the same file mamba_pval/chr14.nullscores.txt. This can speed up computation when additional computing resources are available.

  2. Comparing the PPR from the simulated non-replicable SNPs (data generated under the null hypothesis), with the estimated PPR from our original data. For a SNP $i$ with PPR score=$y$ and a set of simulated null SNP PPR scores $x$, the bootstrap p-value can be calculated as
    $$p_i = \frac{1 + \sum (y < x)}{1 + |x|}$$ , where $|x|$ is the number of simulated null SNP PPR scores.

For this short example, we settle for 1 million (10^6) scores. With clean_any_existing_scores=FALSE, we see some have already been generated by a prior call to the function.

system("if [ ! -d mamba_pvals ]; then mkdir mamba_pvals/ ; fi ")
null_scores<-mamba:::get_null_scores(model=mod, 
                                     s2=sjk2[,-c("chr", "bp", "ref", "alt", "snp"),with=F], 
                                     total_null_scores=10^6, 
                                     out="mamba_pvals/chr14",
                                     clean_any_existing_scores=FALSE)
mod$pval<-sapply(mod$ppr, 
                 function(score) (1 + sum(score < null_scores[[1]]))/(1 + length(null_scores[[1]])))

mamba_snps[,mamba_p:=mod$pval]
mamba_snps[order(mamba_p),.(SNP, mamba_p, ppr, Fdr)][1:10]

Comparison with holdout dataset

We can make some further comparisons with the initial fixed effects result with the results from the MAMBA model by looking at the holdout dataset excluded from our analysis.

### attach held-out dataset to results
holdout<-fread(paste0(datadir, "/study_1.assoc"))

mamba_snps<- merge(mamba_snps, 
             holdout[,.(snp=paste0(chr, ":", bp, "_", ref, "_", alt), beta_rep=betaj, p_rep=p)], 
             by.x="SNP", by.y="snp", all.x=T, sort=F)


### Proportion of SNPs with PPR > 0.5 which replicate direction of effect in the holdout cohort 
mamba_snps[ppr > 0.5][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]

### Proportion of SNPs with PPR > 0.5 which replicate direction of effect and have 
### replication p-value < 0.01 in the holdout cohort 
mamba_snps[ppr > 0.5][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))]


### Comparing replication using a fixed effects p-value threshold corresponding to the 
### PPR cutoff=0.5 used above.
max_fe_p<-mamba_snps[ppr > 0.5][,max(P)]
mamba_snps[P < max_fe_p][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]
mamba_snps[P < max_fe_p][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))]


### Replication for SNPs with likely outlier summary statistics 
mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]
mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(p_rep < .05 & (sign(beta_rep)==sign(mu_hat)))]


### Correlation between mamba effects size estimates 
### and replication effect size estimates, 
mamba_snps[!is.na(beta_rep),cor(mu_hat, beta_rep)] 

### Correlation between fixed effects size estimates 
### and replication effect size estimates, 
mamba_snps[!is.na(beta_rep), cor(BETA, beta_rep)]


dan11mcguire/mamba documentation built on Nov. 10, 2020, 12:37 a.m.