# output: rmarkdown::html_vignette library(knitr) opts_chunk$set(comment = "#>", collapse = TRUE) hook_output = knit_hooks$get('output') knit_hooks$set(output = function(x, options) { if (!is.null(n <- options$out.lines)) { x = knitr:::split_lines(x) if (length(x) > n) { # truncate the output h = c(head(x, n), '....\n','....\n') t = c(tail(x, n)) x = c(h, t) } x = paste(x, collapse = '\n') # paste first n lines together } hook_output(x, options) }) #opts_chunk$set(eval=FALSE) #Rscript -e "rmarkdown::render('mamba_example/example_mamba_analysis.Rmd')"
A Meta-Analysis Model Based Assessment of Replicability for genome-wide association studies.
The main goal of the MAMBA model is to identify loci which have non-zero replicable associations with a phenotype based on summary statistics from genome-wide association meta-analysis (GWAMA).
library(devtools) install_github("dan11mcguire/mamba")
The MAMBA model is fit with the mamba()
function, and takes as input a matrix of effect size estimates and a matrix of their variances from GWA meta-analysis.
For a quick start, one could use
d<-generate_data_mamba()
to generate summary statistics according to the MAMBA model (SNPs are simulated independently). The returned data d$betajk
is a MxK matrix of effect size estimates, where M is the total number of SNPs and K is the total number of studies, with the jth row and kth column correspond to the effect size estimate of the jth SNP in the kth study. Similarly, d$sjk2
is a Mxk matrix of the effect size estimate variances.
The MAMBA model could then be fit by:
mod<-mamba(betajk=d$beta, sjk2=d$sjk2)
The mamba()
function then fits a two-level mixture model to the data, and calculates a posterior-probability-of-replicability (PPR) for each SNP. This is given in mod$ppr
.
For SNPs with low PPR, the model also estimates a posterior probability that a particular summary statistic is an "outlier". We use the term "outlier" here to indicate a summary statistic which may be extreme or significant due only to artifacts. This is given in the mod$outliermat
matrix.
To start, we conduct an initial fixed-effects meta-analysis to identify loci of interest which we want to validate with the MAMBA model. This could be done using METAL, rareGWAMA, or some other standard-type meta-analysis software for GWAS. As an example, we have the helper function make_mamba_summary_files()
which takes as input a set of summary statistic files, and conducts inverse-variance-weighted fixed effects meta-analysis. We can use some example summary statistic files available as external package datasets for the example analysis.
library(mamba) library(data.table) library(parallel) datadir<-system.file("extdata", package="mamba") datafiles<-paste0(datadir, "/", grep("study_.*assoc", list.files(datadir),value=T)) fread(datafiles[1])[1:5] ### format of example summary statistic files ## holdout the first dataset to assess replicability discovery_datasets<-setdiff(datafiles, grep("study_1.assoc", datafiles, value=T)) exmpl<-make_mamba_summary_files(summary_files=discovery_datasets, standardize_beta_and_s2_datasets=TRUE, beta_name="betaj", s2_name="sj2") assoc<-exmpl$assoc beta<-exmpl$beta s2<-exmpl$s2 rm(exmpl)
assoc
contains the meta-analysis results (in plink
.assoc file format, with some additional columns) which can be passed directly to plink
in step2.
https://www.cog-genomics.org/plink/1.9/formats
assoc[1:3]
The beta
and s2
objects will later be used as input to the mamba()
function.
Each row represents the effect size estimates (or their variances) of a SNP, and each column represents a particular study. Not every SNP is analyzed in every study, missing values are represented by NA
.
beta[1:2] s2[1:2]
We can subset the analysis to SNPs which were analzed in at least 4 studies. We write out the association file to filepath: ivw_met_chr14.tsv
.
assoc<-assoc[k >=4] beta<-beta[snp %in% assoc[,SNP]] s2<-s2[snp %in% assoc[,SNP]] fwrite(assoc, file="ivw_met_chr14.tsv", sep="\t")
https://www.cog-genomics.org/plink/1.9/postproc#clump
https://www.cog-genomics.org/plink/1.9/formats
An example command for clumping meta-analysis results in plink is below:
```{bash,eval=FALSE}
plink \ --bed "$refpanel".bed \ --bim "$refpanel".bim \ --fam "$refpanel".fam \ --clump ivw_met_chr14.tsv \ --clump-p1 1e-5 \ --clump-kb 500 \ --clump-p2 1 \ --clump-r2 0.1 \ --out example_chr14
An output file from clumping using the above command is available as an external package dataset. ```r datadir<-system.file("extdata", package="mamba") clump_file<-paste0(datadir, "/", "example_chr14.clumped") tail(fread(clump_file))
https://www.cog-genomics.org/plink/1.9/ld
An example command for pruning from a reference panel using plink is shown below.
```{bash,eval=FALSE}
plink \ --bed "$refpanel".bed \ --bim "$refpanel".bim \ --fam "$refpanel".fam \ --indep-pairwise 500kb 1 0.1 \ --maf 0.01 \ --out example_chr14
A set of SNPs pruned from HRC reference panel using the below commands (in *CHR:BP_REF_ALT* format) are available as a default in the MAMBA package. ```r data(prunedsnps_HRC, package="mamba") prunevars[1:5]
We can use the helper function select_mamba_loci()
to combine the pruned and clumped variants, removing any pruned SNPs within 500kb of a clumped SNP.
mamba_snps<-select_mamba_loci(meta_file="ivw_met_chr14.tsv", clump_file=clump_file) betajk<-beta[snp %in% mamba_snps[,SNP]] sjk2<-s2[snp %in% mamba_snps[,SNP]]
mod<-mamba(betajk=betajk[,-c("chr", "bp", "snp", "ref", "alt"),with=F], sjk2=sjk2[,-c("chr", "bp", "snp", "ref", "alt"),with=F], snpids=betajk[,snp])
The estimated posterior-probability-of-replicability (PPR) and effect sizes are contained in the mod$ppr
and mod$mu.hat
objects, which we reattach to our initial fixed effects meta-analysis table for comparison.
mamba_snps[,ppr:=mod$ppr] mamba_snps[,mu_hat:=mod$mu.hat]
We can use the mod$outliermat
object to identify the "most probable outlier study" for non-replicable SNPs.
The columns in mod$outliermat
assess posterior probability of an outlier given that true SNP association is zero and non-replicable. The number is only interpretable for SNPs with low PPR, where higher values indicate a summary statistic which dominated the meta-analysis result.
mod$outliermat[ppr < 0.5][1:3] top_outlier<-melt(mod$outliermat, id.vars=c("ppr", "snp"), variable.factor=F, value.name="outlier_prob", variable.name="outlier_study")[order(-outlier_prob)][,head(.SD,1),by=.(snp)] mamba_snps<-merge(mamba_snps, top_outlier[,-c("ppr"),with=F], by.x="SNP", by.y="snp",sort=FALSE)
Dr. Brad Efron has written extensively on the connection between empirical bayes posterior probability and false discovery rates, for example here: Empirical Bayes Analysis of a Microarray Experiment
Below we calculate an estimated FDR based on posterior probability (PPR) from the MAMBA model.
mamba_snps[,ppr_rank:=rank(-ppr,ties.method="max")] mamba_snps[order(-ppr),Fdr:=cumsum(1-ppr)/ppr_rank] mamba_snps[,Fdr:=max(Fdr),by=.(ppr_rank)] mamba_snps[,ppr_rank:=NULL] mamba_snps[order(-ppr),.(SNP, p_fe=P, ppr, Fdr)][1:5]
We use parametric bootstrap procedure to calculate p-values based on estimated PPR.
Basically, this means
Simulating data based on the parameters we estimated in mod
.
The function get_null_scores()
using the call below outputs PPR for non-replicable SNPs into the file mamba_pval/chr14.nullscores.txt
.
Note: The argument total_null_scores
will affects the minimum p-value you can possibly obtain. For assessing a SNP at genome-wide significance, you need at least $20\times 10^6$ null scores, $\frac{1}{(20\times 10^6+1)} \approx 5\times 10^{-8}$
Note: clean_any_existing_scores=FALSE
. When set to FALSE, simultaneous calls to get_null_scores()
can dump output to the same file mamba_pval/chr14.nullscores.txt
. This can speed up computation when additional computing resources are available.
Comparing the PPR from the simulated non-replicable SNPs (data generated under the null hypothesis), with the estimated PPR from our original data. For a SNP $i$ with PPR score=$y$ and a set of simulated null SNP PPR scores $x$, the bootstrap p-value can be calculated as
$$p_i = \frac{1 + \sum (y < x)}{1 + |x|}$$ , where $|x|$ is the number of simulated null SNP PPR scores.
For this short example, we settle for 1 million (10^6) scores. With clean_any_existing_scores=FALSE
, we see some have already been generated by a prior call to the function.
system("if [ ! -d mamba_pvals ]; then mkdir mamba_pvals/ ; fi ") null_scores<-mamba:::get_null_scores(model=mod, s2=sjk2[,-c("chr", "bp", "ref", "alt", "snp"),with=F], total_null_scores=10^6, out="mamba_pvals/chr14", clean_any_existing_scores=FALSE)
mod$pval<-sapply(mod$ppr, function(score) (1 + sum(score < null_scores[[1]]))/(1 + length(null_scores[[1]]))) mamba_snps[,mamba_p:=mod$pval] mamba_snps[order(mamba_p),.(SNP, mamba_p, ppr, Fdr)][1:10]
We can make some further comparisons with the initial fixed effects result with the results from the MAMBA model by looking at the holdout dataset excluded from our analysis.
### attach held-out dataset to results holdout<-fread(paste0(datadir, "/study_1.assoc")) mamba_snps<- merge(mamba_snps, holdout[,.(snp=paste0(chr, ":", bp, "_", ref, "_", alt), beta_rep=betaj, p_rep=p)], by.x="SNP", by.y="snp", all.x=T, sort=F) ### Proportion of SNPs with PPR > 0.5 which replicate direction of effect in the holdout cohort mamba_snps[ppr > 0.5][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))] ### Proportion of SNPs with PPR > 0.5 which replicate direction of effect and have ### replication p-value < 0.01 in the holdout cohort mamba_snps[ppr > 0.5][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))] ### Comparing replication using a fixed effects p-value threshold corresponding to the ### PPR cutoff=0.5 used above. max_fe_p<-mamba_snps[ppr > 0.5][,max(P)] mamba_snps[P < max_fe_p][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))] mamba_snps[P < max_fe_p][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))] ### Replication for SNPs with likely outlier summary statistics mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))] mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(p_rep < .05 & (sign(beta_rep)==sign(mu_hat)))] ### Correlation between mamba effects size estimates ### and replication effect size estimates, mamba_snps[!is.na(beta_rep),cor(mu_hat, beta_rep)] ### Correlation between fixed effects size estimates ### and replication effect size estimates, mamba_snps[!is.na(beta_rep), cor(BETA, beta_rep)]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.