Dan McGuire 2019-08-30
A Meta-Analysis Model Based Assessment of Replicability for genome-wide association studies.
The main goal of the MAMBA model is to identify loci which have non-zero replicable associations with a phenotype based on summary statistics from genome-wide association meta-analysis (GWAMA).
library(devtools)
install_github("dan11mcguire/mamba")
The MAMBA model is fit with the mamba()
function, and takes as input a matrix of effect size estimates and a matrix of their variances from GWA meta-analysis.
For a quick start, one could use
d<-generate_data_mamba()
to generate summary statistics according to the MAMBA model (SNPs are simulated independently). The returned data d$betajk
is a MxK matrix of effect size estimates, where M is the total number of SNPs and K is the total number of studies, with the jth row and kth column correspond to the effect size estimate of the jth SNP in the kth study. Similarly, d$sjk2
is a Mxk matrix of the effect size estimate variances.
The MAMBA model could then be fit by:
mod<-mamba(betajk=d$beta, sjk2=d$sjk2)
The mamba()
function then fits a two-level mixture model to the data, and calculates a posterior-probability-of-replicability (PPR) for each SNP. This is given in mod$ppr
.
For SNPs with low PPR, the model also estimates a posterior probability that a particular summary statistic is an "outlier". We use the term "outlier" here to indicate a summary statistic which may be extreme or significant due only to artifacts. This is given in the mod$outliermat
matrix.
To start, we conduct an initial fixed-effects meta-analysis to identify loci of interest which we want to validate with the MAMBA model. This could be done using METAL, rareGWAMA, or some other standard-type meta-analysis software for GWAS. As an example, we have the helper function make_mamba_summary_files()
which takes as input a set of summary statistic files, and conducts inverse-variance-weighted fixed effects meta-analysis. We can use some example summary statistic files available as external package datasets for the example analysis.
library(mamba)
#> Loading required package: TMB
#> Loading required package: RcppEigen
library(data.table)
library(parallel)
datadir<-system.file("extdata", package="mamba")
datafiles<-paste0(datadir, "/", grep("study_.*assoc", list.files(datadir),value=T))
fread(datafiles[1])[1:5] ### format of example summary statistic files
#> chr bp ref alt n af betaj sj2 p
#> 1: 14 19012028 T G 3873 0.2109570 -0.042544349 0.003150100 0.44844000
#> 2: 14 19077386 G A 3873 0.4448340 -0.009459681 0.002123235 0.83734200
#> 3: 14 19108980 T A 3873 0.0109496 0.590080092 0.048417489 0.00732496
#> 4: 14 19111033 G A 3873 0.0215132 -0.057194393 0.024909152 0.71706200
#> 5: 14 19118923 A C 3873 0.1010160 0.060027754 0.005774000 0.42954200
## holdout the first dataset to assess replicability
discovery_datasets<-setdiff(datafiles, grep("study_1.assoc", datafiles, value=T))
exmpl<-make_mamba_summary_files(summary_files=discovery_datasets,
standardize_beta_and_s2_datasets=TRUE,
beta_name="betaj", s2_name="sj2")
assoc<-exmpl$assoc
beta<-exmpl$beta
s2<-exmpl$s2
rm(exmpl)
assoc
contains the meta-analysis results (in plink
.assoc file format, with some additional columns) which can be passed directly to plink
in step2.
https://www.cog-genomics.org/plink/1.9/formats
assoc[1:3]
#> CHR BP SNP A1 A2 FRQ INF0 BETA
#> 1: 14 19012028 14:19012028_T_G G T 0.19705563 1 -0.003134722
#> 2: 14 19077386 14:19077386_G_A A G 0.43236777 1 -0.001605019
#> 3: 14 19108980 14:19108980_T_A A T 0.01007469 1 -0.022115559
#> SE P k z tot_n tot_ac
#> 1: 0.006685018 0.6391285 11 -0.4689176 304200 119888.647
#> 2: 0.005399108 0.7662567 11 -0.2972748 304200 263052.553
#> 3: 0.026866038 0.4104062 9 -0.8231791 295401 5952.149
The beta
and s2
objects will later be used as input to the mamba()
function.
Each row represents the effect size estimates (or their variances) of a SNP, and each column represents a particular study. Not every SNP is analyzed in every study, missing values are represented by NA
.
beta[1:2]
#> chr bp snp ref alt b_1 b_2 b_3
#> 1: 14 19012028 14:19012028_T_G T G -0.024547326 -4.477517 0.03258657
#> 2: 14 19077386 14:19077386_G_A G A -0.006648167 -3.235424 0.11786526
#> b_4 b_5 b_6 b_7 b_8 b_9 b_10 b_11 b_12 b_13 b_14 b_15 b_16
#> 1: -0.01951356 NA NA NA NA NA NA NA NA NA -7.512437 NA NA
#> 2: -0.03475506 NA NA NA NA NA NA NA NA NA -2.333415 NA NA
#> b_17 b_18 b_19 b_20 b_21 b_22 b_23 b_24 b_25 b_26
#> 1: NA 0.04751072 NA NA NA NA NA -0.1918304 -0.001492210 NA
#> 2: NA 0.02447689 NA NA NA NA NA 0.7427556 -0.000931085 NA
#> b_27 b_28 b_29 b_30 b_31 b_32 b_33 b_34 b_35
#> 1: NA NA 0.2870923 NA NA NA 30.391214 NA -0.7272565
#> 2: NA NA 0.2439959 NA NA NA 1.476701 NA -0.3871208
s2[1:2]
#> chr bp snp ref alt s2_1 s2_2 s2_3
#> 1: 14 19012028 14:19012028_T_G T G 0.001048694 23.97333 0.004037369
#> 2: 14 19077386 14:19077386_G_A G A 0.001048694 22.75360 0.004037369
#> s2_4 s2_5 s2_6 s2_7 s2_8 s2_9 s2_10 s2_11 s2_12 s2_13 s2_14
#> 1: 0.0006575349 NA NA NA NA NA NA NA NA NA 5.943251
#> 2: 0.0006575349 NA NA NA NA NA NA NA NA NA 4.809094
#> s2_15 s2_16 s2_17 s2_18 s2_19 s2_20 s2_21 s2_22 s2_23 s2_24
#> 1: NA NA NA 0.001979334 NA NA NA NA NA 0.9839675
#> 2: NA NA NA 0.001979334 NA NA NA NA NA 0.4770850
#> s2_25 s2_26 s2_27 s2_28 s2_29 s2_30 s2_31 s2_32 s2_33
#> 1: 1.499739e-05 NA NA NA 0.7574003 NA NA NA 8776.17643
#> 2: 1.499739e-05 NA NA NA 0.5295231 NA NA NA 20.84973
#> s2_34 s2_35
#> 1: NA 3.600483
#> 2: NA 1.029959
We can subset the analysis to SNPs which were analzed in at least 4 studies. We write out the association file to filepath: ivw_met_chr14.tsv
.
assoc<-assoc[k >=4]
beta<-beta[snp %in% assoc[,SNP]]
s2<-s2[snp %in% assoc[,SNP]]
fwrite(assoc, file="ivw_met_chr14.tsv", sep="\t")
https://www.cog-genomics.org/plink/1.9/postproc#clump
https://www.cog-genomics.org/plink/1.9/formats
An example command for clumping meta-analysis results in plink is below:
plink \
--bed "$refpanel".bed \
--bim "$refpanel".bim \
--fam "$refpanel".fam \
--clump ivw_met_chr14.tsv \
--clump-p1 1e-5 \
--clump-kb 500 \
--clump-p2 1 \
--clump-r2 0.1 \
--out example_chr14
## outputted file with clumped association statistics: example_chr14.clumped
An output file from clumping using the above command is available as an external package dataset.
datadir<-system.file("extdata", package="mamba")
clump_file<-paste0(datadir, "/", "example_chr14.clumped")
tail(fread(clump_file))
#> CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001
#> 1: 14 1 14:69803439_A_T 69803439 7.58e-06 9 0 0 0 4 5
#> 2: 14 1 14:105904300_G_A 105904300 8.13e-06 0 0 0 0 0 0
#> 3: 14 1 14:105906937_T_G 105906937 9.23e-06 0 0 0 0 0 0
#> 4: 14 1 14:27126790_A_C 27126790 9.49e-06 40 0 0 0 28 12
#> 5: 14 1 14:105893876_G_A 105893876 9.61e-06 0 0 0 0 0 0
#> 6: 14 1 14:101025848_C_T 101025848 9.79e-06 0 0 0 0 0 0
#> SP2
#> 1: 14:69801864_C_A(1),14:69803200_G_A(1),14:69804924_C_T(1),14:69806055_A_T(1),14:69807319_G_A(1),14:69808627_G_C(1),14:69809624_G_A(1),14:69813370_A_G(1),14:69816344_G_A(1)
#> 2: NONE
#> 3: NONE
#> 4: 14:26935305_G_A(1),14:26937116_G_A(1),14:26954078_T_C(1),14:26958210_T_G(1),14:26961879_T_C(1),14:26963559_T_G(1),14:26981079_G_A(1),14:26983770_C_G(1),14:26985347_A_C(1),14:26985560_G_T(1),14:26986041_G_A(1),14:26994171_C_T(1),14:26996510_C_T(1),14:27009199_A_C(1),14:27010629_C_A(1),14:27016744_T_C(1),14:27017018_T_C(1),14:27017537_A_G(1),14:27021647_T_C(1),14:27030207_T_C(1),14:27038915_T_C(1),14:27051340_A_T(1),14:27051498_A_T(1),14:27056033_T_C(1),14:27056252_C_T(1),14:27057492_T_C(1),14:27105035_C_G(1),14:27105474_A_T(1),14:27107166_G_A(1),14:27107260_A_G(1),14:27107921_T_A(1),14:27111797_G_T(1),14:27112102_A_G(1),14:27113632_G_C(1),14:27114262_C_G(1),14:27114773_T_C(1),14:27114785_C_T(1),14:27115116_C_T(1),14:27120136_G_A(1),14:27128033_A_T(1)
#> 5: NONE
#> 6: NONE
https://www.cog-genomics.org/plink/1.9/ld
An example command for pruning from a reference panel using plink is shown below.
plink \
--bed "$refpanel".bed \
--bim "$refpanel".bim \
--fam "$refpanel".fam \
--indep-pairwise 500kb 1 0.1 \
--maf 0.01 \
--out example_chr14
## outputted file with pruned snps: example_chr14.prune.in
A set of SNPs pruned from HRC reference panel using the below commands (in CHR:BP_REF_ALT format) are available as a default in the MAMBA package.
data(prunedsnps_HRC, package="mamba")
prunevars[1:5]
#> SNP
#> 1: 10:65030_C_A
#> 2: 10:94263_C_A
#> 3: 10:95248_C_G
#> 4: 10:99932_T_G
#> 5: 10:112184_A_G
We can use the helper function select_mamba_loci()
to combine the pruned and clumped variants, removing any pruned SNPs within 500kb of a clumped SNP.
mamba_snps<-select_mamba_loci(meta_file="ivw_met_chr14.tsv", clump_file=clump_file)
#> [1] "input read. creating rema dataset.."
#> Time difference of 0.01489162 secs
betajk<-beta[snp %in% mamba_snps[,SNP]]
sjk2<-s2[snp %in% mamba_snps[,SNP]]
mod<-mamba(betajk=betajk[,-c("chr", "bp", "snp", "ref", "alt"),with=F],
sjk2=sjk2[,-c("chr", "bp", "snp", "ref", "alt"),with=F],
snpids=betajk[,snp])
#> [1] "e step finished."
#> [1] "m step finished."
#> [1] "p=0.005 lambda=0.959 tau2=0.00014673 alpha=9.657"
#> [1] "iteration 1 complete."
#> [1] "e step finished."
#> [1] "m step finished."
#> [1] "p=0.007 lambda=0.969 tau2=0.00010031 alpha=12.592"
#> [1] "relative ll increase: 0.00289249858605164"
#> [1] "ll increase: 691.81104935435"
#> [1] "iteration 2 complete."
#> [1] "e step finished."
#> [1] "m step finished."
....
....
#> [1] "e step finished."
#> [1] "m step finished."
#> [1] "p=0.032 lambda=0.995 tau2=3.835e-05 alpha=68.676"
#> [1] "relative ll increase: 1.0518675706234e-08"
#> [1] "ll increase: 0.00253462721593678"
#> [1] "iteration 45 complete."
#> [1] "e step finished."
#> [1] "m step finished."
#> [1] "p=0.032 lambda=0.995 tau2=3.826e-05 alpha=68.676"
#> [1] "relative ll increase: 9.5648508132523e-09"
#> [1] "ll increase: 0.00230478931916878"
The estimated posterior-probability-of-replicability (PPR) and effect sizes are contained in the mod$ppr
and mod$mu.hat
objects, which we reattach to our initial fixed effects meta-analysis table for comparison.
mamba_snps[,ppr:=mod$ppr]
mamba_snps[,mu_hat:=mod$mu.hat]
We can use the mod$outliermat
object to identify the "most probable outlier study" for non-replicable SNPs. The columns in mod$outliermat
assess posterior probability of an outlier given that true SNP association is zero and non-replicable. The number is only interpretable for SNPs with low PPR, where higher values indicate a summary statistic which dominated the meta-analysis result.
mod$outliermat[ppr < 0.5][1:3]
#> snp Oj_1 Oj_2 Oj_3 Oj_4
#> 1: 14:19012028_T_G 0.0007593869 0.0008637992 0.0006513838 0.0007610746
#> 2: 14:19077386_G_A 0.0005842565 0.0007177489 0.0031102118 0.0014135876
#> 3: 14:19108980_T_A 0.0194177552 0.0011617249 0.0008604968 0.0040570477
#> Oj_5 Oj_6 Oj_7 Oj_8 Oj_9 Oj_10 Oj_11 Oj_12 Oj_13 Oj_14 Oj_15
#> 1: NA NA NA NA NA NA NA NA NA 0.0580568694 NA
#> 2: NA NA NA NA NA NA NA NA NA 0.0009992573 NA
#> 3: NA NA NA NA NA NA NA NA NA 0.0006687855 NA
#> Oj_16 Oj_17 Oj_18 Oj_19 Oj_20 Oj_21 Oj_22 Oj_23 Oj_24
#> 1: NA NA 0.0010033110 NA NA NA NA NA 0.0005828923
#> 2: NA NA 0.0006642334 NA NA NA NA NA 0.0010112191
#> 3: NA NA 0.0006028209 NA NA NA NA NA 0.0006064193
#> Oj_25 Oj_26 Oj_27 Oj_28 Oj_29 Oj_30 Oj_31 Oj_32
#> 1: 0.0006156617 NA NA NA 0.0006037579 NA NA NA
#> 2: 0.0005887791 NA NA NA 0.0006048313 NA NA NA
#> 3: 0.0008110691 NA NA NA 0.0006554179 NA NA NA
#> Oj_33 Oj_34 Oj_35 ppr
#> 1: 0.0006026946 NA 0.0006151846 0.01787158
#> 2: 0.0006025006 NA 0.0006147624 0.01805945
#> 3: NA NA NA 0.02179809
top_outlier<-melt(mod$outliermat,
id.vars=c("ppr", "snp"),
variable.factor=F,
value.name="outlier_prob",
variable.name="outlier_study")[order(-outlier_prob)][,head(.SD,1),by=.(snp)]
mamba_snps<-merge(mamba_snps, top_outlier[,-c("ppr"),with=F], by.x="SNP", by.y="snp",sort=FALSE)
Dr. Brad Efron has written extensively on the connection between empirical bayes posterior probability and false discovery rates, for example here: Empirical Bayes Analysis of a Microarray Experiment
Below we calculate an estimated FDR based on posterior probability (PPR) from the MAMBA model.
mamba_snps[,ppr_rank:=rank(-ppr,ties.method="max")]
mamba_snps[order(-ppr),Fdr:=cumsum(1-ppr)/ppr_rank]
mamba_snps[,Fdr:=max(Fdr),by=.(ppr_rank)]
mamba_snps[,ppr_rank:=NULL]
mamba_snps[order(-ppr),.(SNP, p_fe=P, ppr, Fdr)][1:5]
#> SNP p_fe ppr Fdr
#> 1: 14:29273634_G_A 5.653125e-09 0.9998325 0.0001674948
#> 2: 14:28356733_G_A 4.674038e-08 0.9996620 0.0002527490
#> 3: 14:29795213_G_A 1.577426e-07 0.9992481 0.0004191414
#> 4: 14:32438484_G_A 1.349447e-07 0.9986866 0.0006427105
#> 5: 14:29791660_G_C 8.369951e-07 0.9978895 0.0009362671
We use parametric bootstrap procedure to calculate p-values based on estimated PPR. Basically, this means
Simulating data based on the parameters we estimated in mod
.
The function get_null_scores()
using the call below outputs PPR for non-replicable SNPs into the file mamba_pval/chr14.nullscores.txt
.
Note: The argument total_null_scores
will affects the minimum p-value you can possibly obtain. For assessing a SNP at genome-wide significance, you need at least 20 × 106 null scores, $\frac{1}{(20\times 10^6+1)} \approx 5\times 10^{-8}$
Note: clean_any_existing_scores=FALSE
. When set to FALSE, simultaneous calls to get_null_scores()
can dump output to the same file mamba_pval/chr14.nullscores.txt
. This can speed up computation when additional computing resources are available.
Comparing the PPR from the simulated non-replicable SNPs (data generated under the null hypothesis), with the estimated PPR from our original data. For a SNP i with PPR score=y and a set of simulated null SNP PPR scores x, the bootstrap p-value can be calculated as $$p_i = \frac{1 + \sum (y < x)}{1 + |x|}$$ , where |x| is the number of simulated null SNP PPR scores.
For this short example, we settle for 1 million (10^6) scores. With clean_any_existing_scores=FALSE
, we see some have already been generated by a prior call to the function.
system("if [ ! -d mamba_pvals ]; then mkdir mamba_pvals/ ; fi ")
null_scores<-mamba:::get_null_scores(model=mod,
s2=sjk2[,-c("chr", "bp", "ref", "alt", "snp"),with=F],
total_null_scores=10^6,
out="mamba_pvals/chr14",
clean_any_existing_scores=FALSE)
#> [1] "990000 scores already generated in mamba_pvals/chr14.nullscores.txt."
#> [1] "Fitting 2 models to generate approximately 1e+06 non-replicable SNPs."
#> [1] "Model 1 of 2 complete."
#> [1] "Model 2 of 2 complete."
mod$pval<-sapply(mod$ppr,
function(score) (1 + sum(score < null_scores[[1]]))/(1 + length(null_scores[[1]])))
mamba_snps[,mamba_p:=mod$pval]
mamba_snps[order(mamba_p),.(SNP, mamba_p, ppr, Fdr)][1:10]
#> SNP mamba_p ppr Fdr
#> 1: 14:28356733_G_A 9.985312e-07 0.9996620 0.0002527490
#> 2: 14:29273634_G_A 9.985312e-07 0.9998325 0.0001674948
#> 3: 14:29795213_G_A 9.985312e-07 0.9992481 0.0004191414
#> 4: 14:32438484_G_A 9.985312e-07 0.9986866 0.0006427105
#> 5: 14:29791660_G_C 1.997062e-06 0.9978895 0.0009362671
#> 6: 14:79580218_C_T 1.997062e-06 0.9960663 0.0017957679
#> 7: 14:90057852_C_T 1.997062e-06 0.9977493 0.0011553457
#> 8: 14:98594361_A_C 1.997062e-06 0.9964996 0.0014903512
#> 9: 14:77592879_G_A 2.995593e-06 0.9953230 0.0021159029
#> 10: 14:98811932_A_G 2.995593e-06 0.9942386 0.0024804534
We can make some further comparisons with the initial fixed effects result with the results from the MAMBA model by looking at the holdout dataset excluded from our analysis.
### attach held-out dataset to results
holdout<-fread(paste0(datadir, "/study_1.assoc"))
mamba_snps<- merge(mamba_snps,
holdout[,.(snp=paste0(chr, ":", bp, "_", ref, "_", alt), beta_rep=betaj, p_rep=p)],
by.x="SNP", by.y="snp", all.x=T, sort=F)
### Proportion of SNPs with PPR > 0.5 which replicate direction of effect in the holdout cohort
mamba_snps[ppr > 0.5][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]
#> [1] 0.8965517
### Proportion of SNPs with PPR > 0.5 which replicate direction of effect and have
### replication p-value < 0.01 in the holdout cohort
mamba_snps[ppr > 0.5][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))]
#> [1] 0.3793103
### Comparing replication using a fixed effects p-value threshold corresponding to the
### PPR cutoff=0.5 used above.
max_fe_p<-mamba_snps[ppr > 0.5][,max(P)]
mamba_snps[P < max_fe_p][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]
#> [1] 0.5849515
mamba_snps[P < max_fe_p][!is.na(p_rep),mean(p_rep < .01 & (sign(beta_rep)==sign(mu_hat)))]
#> [1] 0.03883495
### Replication for SNPs with likely outlier summary statistics
mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(sign(beta_rep)==sign(mu_hat))]
#> [1] 0.5639687
mamba_snps[outlier_prob > 0.5][ppr < 0.5][index_snp==1][!is.na(p_rep),mean(p_rep < .05 & (sign(beta_rep)==sign(mu_hat)))]
#> [1] 0.04699739
### Correlation between mamba effects size estimates
### and replication effect size estimates,
mamba_snps[!is.na(beta_rep),cor(mu_hat, beta_rep)]
#> [1] 0.02926117
### Correlation between fixed effects size estimates
### and replication effect size estimates,
mamba_snps[!is.na(beta_rep), cor(BETA, beta_rep)]
#> [1] 0.004871806
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.