knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The Primo package can be used to integrate summary statistics to detect joint associations across multiple studies, allowing for the possibility of sample overlap. Here, a "study" refers to associations to a particular trait in a particular condition/cell-type/tissue-type or associations measured in a particular source/sample.

General framework {#Primo_general}

Following the method described by Gleason et al.^[Gleason et al.: https://www.biorxiv.org/content/10.1101/579581v3], Primo takes as input $m$ sets of summary statistics from each of $J$ studies and then:

  1. Estimates null and alternative density functions for each study.
  2. For each of $m$ sets of summary statistics, estimates the posterior probability of coming from each of $2^J$ association patterns representing the binary combinations of association status (null or alternative) between set $i$ and the $J$ studies.
  3. Estimates false discovery rates (FDR) to allow for selection of a posterior probability threshold to make inferences about association patterns.

The inference about association patterns may involve either a particular pattern of interest (e.g. study1+study2, but not study3 or study4) or group of combined patterns (e.g. study1+"at least 1 of studies 2/3/4").

1.1 Integrate summary statistics to estimate posterior probabilities {#postprob}

To illustrate, let's say we have obtained summary statistics from $J=4$ studies. From each study, these summary statistics include:

Here, each of the above sets of summary statistics are formed into matrices ($m \times J$), though dfs may also be a vector (length $J$) if the degrees of freedom never vary within any study.

We demonstrate Primo using the t-statistics version, which takes as input betas, sds and dfs. For a demonstration of the P-value version of Primo, see Primo for integrating P-values.

For each observation, we are interested in identifying its underlying association pattern. That is, we wish to identify the set of studies (e.g. traits) with which it is associated. To quantify the probability of each association pattern for each observation, we run an integrative analysis using Primo:

Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)))

In addition to the data previously described, Primo also requires the specification of alt_props, the estimated proportion of statistics that comes from the alternative distribution for each study. Here we specified $10^{-5}$ for the first study and $10^{-3}$ for the other 3 studies.

If the observations are SNPs, we may also have obtained minor allele frequencies (MAF) in the form of either a matrix ($m \times J$) or a vector (length $m$). In such cases, MAF and the number of subjects (N) may also be passed to the function to further adjust the error variance of the moderated t-statistics:

Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)),mafs=MAF,N=N)

Results

Primo_results now holds a list of 10 items. The primary elements of interest are:

The remaining elements are returned largely for use by other functions.

1.2 Combining association patterns into interpretable results {#collapse_pp}

From the results of Primo, we can combine posterior probabilities into interpretable results by summing over association patterns. For example, the following will create a matrix of the posterior probabilities of being associated with:

postprob_atLeastN <- Primo::collapse_pp_num(post_prob=Primo_results$post_prob)

And the following will provide the posterior probability of being associated with the first study and:

postprob_traitX <- Primo::collapse_pp_trait(post_prob=Primo_results$post_prob,req_idx=1)

1.3 Estimating the false discovery rate (FDR) {#fdr}

Primo estimates the false discovery rate (FDR) at specified posterior probability threshold(s) to guide selection of a threshold for inference. For probability threshold $\lambda$ and a vector $\hat{P}$ of the estimated probabilities of each variant belonging to the (possibly collapsed) pattern of interest, the estimated FDR is given by

$$ estFDR(\lambda) = \frac{\sum_i (1-\hat{P}_i) 1(\hat{P}_i \ge \lambda)}{#{\hat{P}_i \ge \lambda}} $$

where the index $i$ represents an observation. The following would estimate the FDR if we used a threshold of $0.8$ to identify observations associated with all four studies:

Primo::calc_fdr(Primo_results$post_prob[,16],thresh=0.8)

We can also estimate the FDR for collapsed probabilities and/or use a grid of possible thresholds to guide selection of an appropriate threshold. For example:

sapply(seq(0.95,0.75,-0.05), 
       function(th) Primo::calc_fdr(postprob_atLeastN[,"PP_ge2"],thresh=th))

Primo tailored to provide mechanistic interpretations of trait-associated SNPs {#Primo_snps}

Beyond its use as a general integrative analysis tool, Primo incorporates tailored developments to provide molecular mechanistic interpretations of known complex trait-associated SNPs by integrating summary statistics from GWAS and QTL studies. Primo takes as input $m$ sets of summary statistics from $J$ complex trait and omics studies, and then:

  1. Estimates key parameters and results as described in the general Primo framework (e.g. post_prob and pis) using all SNPs in the genome across all $J$ complex and omics traits.
  2. Focuses on the $S$ regions harboring GWAS SNPs to obtain the probability of association for SNPs in those regions, and identifies distinct lead omics SNPs.
  3. Performs conditional analysis of GWAS SNPs adjusting for distinct lead omics SNPs in each omics trait.
  4. Reports which GWAS SNPs are still associated with omics traits after conditional analysis adjusting for lead omics SNPs.
  5. Calculates estimated FDR for collapsed patterns of interests (GWAS+at least 1 omics, GWAS+at least 2 omics, etc).

Note that $m$ may be larger than the total number of SNPs if a SNP can be mapped to multiple outcomes (e.g. genes) within the same omics study.

2.1 Estimating key parameters

As an illustrative example, let's assume we have obtained $m$ sets of summary statistics from the associations of genetic variants with 1 complex trait and 3 omics traits (for a total of $J=4$ traits). As in the general version of Primo, Primo takes as input matrices ($m \times J$) of summary statistics:

Note that dfs, mafs, and N may also be vectors (of length $J$, $m$ and $J$, respectively).

We estimate key parameters using all SNPs in the genome by running an integrative analysis using Primo:

Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)),mafs=mafs,N=N)

Here, for alt_props (the estimated proportion of statistics that come from the alternative distribution), we specified $10^{-5}$ for the complex trait ($j=1$) and $10^{-3}$ for the 3 omics traits ($j \in {2,3,4})$.

While not needed by the main Primo function, it is also important to store the associated identifiers for variants and traits forming the $m$ rows of our data. Here, we store them in a data.frame called myID:

myID <- data.frame(SNP=paste0("SNP",1:12),study1="complex",study2=paste0("gene",rep(LETTERS[1:4],each=3)),study3=paste0("CpG1-for-gene",rep(LETTERS[1:4],each=3)),study4=paste0("protein",rep(LETTERS[1:4],each=3)))
head(myID,5)

2.2 Focus on regions harboring GWAS SNPs

Now we can subset the Primo results to the $S$ regions harboring GWAS SNPs. If myGenes holds the names of genes in the GWAS regions, then we subset by:

gwas_region_idx <- which(myID$study2 %in% myGenes)
Primo_gwas <- Primo::subset_Primo_obj(Primo_results,gwas_region_idx)
myID_gwas <- myID[gwas_region_idx,]

(See Identifying gene regions harboring GWAS loci for tip to identify genes in GWAS regions.)

2.3 Conditional analysis

Primo performs conditional analysis to assess whether the trait-association of a particular variant may be due to being in LD with a nearby variant that is a lead SNP for one (or more) of the traits. To conduct conditional analysis, Primo needs:

The LD_mat can be estimated using genotypes from one of the studies or a reference dataset (e.g. 1000 Genomes)^[1000 Genomes Project: http://www.internationalgenome.org/]. Row and column names of LD_mat should match the corresponding SNP names in IDs.

snp_info should be a data.frame with at least three columns:

snp_info <- data.frame(SNP=paste0("SNP",1:12),CHR=rep(1,12),POS=seq(1000,1033,3))
head(snp_info,3)

Note that if pvals is from the full results ($m$ rows), then it should also be subset to the GWAS regions if the Primo results were subset to the GWAS regions:

pvals <- pvals[gwas_region_idx,]

Now we run conditional analysis for the known complex trait-associated SNPs:

conditional_results <- 
  Primo::run_conditional_gwas(Primo_obj=Primo_gwas,IDs=myID_gwas,
                              gwas_snps=gwas_snps,pvals=pvals,
                              LD_mat=LD_mat,snp_info=snp_info,
                              pp_thresh=0.8, LD_thresh=0.9,
                              dist_thresh=5e3, pval_thresh=1e-2)

For each known trait-associated SNP, the function will condition on any lead SNPs that are $> 5$ Mb away from the trait-associated SNP, provided that those lead SNPs are not in high LD with the trait-associated SNP ($r^2<0.9$) and demonstrate some marginal association with the phenotype for which they are the lead SNP ($p<10^{-2}$). The function will determine omics associations and FDR using posterior probability $> 0.8$.

Primo::run_conditional_gwas returns a list containing two elements: pp_grouped and fdr, described in the next two sections.

2.4 GWAS SNPs still associated with omics traits after conditional analysis

We use the list element pp_grouped returned by Primo::run_conditional_gwas to determine which trait-associated SNPs are still associated with omics traits after conditional analysis. Note that only results for the SNPs specified in gwas_snps are returned. The first $J+1$ columns of conditional_results$pp_grouped hold the SNP and trait identifiers:

conditional_results <- list()
conditional_results$pp_grouped <- data.frame(SNP=c("SNP1","SNP5"),study1="complex",study2=c("geneA","geneB"),
                                             study3=c("CpG1-for-geneA","CpG1-for-geneB"),
                                             study4=c("proteinA","proteinB"),
                                             pp_nQTL_ge1=c(0.94,0.88),pp_nQTL_ge2=c(0.86,0.81), pp_nQTL_ge3=c(0.48,0.34),
                                             nQTL_orig=c(2,2),nQTL_final=c(2,1),top_pattern_precond=c(13,12),top_pattern_postcond=c(13,6),
                                             poss_QTL_assoc=c("study2;study4","study2"))
head(conditional_results$pp_grouped[,1:5],2)

The remaining columns of pp_grouped hold:

head(conditional_results$pp_grouped[,6:ncol(conditional_results$pp_grouped)],2)

Note that because a known trait-associated SNP may be mappable to multiple outcomes (e.g. genes) for the same omics trait, there may be more than one row in conditional_results$pp_grouped for a given trait-associated (GWAS) SNP. This allows the user to identify all outcomes (e.g. genes) with which the SNP may be associated. Often, a SNP-level summary will be desirable. The following will provide a SNP-level summary of the number of omics associations across all outcomes:

pp_grouped_maxN <- conditional_results$pp_grouped %>% 
                      dplyr::group_by(SNP) %>% 
                      dplyr::slice(which.max(nQTL_final))
pp_grouped_maxN <- data.frame(pp_grouped_maxN)[,c("SNP","nQTL_final")]

2.5 Estimating the false discovery rate (FDR)

In the list element fdr, the function Primo::run_conditional_gwas returns a named vector of the estimated false discovery rates (FDR) for each of the collapsed association patterns ("GWAS + at least $x$ omics trait(s)", for $x \in {1,...,J}$).

The false discovery rate (FDR) is estimated in similar fashion to the general version of Primo. However, after conditional analysis, we adjust the numerator to account for SNPs which change pattern after conditional analysis and no longer match the collapsed pattern description since we consider them to be estimated false discoveries. For SNPs which change pattern after conditional analysis such that they no longer match the collapsed association pattern (e.g. "GWAS + at least $x$ omics trait(s)"), their contribution to the numerator of the following equation is corrected to be 1:

$$ estFDR(\lambda) = \frac{\sum_i (1-\hat{P}_i) 1(\hat{P}_i \ge \lambda)}{#{\hat{P}_i \ge \lambda}} $$


Primo for integrating P-values {#pvalue}

3.1 Integrating P-values

In addition to integrating effect sizes and standard errors (i.e. t-statistics), Primo can also integrate P-values or other second-order association statistics. For example, if effect sizes and standard errors are not available, Primo can perform integrative analysis of $m$ sets of P-values from $J$ studies as in the following:

Primo_results <- Primo(pvals=pvals,alt_props=c(1e-5,rep(1e-3,3)),use_method="pval")

Here, pvals is a matrix ($m \times J$) of the (marginal) association P-values. For alt_props (the estimated proportion of statistics that come from the alternative distribution), we specified $10^{-5}$ for the first study and $10^{-3}$ for the other 3 studies (thus $J=4$ in the example). We specified use_method="pval" so that Primo did not try to run the default t-statistics version.

The P-value version of Primo returns a list of 7 elements, the first four of which have the same interpretation as when running Primo with effect sizes and standard errors (i.e. the t-statistic version). The primary elements of interest are again:

The remaining elements are returned largely for use by other functions.


Tips and tricks {#tips}

4.1 Creating input matrices {#creating_input}

In many cases, the summary statistics from different studies or traits will be stored in multiple places. To create the necessary inputs for Primo, we recommend utilizing functions from the data.table^[data.table package: https://cran.r-project.org/web/packages/data.table/index.html] package to read and align the data.

For example, lets's say we are interested in integrating complex trait-GWAS summary statistics (stored in a file titled "GWAS_results.txt") with eQTL summary statistics (stored in "expression_results.txt"). We start by reading in the data:

library(data.table)
gwas_stats <- data.table::fread("GWAS_results.txt")
eqtl_stats <- data.table::fread("expression_results.txt")
gwas_stats <- data.frame(SNP="SNP1",trait="complex",beta=0.099,sd=0.030,pval=0.001,df=19998,maf=0.17,N=20000)
eqtl_stats <- data.frame(SNP="SNP1",gene="gene1",beta=0.515,sd=0.200,pval=0.010,df=498,maf=0.15,N=500)

which may contain the following fields:

colnames(gwas_stats)
colnames(eqtl_stats)

Now we align the data by merging:

colnames(gwas_stats)[3:ncol(gwas_stats)] <- 
  paste(colnames(gwas_stats)[3:ncol(gwas_stats)],"g",sep="_")
colnames(eqtl_stats)[3:ncol(eqtl_stats)] <- 
  paste(colnames(eqtl_stats)[3:ncol(eqtl_stats)],"e",sep="_")

data.table::setkey(gwas_stats,SNP)
data.table::setkey(eqtl_stats,SNP)
merged_stats <- merge(gwas_stats,eqtl_stats)

While the first two commands aren't necessary, appending identifiers (e.g. "g" for GWAS; "e" for expression) to common variable names can make later processing clearer and easier (rather than letting merge append the defaults ".x" and ".y"). From our merged dataset merged_stats, it is easy to create the set of input matrices since the data is now properly aligned:

myID <- subset(merged_stats, select=c(SNP,trait,gene))

betas <- as.matrix(subset(merged_stats, select=paste("beta",c("g","e"),sep="_")))
sds <- as.matrix(subset(merged_stats, select=paste("sd",c("g","e"),sep="_")))
pvals <- as.matrix(subset(merged_stats, select=paste("pval",c("g","e"),sep="_")))
dfs <- as.matrix(subset(merged_stats, select=paste("df",c("g","e"),sep="_")))
mafs <- as.matrix(subset(merged_stats, select=paste("maf",c("g","e"),sep="_")))

There may be situations where we wish to merge/align the data by more than just SNP. For example, we may wish to match pairs of gene expression and protein abundance sets of summary statistics (so that a protein is aligned with the gene from which it is translated). After ensuring that the name(s) of the additional matching variable(s) match across the datasets, the merge step can be modified thusly:

data.table::setkeyv(eqtl_stats,c("SNP","gene"))
data.table::setkeyv(pqtl_stats,c("SNP","gene"))
merged_stats <- merge(eqtl_stats,pqtl_stats)

4.2 Providing mechanistic interpretations of trait-associated SNPs

4.2.1 Identifying gene regions harboring GWAS loci {#id_gene_regions}

If the genes in cis-regions harboring GWAS loci are not provided or known in advance, we can utilize the data to identify such regions. Since the trait-associated SNPs are known in advance, we store their identifiers in a vector: gwas_snps. Next we identify the indices of our identifier data.frame (myID), where the SNP is one of the known trait-associated SNPs, and use that information to identify genes in GWAS regions:

head(myID,4)
gwas_snps_idx <- which(myID$SNP %in% gwas_snps)
myGenes <- unique(myID$study2[gwas_snps_idx])

Now myGenes holds the names of genes in the GWAS regions.



kjgleason/Primo documentation built on Sept. 7, 2021, 3:58 a.m.