knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The Primo package can be used to integrate summary statistics to detect joint associations across multiple studies, allowing for the possibility of sample overlap. Here, a "study" refers to associations to a particular trait in a particular condition/cell-type/tissue-type or associations measured in a particular source/sample.
Following the method described by Gleason et al.^[Gleason et al.: https://www.biorxiv.org/content/10.1101/579581v3], Primo takes as input $m$ sets of summary statistics from each of $J$ studies and then:
The inference about association patterns may involve either a particular pattern of interest (e.g. study1+study2, but not study3 or study4) or group of combined patterns (e.g. study1+"at least 1 of studies 2/3/4").
To illustrate, let's say we have obtained summary statistics from $J=4$ studies. From each study, these summary statistics include:
betas
:coefficient estimatessds
: standard error estimates (of coefficients)dfs
: degrees of freedom for each analysispvalues
: nominal P-values from each studyHere, each of the above sets of summary statistics are formed into matrices ($m \times J$), though dfs
may also be a vector (length $J$) if the degrees of freedom never vary within any study.
We demonstrate Primo using the t-statistics version, which takes as input betas
, sds
and dfs
. For a demonstration of the P-value version of Primo, see Primo for integrating P-values.
For each observation, we are interested in identifying its underlying association pattern. That is, we wish to identify the set of studies (e.g. traits) with which it is associated. To quantify the probability of each association pattern for each observation, we run an integrative analysis using Primo:
Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)))
In addition to the data previously described, Primo also requires the specification of alt_props
, the estimated proportion of statistics that comes from the alternative distribution for each study. Here we specified $10^{-5}$ for the first study and $10^{-3}$ for the other 3 studies.
If the observations are SNPs, we may also have obtained minor allele frequencies (MAF) in the form of either a matrix ($m \times J$) or a vector (length $m$). In such cases, MAF and the number of subjects (N) may also be passed to the function to further adjust the error variance of the moderated t-statistics:
Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)),mafs=MAF,N=N)
Primo_results
now holds a list of 10 items. The primary elements of interest are:
Primo_results$post_prob
: the posterior probabilites of each association pattern for each observation ($m \times 2^J$ matrix)Primo_results$pis
: the estimated proportions of all observations belonging to each association pattern (vector of length $2^J$)The remaining elements are returned largely for use by other functions.
From the results of Primo, we can combine posterior probabilities into interpretable results by summing over association patterns. For example, the following will create a matrix of the posterior probabilities of being associated with:
postprob_atLeastN <- Primo::collapse_pp_num(post_prob=Primo_results$post_prob)
And the following will provide the posterior probability of being associated with the first study and:
postprob_traitX <- Primo::collapse_pp_trait(post_prob=Primo_results$post_prob,req_idx=1)
Primo estimates the false discovery rate (FDR) at specified posterior probability threshold(s) to guide selection of a threshold for inference. For probability threshold $\lambda$ and a vector $\hat{P}$ of the estimated probabilities of each variant belonging to the (possibly collapsed) pattern of interest, the estimated FDR is given by
$$ estFDR(\lambda) = \frac{\sum_i (1-\hat{P}_i) 1(\hat{P}_i \ge \lambda)}{#{\hat{P}_i \ge \lambda}} $$
where the index $i$ represents an observation. The following would estimate the FDR if we used a threshold of $0.8$ to identify observations associated with all four studies:
Primo::calc_fdr(Primo_results$post_prob[,16],thresh=0.8)
We can also estimate the FDR for collapsed probabilities and/or use a grid of possible thresholds to guide selection of an appropriate threshold. For example:
sapply(seq(0.95,0.75,-0.05), function(th) Primo::calc_fdr(postprob_atLeastN[,"PP_ge2"],thresh=th))
Beyond its use as a general integrative analysis tool, Primo incorporates tailored developments to provide molecular mechanistic interpretations of known complex trait-associated SNPs by integrating summary statistics from GWAS and QTL studies. Primo takes as input $m$ sets of summary statistics from $J$ complex trait and omics studies, and then:
post_prob
and pis
) using all SNPs in the genome across all $J$ complex and omics traits.Note that $m$ may be larger than the total number of SNPs if a SNP can be mapped to multiple outcomes (e.g. genes) within the same omics study.
As an illustrative example, let's assume we have obtained $m$ sets of summary statistics from the associations of genetic variants with 1 complex trait and 3 omics traits (for a total of $J=4$ traits). As in the general version of Primo, Primo takes as input matrices ($m \times J$) of summary statistics:
betas
:coefficient estimatessds
: standard error estimates (of coefficients)dfs
: degrees of freedom for each analysismafs
: minor allele frequenciesN
: number of subjectsNote that dfs
, mafs
, and N
may also be vectors (of length $J$, $m$ and $J$, respectively).
We estimate key parameters using all SNPs in the genome by running an integrative analysis using Primo:
Primo_results <- Primo(betas=betas,sds=sds,dfs=dfs,alt_props=c(1e-5,rep(1e-3,3)),mafs=mafs,N=N)
Here, for alt_props
(the estimated proportion of statistics that come from the alternative distribution), we specified $10^{-5}$ for the complex trait ($j=1$) and $10^{-3}$ for the 3 omics traits ($j \in {2,3,4})$.
While not needed by the main Primo function, it is also important to store the associated identifiers for variants and traits forming the $m$ rows of our data. Here, we store them in a data.frame called myID
:
myID <- data.frame(SNP=paste0("SNP",1:12),study1="complex",study2=paste0("gene",rep(LETTERS[1:4],each=3)),study3=paste0("CpG1-for-gene",rep(LETTERS[1:4],each=3)),study4=paste0("protein",rep(LETTERS[1:4],each=3)))
head(myID,5)
Now we can subset the Primo results to the $S$ regions harboring GWAS SNPs. If myGenes
holds the names of genes in the GWAS regions, then we subset by:
gwas_region_idx <- which(myID$study2 %in% myGenes) Primo_gwas <- Primo::subset_Primo_obj(Primo_results,gwas_region_idx) myID_gwas <- myID[gwas_region_idx,]
(See Identifying gene regions harboring GWAS loci for tip to identify genes in GWAS regions.)
Primo performs conditional analysis to assess whether the trait-association of a particular variant may be due to being in LD with a nearby variant that is a lead SNP for one (or more) of the traits. To conduct conditional analysis, Primo needs:
Primo_obj
: list of Primo results, possibly subset by Primo::subset_Primo_obj
IDs
: data.frame of identifiers for each observation in the Primo resultsgwas_snps
: character vector of known trait-associated SNPspvals
: matrix of nominal P-values for the marginal associations in each studyLD_mat
: matrix of genotype correlation coefficients ($r$) for SNPs in GWAS regionssnp_info
: data.frame of chromosome/position information for each SNPpp_thresh
: a posterior probability threshold at which to calculate FDRThe LD_mat
can be estimated using genotypes from one of the studies or a reference dataset (e.g. 1000 Genomes)^[1000 Genomes Project: http://www.internationalgenome.org/]. Row and column names of LD_mat
should match the corresponding SNP names in IDs
.
snp_info
should be a data.frame with at least three columns:
snp_info <- data.frame(SNP=paste0("SNP",1:12),CHR=rep(1,12),POS=seq(1000,1033,3))
head(snp_info,3)
Note that if pvals
is from the full results ($m$ rows), then it should also be subset to the GWAS regions if the Primo results were subset to the GWAS regions:
pvals <- pvals[gwas_region_idx,]
Now we run conditional analysis for the known complex trait-associated SNPs:
conditional_results <- Primo::run_conditional_gwas(Primo_obj=Primo_gwas,IDs=myID_gwas, gwas_snps=gwas_snps,pvals=pvals, LD_mat=LD_mat,snp_info=snp_info, pp_thresh=0.8, LD_thresh=0.9, dist_thresh=5e3, pval_thresh=1e-2)
For each known trait-associated SNP, the function will condition on any lead SNPs that are $> 5$ Mb away from the trait-associated SNP, provided that those lead SNPs are not in high LD with the trait-associated SNP ($r^2<0.9$) and demonstrate some marginal association with the phenotype for which they are the lead SNP ($p<10^{-2}$). The function will determine omics associations and FDR using posterior probability $> 0.8$.
Primo::run_conditional_gwas
returns a list containing two elements: pp_grouped
and fdr
, described in the next two sections.
We use the list element pp_grouped
returned by Primo::run_conditional_gwas
to determine which trait-associated SNPs are still associated with omics traits after conditional analysis. Note that only results for the SNPs specified in gwas_snps
are returned. The first $J+1$ columns of conditional_results$pp_grouped
hold the SNP and trait identifiers:
conditional_results <- list() conditional_results$pp_grouped <- data.frame(SNP=c("SNP1","SNP5"),study1="complex",study2=c("geneA","geneB"), study3=c("CpG1-for-geneA","CpG1-for-geneB"), study4=c("proteinA","proteinB"), pp_nQTL_ge1=c(0.94,0.88),pp_nQTL_ge2=c(0.86,0.81), pp_nQTL_ge3=c(0.48,0.34), nQTL_orig=c(2,2),nQTL_final=c(2,1),top_pattern_precond=c(13,12),top_pattern_postcond=c(13,6), poss_QTL_assoc=c("study2;study4","study2"))
head(conditional_results$pp_grouped[,1:5],2)
The remaining columns of pp_grouped
hold:
pp_thresh
)head(conditional_results$pp_grouped[,6:ncol(conditional_results$pp_grouped)],2)
Note that because a known trait-associated SNP may be mappable to multiple outcomes (e.g. genes) for the same omics trait, there may be more than one row in conditional_results$pp_grouped
for a given trait-associated (GWAS) SNP. This allows the user to identify all outcomes (e.g. genes) with which the SNP may be associated. Often, a SNP-level summary will be desirable. The following will provide a SNP-level summary of the number of omics associations across all outcomes:
pp_grouped_maxN <- conditional_results$pp_grouped %>% dplyr::group_by(SNP) %>% dplyr::slice(which.max(nQTL_final)) pp_grouped_maxN <- data.frame(pp_grouped_maxN)[,c("SNP","nQTL_final")]
In the list element fdr
, the function Primo::run_conditional_gwas
returns a named vector of the estimated false discovery rates (FDR) for each of the collapsed association patterns ("GWAS + at least $x$ omics trait(s)", for $x \in {1,...,J}$).
The false discovery rate (FDR) is estimated in similar fashion to the general version of Primo. However, after conditional analysis, we adjust the numerator to account for SNPs which change pattern after conditional analysis and no longer match the collapsed pattern description since we consider them to be estimated false discoveries. For SNPs which change pattern after conditional analysis such that they no longer match the collapsed association pattern (e.g. "GWAS + at least $x$ omics trait(s)"), their contribution to the numerator of the following equation is corrected to be 1:
$$ estFDR(\lambda) = \frac{\sum_i (1-\hat{P}_i) 1(\hat{P}_i \ge \lambda)}{#{\hat{P}_i \ge \lambda}} $$
In addition to integrating effect sizes and standard errors (i.e. t-statistics), Primo can also integrate P-values or other second-order association statistics. For example, if effect sizes and standard errors are not available, Primo can perform integrative analysis of $m$ sets of P-values from $J$ studies as in the following:
Primo_results <- Primo(pvals=pvals,alt_props=c(1e-5,rep(1e-3,3)),use_method="pval")
Here, pvals
is a matrix ($m \times J$) of the (marginal) association P-values. For alt_props
(the estimated proportion of statistics that come from the alternative distribution), we specified $10^{-5}$ for the first study and $10^{-3}$ for the other 3 studies (thus $J=4$ in the example). We specified use_method="pval"
so that Primo did not try to run the default t-statistics version.
The P-value version of Primo returns a list of 7 elements, the first four of which have the same interpretation as when running Primo
with effect sizes and standard errors (i.e. the t-statistic version). The primary elements of interest are again:
Primo_results$post_prob
: the posterior probabilites of each association pattern for each observation ($m \times 2^J$ matrix)
Primo_results$pis
: the estimated proportions of all observations belonging to each association pattern (vector of length $2^J$)
The remaining elements are returned largely for use by other functions.
In many cases, the summary statistics from different studies or traits will be stored in multiple places. To create the necessary inputs for Primo, we recommend utilizing functions from the data.table^[data.table package: https://cran.r-project.org/web/packages/data.table/index.html] package to read and align the data.
For example, lets's say we are interested in integrating complex trait-GWAS summary statistics (stored in a file titled "GWAS_results.txt") with eQTL summary statistics (stored in "expression_results.txt"). We start by reading in the data:
library(data.table) gwas_stats <- data.table::fread("GWAS_results.txt") eqtl_stats <- data.table::fread("expression_results.txt")
gwas_stats <- data.frame(SNP="SNP1",trait="complex",beta=0.099,sd=0.030,pval=0.001,df=19998,maf=0.17,N=20000) eqtl_stats <- data.frame(SNP="SNP1",gene="gene1",beta=0.515,sd=0.200,pval=0.010,df=498,maf=0.15,N=500)
which may contain the following fields:
colnames(gwas_stats) colnames(eqtl_stats)
Now we align the data by merging:
colnames(gwas_stats)[3:ncol(gwas_stats)] <- paste(colnames(gwas_stats)[3:ncol(gwas_stats)],"g",sep="_") colnames(eqtl_stats)[3:ncol(eqtl_stats)] <- paste(colnames(eqtl_stats)[3:ncol(eqtl_stats)],"e",sep="_") data.table::setkey(gwas_stats,SNP) data.table::setkey(eqtl_stats,SNP) merged_stats <- merge(gwas_stats,eqtl_stats)
While the first two commands aren't necessary, appending identifiers (e.g. "g" for GWAS; "e" for expression) to common variable names can make later processing clearer and easier (rather than letting merge
append the defaults ".x" and ".y"). From our merged dataset merged_stats
, it is easy to create the set of input matrices since the data is now properly aligned:
myID <- subset(merged_stats, select=c(SNP,trait,gene)) betas <- as.matrix(subset(merged_stats, select=paste("beta",c("g","e"),sep="_"))) sds <- as.matrix(subset(merged_stats, select=paste("sd",c("g","e"),sep="_"))) pvals <- as.matrix(subset(merged_stats, select=paste("pval",c("g","e"),sep="_"))) dfs <- as.matrix(subset(merged_stats, select=paste("df",c("g","e"),sep="_"))) mafs <- as.matrix(subset(merged_stats, select=paste("maf",c("g","e"),sep="_")))
There may be situations where we wish to merge/align the data by more than just SNP. For example, we may wish to match pairs of gene expression and protein abundance sets of summary statistics (so that a protein is aligned with the gene from which it is translated). After ensuring that the name(s) of the additional matching variable(s) match across the datasets, the merge step can be modified thusly:
data.table::setkeyv(eqtl_stats,c("SNP","gene")) data.table::setkeyv(pqtl_stats,c("SNP","gene")) merged_stats <- merge(eqtl_stats,pqtl_stats)
If the genes in cis-regions harboring GWAS loci are not provided or known in advance, we can utilize the data to identify such regions. Since the trait-associated SNPs are known in advance, we store their identifiers in a vector: gwas_snps
. Next we identify the indices of our identifier data.frame (myID
), where the SNP is one of the known trait-associated SNPs, and use that information to identify genes in GWAS regions:
head(myID,4)
gwas_snps_idx <- which(myID$SNP %in% gwas_snps) myGenes <- unique(myID$study2[gwas_snps_idx])
Now myGenes
holds the names of genes in the GWAS regions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.