knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 5, fig.height = 4 )
library(mrclust)
MR-Clust is used to investigate the presence of clustered heterogeneity in Mendelian randomization analyses. Clustered heterogeneity is identified by the presence of two or more distinct clusters of genetic variants - such that genetic variants within a cluster recover similar estimates of the causal effect of the risk-factor on the outcome, i.e. their ratio-estimates are similar in direction, magnitude and precision. A cluster will typically comprise of two or more genetic variants and the ratio-estimates and standard errors of these variants are then used to compute the cluster centres. A cluster might represent a distinct pathway with which the risk-factor is related to the outcome and hence clustered heterogeneity is interesting to investigate as the identity of the genetic variants in the clusters may reveal information about the risk factor and how it relates to the outcome.
MR-Clust performs likelihood based clustering on a sample of ratio-estimates and associated standard error information. The MR-Clust mixture-model automatically accounts for the possibility that some or all of the ratio-estimates may be drawn under two distributions - the results from which we do not wish to interpret - (i) a null distribution, in which the ratio-estimates are centred around zero and; (ii) a junk distribution, which aims to model highly dispersed ratio-estimates which are considered outliers in the sample. In some situations, the junk-component may fail to capture all outliers in the sample, instead these observations may appear in clusters comprising a single genetic variant. Unless there is some underlying motivation for doing so, we do not recommend interpreting results from clusters containing one (or a small number of) genetic variants.
options(warn = -1)
install.packages("devtools", repos='http://cran.us.r-project.org') library(devtools) install_github("cnfoley/mrclust", build_vignettes = TRUE)
We introduce the MR-Clust package by allowing users to replicate an analysis from our manuscript.
Here we replicate the results from the MR-Clust manuscript: assessing the presence of clustered heterogeneity in the analysis of systolic blood pressure (SBP) and its effects on coronary artery disease (CAD) risk. We begin by loading the package and the SBP/CAD summary data. These data are avaiable with the package and can be found by typing
library(mrclust) sbp_cad <- mrclust::SBP_CAD head(sbp_cad)
Data from the other BP trait analyses, i.e. the diastolic blood pressure (DBP) and pulse pressure (PP) analyses, can be accessed similarly:
# dbp_cad <- mrclust::DBP_CAD # pp_cad <- mrclust::PP_CAD
In the combined systolic blood pressure (SBP) and coronary artery disease (CAD) dataset there are 8 variables: chromosome position, "chr.pos"; RSID; genetic associations with risk-factor SBP, "bx"; standard error of the genetic association with risk-factor SBP, "bxse"; genetic associations with outcome CAD, "by"; standard error of the genetic association with outcome CAD, "byse" and; the A1 and A2 alleles, "a1" and "a2". A description of these data is found by typing:
?mrclust::SBP_CAD
The function "mrclust" (minimally) requires the following summary data information:
############################################################## # estimates of the G-X associations and their standard errors ############################################################## bx = sbp_cad$bx bxse = sbp_cad$bxse ############################################################## # estimates of the G-Y associations and their standard errors ############################################################## by = sbp_cad$by byse = sbp_cad$byse ########################################### # ratio-estimates and their standard errors ########################################### ratio_est = by/bx ratio_est_se = byse/abs(bx) ##################### # Names for the snps ##################### snp_names = sbp_cad$chr.pos
To run a default MR-Clust analysis we use the "mr_clust_em" function (which employs algorithm 1 in the MR-Clust manuscript):
############################ ########### SBP-CAD analysis ############################ set.seed(2000030885) res_em = mr_clust_em(theta = ratio_est, theta_se = ratio_est_se, bx = bx, by = by, bxse = bxse, byse = byse, obs_names = snp_names)
The function returns a list whose elements summarise the output of the MR-Clust algorithm in 5 ways, which we discuss in turn. The names of these 5 summaries are found via:
names(res_em)
The "results" variable is a list containing two data tables "results\$all" and "results\$best". Both tables provide summaries of the ratio-estimates and standard errors for each genetic variant. In addition to this, data table "all" contains information on the allocation probability of each variant being assigned to one of the K+2 possible clusters; data table "best", however, presents information on the largest allocation probability for each variant only, i.e. each variant is assigned to a single (most probable) cluster. Hence, the results in data table "best" are considered the best clustering of the data:
############################# # results contain two tables: ############################# names(res_em$results) ################################################################################ # first few rows of table containing summaries for the genetic variants # allocated to "all" clusters ################################################################################ head(res_em$results$all) ################################################################################ # first few rows of table containing summaries for the genetic variants # allocated to the "best" cluster ################################################################################ head(res_em$results$best)
The results reveal that MR-Clust identified K=7 clusters in the SBP-CAD data: a null cluster, junk cluster and five additional clusters:
###################### # Identified clusters ###################### clusters = unique(res_em$results$best$cluster_class); # same answer if we type # unique(res_em$results$all$cluster_class) clusters[order(clusters)] ###################### # number of clusters K ###################### K = length(clusters) K
MR-Clust allows users to stratify the variants in a cluster by deciles of the allocation probability, which might help when aiming to prioritise the variants with high quality allocation probabilities for follow-up analyses. We do this using the second summary returned by MR-Clust: "cluster.membership". This is a list - of length the number of clusters - which, for each cluster, batches the variants, we use cluster 3 to illustrate the idea:
###################### # clusters ###################### names(res_em$cluster_membership) ########################################################################### # Variants assinged to the first few deciles of the allocation probability ########################################################################### head(res_em$cluster_membership$cluster_1)
There are 8 variants assigned to cluster 1 with an allocation probability in the range (0.9,1), there is 1 variant with an allocation prbability in the range (0.8, 0.9) and no variants with allocation probabilities in the range (0.7,0.8). We can single out the variants with the largest allocation probabilities by typing
######################################################### # Variants assinged to cluster one with probability >0.9 ######################################################### res_em$cluster_membership$cluster_1$pr_range_1_to_0.9
A scatter plot of results from the two-stage regression is returned by default with each variant colour coded according to their cluster assignment (using the best clustering). This is found by typing
################################### # Scatter plot of the fitted model ################################### plot_sbp_best = res_em$plots$two_stage + ggplot2::xlim(0, max(abs(bx) + 2*bxse)) + ggplot2::xlab("Genetic association with SBP") + ggplot2::ylab("Genetic association with CAD") + ggplot2::ggtitle("") ############## # plot output ############## plot_sbp_best
As noted from our manuscript, the quality of the best clustering of the data typically improves when considering variants assigned to a cluster with a large allocation probability and for clusters containing several variants. For ease, we will assume that a large allocation probability is above 0.8 and several variants means 4 or more. We can make use of the MR-Clust function "pr_clust", to remove varinats which do not meet these criteria:
################################################################################ # Variants allocated to clusters with: (i) at least 4 variants and; (ii) # allocation probability >=0.8 ################################################################################ res80 = mrclust::pr_clust(dta = res_em$results$best, prob = 0.8, min_obs = 4)
We can then re-plot the data using the results above and using the MR-Clust (two-stage regression estimate) plotting function "two_stage_plot", e.g.
############################################################## # Update summary effects using the prioritised set of variants ############################################################## keep80 = which(snp_names %in% res80$observation) bx80 = bx[keep80] bxse80 = bxse[keep80] by80 = by[keep80] byse80 = byse[keep80] snp_names80 = snp_names[keep80] plot.sbp.pr80 = two_stage_plot(res = res80, bx = bx80, by = by80, bxse = bxse80, byse = byse80, obs_names = snp_names80) + ggplot2::xlim(0, max(abs(bx80) + 2*bxse80)) + ggplot2::xlab("Genetic association with SBP") + ggplot2::ylab("Genetic association with CAD") + ggplot2::ggtitle(""); # plot result plot.sbp.pr80
MR-Clust allows users to search for traits associated with each of the variants in the non-null or junk clusters, to identify candidate mechanisms which might explain the presence of clustered heterogeneity in the data. To do this we make use of the variable "trait.search" in "rm_clust_em". Setting "trait.search=TRUE" will perform a trait search for each of the variants that are not in either the null or junk clusters, "trait.search" calls the "phenoscanner" function in the "MendelianRandomization" R package ( https://CRAN.R-project.org/package=MendelianRandomization ). This function takes in several variables which we use in MR-Clust also, i.e. the variables/parameters "catalogue" (default "GWAS"); proxies (deafult "None"); pvalue (coded "trait.pvalue"; default $10^{-5}$); build (genome build options 37 or 38, default 37); r2 (coded "proxy.r2", r2 = 0.8). See Phenoscanner (http://www.phenoscanner.medschl.cam.ac.uk/information/) for more information on the specification of these paramaters.
################################################### # Re-run analysis with trait search option == TRUE ################################################### res_em = mr_clust_em(theta = ratio_est, theta_se = ratio_est_se, bx = bx, by = by, bxse = bxse, byse = byse, obs_names = snp_names, max_iter = 5e3, tol = 1e-5, junk_mean = 0, results_list = list("all", "best"), cluster_membership = list(by_prob = 0.25, bound = 0.75), plot_results = list("best", min_pr = 0.5), trait_search = TRUE, trait_pvalue = 1e-5, proxy_r2 = 0.8, catalogue = "GWAS", proxies = "None", build = 37)
In addition to the output decribed previously, "mr_clust_em" returns a variable "trait.search". This is accessed by typing "res_em$trait.search":
############################## # Access trait search results ############################## res_search = res_em$trait_search #################### # variables returned #################### names(res_search)
The variable "trait" contains information on the traits identified as having an association with at least one variant in "cluster" below the p-value "p_cut_off". The variable "associated.snps" denotes the total number of snps in the cluster which are associated with the trait and "total_snps" denotes the total number of variants in the "cluster" analysed. Note that "total_snps" may be smaller than the number of variants assinged to "cluster". This is beacuse MR-Clust performs a variant-trait search for variants assinged to a cluster in the upper most strata of "cluster.membership" (under the assumption that this will lead to better true positive trait associations). In our analysis above we specified that our "cluster.membership" variable be organised into allocation probability steps of size 0.25 from 1 to bound=0.75, i.e.
############################################################ # Setting # "cluster.membership = list(by_prob = 0.25, bound = 0.75)" # in the argument of "mr_clust_em" # returns: ############################################################ head(res_em$cluster_membership)
Only these variants (in the top strata and not from either the null or junk clusters) are considered in the trait search.
We continue by inspecting results from the largest cluster of traits in the data, which we find via the following:
############################################### # Count the number of variants in each cluster ############################################### tab_clust = table(res_em$results$best$cluster_class) tab_clust ########################################################## # Identify the cluster with the largest number of variants ########################################################## mx_clust = which.max(tab_clust[!names(tab_clust) %in% c("Null", "Junk")]) clst_trt_srch = paste0("cluster_",mx_clust)
For smaller values of p-value cutt-off, the number of traits identified can be large and many traits will be associated with a small number of variants in a cluster. To avoid inspecting these types of results, we can choose to inspect only results in which the difference between the total number of snps assessed "total.snps" and those identified as being associated with the trait "associated.snps" is `small', i.e. "total.snps"-"associated.snps" is small. For illustrative purposes, we take this to be less than half "total.snps", i.e.
############################## # Half the total number of snp ############################## diff_snps_tol = floor(tab_clust[mx_clust]/2) ############################################## # Difference between total and associated snps ############################################## diff_snps = as.numeric(res_em$trait_search$total_snps) - as.numeric(res_em$trait_search$associated_snps)
Finally some results!
head(res_em$trait_search[which(diff_snps <= diff_snps_tol & res_em$trait_search$cluster == clst_trt_srch),])
...unsurprisingly some BP traits have been identified.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.