Here we show how to estimate the full posterior distribution of genetic value using LDpred2 as described in Large uncertainty in individual PRS estimation impacts PRS-based risk stratification.
This tutorial assumes that you have already tuned the hyperparameters using grid search. Please refer to Computing polygenic scores using LDpred2 for tutorials on data preprocessing and LDpred2-grid.
options(htmltools.dir.version = FALSE, width = 75, max.print = 30) knitr::opts_knit$set(global.par = TRUE, root.dir = "..") knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', dev = 'png')
# install.packages("runonce") zip <- runonce::download_file( "https://github.com/privefl/bigsnpr/raw/master/data-raw/public-data3.zip", dir = "tmp-data") unzip(zip) ## ---- echo=FALSE--------------------------------------------------------- unlink(paste0("tmp-data/public-data3", c(".bk", ".rds"))) ## ------------------------------------------------------------------------ # Load packages bigsnpr and bigstatsr library(bigsnpr) # Read from bed/bim/fam, it generates .bk and .rds files. snp_readBed("tmp-data/public-data3.bed") # Attach the "bigSNP" object in R session obj.bigSNP <- snp_attach("tmp-data/public-data3.rds") # See how the file looks like str(obj.bigSNP, max.level = 2, strict.width = "cut") # Get aliases for useful slots G <- obj.bigSNP$genotypes CHR <- obj.bigSNP$map$chromosome POS <- obj.bigSNP$map$physical.pos y <- obj.bigSNP$fam$affection (NCORES <- nb_cores()) ## ------------------------------------------------------------------------ # Read external summary statistics sumstats <- bigreadr::fread2("tmp-data/public-data3-sumstats.txt") str(sumstats) ## ------------------------------------------------------------------------ set.seed(1) ind.val <- sample(nrow(G), 350) ind.test <- setdiff(rows_along(G), ind.val) sumstats$n_eff <- sumstats$N map <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0")) df_beta <- snp_match(sumstats, map, join_by_pos = FALSE) # use rsid instead of pos ## ------------------------------------------------------------------------ POS2 <- obj.bigSNP$map$genetic.dist tmp <- tempfile(tmpdir = "tmp-data") for (chr in 1:22) { # print(chr) ## indices in 'df_beta' ind.chr <- which(df_beta$chr == chr) ## indices in 'G' ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr] corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000, infos.pos = POS2[ind.chr2], ncores = NCORES) if (chr == 1) { ld <- Matrix::colSums(corr0^2) corr <- as_SFBM(corr0, tmp, compact = TRUE) } else { ld <- c(ld, Matrix::colSums(corr0^2)) corr$add_columns(corr0, nrow(corr)) } }
Given the set of tuned hyperparameters (see above), we want to approximate the full posterior distribution of an individual's genetic value, $\small{GV_i = \mathrm{\bf{x}}_i^T \boldsymbol\beta}$. From the posterior, one can obtain the posterior mean, $\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}$, and various metrics of uncertainty such as the posterior variance, $\small{var(GV_i|\mathrm{Data})}$, where $\small{GV_i}$ is the genetic value of individual $i$ and $\mathrm{\small{Data}}$ refers to a given GWAS, and $\rho$-credible interval, the interval within which the individual's genetic value falls with probability $\rho$.
An individual's genetic value is their genotype vector multipled by the causal effect vector. We therefore need to obtain MCMC samples from the posterior of the causal effects. This is done by supplying the tuned hyperparameters (best_param
in the example below) to snp_ldpred2_grid
and setting return_sampling_betas = TRUE
. In the example below, we return 500 MCMC samples from the posterior of the causal effects with the option num_iter = 500
.
Note that, when using return_sampling_betas = TRUE
, you must supply one set of parameters only, i.e. best_param
must be a data frame with exactly 1 row (otherwise it will raise an error).
best_param <- data.frame(p = 0.01, h2 = 0.2, sparse = FALSE)
posterior_beta_samples <- snp_ldpred2_grid( corr, df_beta, best_param, return_sampling_betas = TRUE, num_iter = 500) dim(posterior_beta_samples)
Posterior samples of the individual's genetic value are then obtained by multiplying their genotype with the posterior samples of $\boldsymbol{\beta}$.
posterior_gv_samples <- big_prodMat(G, posterior_beta_samples, ind.col = df_beta[["_NUM_ID_"]]) dim(posterior_gv_samples)
The autocorrelation is generally weak, otherwise, you can perform some thinning.
acf(posterior_gv_samples[1, ], lag.max = 10, plot = TRUE)$acf rowMeans(apply(posterior_gv_samples, 1, function(x) { acf(x, lag.max = 10, plot = FALSE)$acf }))
From the posterior samples of genetic value (see above), we can compute summary statistics such as the posterior mean and posterior variance. The posterior mean can be interpreted as the individual's polygenic (risk) score, i.e. $\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}$. The posterior variance, $\small{var(GV_i|\mathrm{Data})}$, is one metric of uncertainty; credible intervals are discussed below.
samples <- posterior_gv_samples[1, ] posterior_gv_mean <- mean(samples) posterior_gv_var <- var(samples)
Another way to quantify uncertainty in an individual's polygenic score is by constructing a $\rho$-credible interval of the individual's genetic value, i.e. the range of values that contains the individual's true genetic value with probability $\rho$.
The example below demonstrates how one can obtain a 95% credible interval of the individual's genetic value.
rho <- 0.95 bound <- (1 - rho) / 2 samples <- posterior_gv_samples[1, ] mean <- mean(samples) lower_ci <- quantile(samples, bound) upper_ci <- quantile(samples, 1 - bound) hist(samples, main = "Posterior distribution of genetic value", xlab = NULL) abline(v = c(lower_ci, mean, upper_ci), col = c("blue", "red", "blue"), lty = c(2,1,2)) legend("topright", legend = c("Credible Interval", "PRS"), col = c("blue", "red"), lty = c(2,1), cex = 0.8)
To get for all individuals:
means <- rowMeans(posterior_gv_samples) limits <- t(apply(posterior_gv_samples, 1, quantile, probs = c(bound, 1 - bound)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.