In privefl/mypack: Analysis of Massive SNP Arrays

Here we show how to estimate the full posterior distribution of genetic value using LDpred2 as described in Large uncertainty in individual PRS estimation impacts PRS-based risk stratification.

This tutorial assumes that you have already tuned the hyperparameters using grid search. Please refer to Computing polygenic scores using LDpred2 for tutorials on data preprocessing and LDpred2-grid.

options(htmltools.dir.version = FALSE, width = 75, max.print = 30)
knitr::opts_knit$set(global.par = TRUE, root.dir = "..")
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', dev = 'png')

# install.packages("runonce")
zip <- runonce::download_file(
  "https://github.com/privefl/bigsnpr/raw/master/data-raw/public-data3.zip",
  dir = "tmp-data")
unzip(zip)


## ---- echo=FALSE---------------------------------------------------------
unlink(paste0("tmp-data/public-data3", c(".bk", ".rds")))


## ------------------------------------------------------------------------
# Load packages bigsnpr and bigstatsr
library(bigsnpr)
# Read from bed/bim/fam, it generates .bk and .rds files.
snp_readBed("tmp-data/public-data3.bed")
# Attach the "bigSNP" object in R session
obj.bigSNP <- snp_attach("tmp-data/public-data3.rds")
# See how the file looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection
(NCORES <- nb_cores())


## ------------------------------------------------------------------------
# Read external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data3-sumstats.txt")
str(sumstats)


## ------------------------------------------------------------------------
set.seed(1)
ind.val <- sample(nrow(G), 350)
ind.test <- setdiff(rows_along(G), ind.val)

sumstats$n_eff <- sumstats$N
map <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))
df_beta <- snp_match(sumstats, map, join_by_pos = FALSE)  # use rsid instead of pos


## ------------------------------------------------------------------------
POS2 <- obj.bigSNP$map$genetic.dist
tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  # print(chr)

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'G'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]

  corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
                   infos.pos = POS2[ind.chr2], ncores = NCORES)

  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

Approximating the posterior distribution of genetic value

Given the set of tuned hyperparameters (see above), we want to approximate the full posterior distribution of an individual's genetic value, $\small{GV_i = \mathrm{\bf{x}}_i^T \boldsymbol\beta}$. From the posterior, one can obtain the posterior mean, $\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}$, and various metrics of uncertainty such as the posterior variance, $\small{var(GV_i|\mathrm{Data})}$, where $\small{GV_i}$ is the genetic value of individual $i$ and $\mathrm{\small{Data}}$ refers to a given GWAS, and $\rho$-credible interval, the interval within which the individual's genetic value falls with probability $\rho$.

An individual's genetic value is their genotype vector multipled by the causal effect vector. We therefore need to obtain MCMC samples from the posterior of the causal effects. This is done by supplying the tuned hyperparameters (best_param in the example below) to snp_ldpred2_grid and setting return_sampling_betas = TRUE. In the example below, we return 500 MCMC samples from the posterior of the causal effects with the option num_iter = 500.

Note that, when using return_sampling_betas = TRUE, you must supply one set of parameters only, i.e. best_param must be a data frame with exactly 1 row (otherwise it will raise an error).

best_param <- data.frame(p = 0.01, h2 = 0.2, sparse = FALSE)

posterior_beta_samples <- snp_ldpred2_grid(
  corr, df_beta, best_param,
  return_sampling_betas = TRUE, num_iter = 500)
dim(posterior_beta_samples)

Posterior samples of the individual's genetic value are then obtained by multiplying their genotype with the posterior samples of $\boldsymbol{\beta}$.

posterior_gv_samples <- big_prodMat(G, posterior_beta_samples, ind.col = df_beta[["_NUM_ID_"]])
dim(posterior_gv_samples)

The autocorrelation is generally weak, otherwise, you can perform some thinning.

acf(posterior_gv_samples[1, ], lag.max = 10, plot = TRUE)$acf
rowMeans(apply(posterior_gv_samples, 1, function(x) {
  acf(x, lag.max = 10, plot = FALSE)$acf
}))

Estimating the posterior mean and posterior variance

From the posterior samples of genetic value (see above), we can compute summary statistics such as the posterior mean and posterior variance. The posterior mean can be interpreted as the individual's polygenic (risk) score, i.e. $\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}$. The posterior variance, $\small{var(GV_i|\mathrm{Data})}$, is one metric of uncertainty; credible intervals are discussed below.

samples <- posterior_gv_samples[1, ]
posterior_gv_mean <- mean(samples)
posterior_gv_var <- var(samples)

Constructing credible intervals of genetic value

Another way to quantify uncertainty in an individual's polygenic score is by constructing a $\rho$-credible interval of the individual's genetic value, i.e. the range of values that contains the individual's true genetic value with probability $\rho$.

The example below demonstrates how one can obtain a 95% credible interval of the individual's genetic value.

rho <- 0.95
bound <- (1 - rho) / 2
samples <- posterior_gv_samples[1, ]
mean <- mean(samples)
lower_ci <- quantile(samples, bound)
upper_ci <- quantile(samples, 1 - bound)
hist(samples, main = "Posterior distribution of genetic value", xlab = NULL) 
abline(v = c(lower_ci, mean, upper_ci), col = c("blue", "red", "blue"), lty = c(2,1,2))
legend("topright", legend = c("Credible Interval", "PRS"),
       col = c("blue", "red"), lty = c(2,1), cex = 0.8)

To get for all individuals:

means <- rowMeans(posterior_gv_samples)
limits <- t(apply(posterior_gv_samples, 1, quantile, probs = c(bound, 1 - bound)))

privefl/mypack documentation built on April 20, 2024, 1:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

privefl/mypack
Analysis of Massive SNP Arrays

In privefl/mypack: Analysis of Massive SNP Arrays

Approximating the posterior distribution of genetic value

Estimating the posterior mean and posterior variance

Constructing credible intervals of genetic value

R Package Documentation

Browse R Packages

We want your feedback!

privefl/mypack Analysis of Massive SNP Arrays

In privefl/mypack: Analysis of Massive SNP Arrays

Approximating the posterior distribution of genetic value

Estimating the posterior mean and posterior variance

Constructing credible intervals of genetic value

R Package Documentation

Browse R Packages

We want your feedback!

privefl/mypack
Analysis of Massive SNP Arrays