In this document, we show how to compute polygenic risk scores using Stacked Clumping and Thresholding (SCT).
options(htmltools.dir.version = FALSE, width = 75) knitr::opts_knit$set(global.par = TRUE, root.dir = "..") knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', dev = 'png')
You can download data and unzip files in R. We store those files in a directory called "tmp-data"
here.
unzip("data-raw/public-data.zip")
unlink(paste0("tmp-data/public-data", c(".bk", ".rds"))) unlink(paste0("tmp-data/public-data-scores", c(".bk", ".rds")))
You can see there how we generated these data from the 1000 Genomes project. Note that these data are for educational purposes only, not for use as a reference panel.
First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics.
# Load packages bigsnpr and bigstatsr library(bigsnpr) # Read from bed/bim/fam, it generates .bk and .rds files. snp_readBed("tmp-data/public-data.bed") # Attach the "bigSNP" object in R session obj.bigSNP <- snp_attach("tmp-data/public-data.rds") # See how the file looks like str(obj.bigSNP, max.level = 2, strict.width = "cut") # Get aliases for useful slots G <- obj.bigSNP$genotypes CHR <- obj.bigSNP$map$chromosome POS <- obj.bigSNP$map$physical.pos y <- obj.bigSNP$fam$affection - 1 NCORES <- nb_cores() # Check some counts for the 10 first variants big_counts(G, ind.col = 1:10) # Read external summary statistics sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt", select = c(1:6, 10)) str(sumstats)
We split genotype data using part of the data to learn parameters of stacking and another part of the data to evaluate statistical properties of polygenic risk score such as AUC. Here we consider that there are 400 individuals in the training set.
set.seed(1) ind.train <- sample(nrow(G), 400) ind.test <- setdiff(rows_along(G), ind.train)
To match variants contained in genotype data and summary statistics, the variables "chr"
(chromosome number), "pos"
(genetic position), "a0"
(reference allele) and "a1"
(derived allele) should be available in the summary statistics and in the genotype data. These 4 variables are used to match variants between the two data frames.
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "p") map <- obj.bigSNP$map[,-(2:3)] names(map) <- c("chr", "pos", "a0", "a1") info_snp <- snp_match(sumstats, map)
If no or few variants are actually flipped, you might want to disable the strand flipping option. Here, these are simulated data so all variants use the same strand and the same reference.
info_snp <- snp_match(sumstats, map, strand_flip = FALSE) # beta and lpval need to have the same length as ncol(G), CHR and POS # -> one solution is to use missing values and use the 'exclude' parameter beta <- rep(NA, ncol(G)) beta[info_snp$`_NUM_ID_`] <- info_snp$beta lpval <- rep(NA, ncol(G)) lpval[info_snp$`_NUM_ID_`] <- -log10(info_snp$p)
First, the function snp_grid_clumping()
computes sets of variants resulting from the clumping procedure that is applied repeatedly with different values of hyper-parameters (threshold of correlation for clumping, window size, and possibly imputation accuracy threshold). By default, the function uses 28 (7 thresholds of correlation x 4 window sizes) different sets of hyper-parameters for generating sets of variants resulting from clumping.
# The clumping step might take some time to complete all_keep <- snp_grid_clumping(G, CHR, POS, ind.row = ind.train, lpS = lpval, exclude = which(is.na(lpval)), ncores = NCORES) attr(all_keep, "grid")
Then, for each chromosome, for each set of variants resulting from clumping and for each p-value threshold, the function snp_grid_PRS()
computes C+T scores.
multi_PRS <- snp_grid_PRS(G, all_keep, beta, lpval, ind.row = ind.train, backingfile = "tmp-data/public-data-scores", n_thr_lpS = 50, ncores = NCORES) dim(multi_PRS) ## 4200 C+T scores for 400 individuals
A penalized regression is finally used to learn an optimal linear combination of C+T scores.
final_mod <- snp_grid_stacking(multi_PRS, y[ind.train], ncores = NCORES, K = 4) summary(final_mod$mod)
Here, I use K = 4
folds because this example data is very small; the default is to use 10
.
For options for fitting penalized regressions, see this vignette.
From stacking C+T scores, we can derive a unique vector of weights and compare effects resulting from stacking to the initial regression coefficients provided as summary statistics.
new_beta <- final_mod$beta.G ind <- which(new_beta != 0)
library(ggplot2) ggplot(data.frame(y = new_beta, x = beta)[ind, ]) + geom_abline(slope = 1, intercept = 0, color = "red") + geom_abline(slope = 0, intercept = 0, color = "blue") + geom_point(aes(x, y), size = 0.6) + theme_bigstatsr() + labs(x = "Effect sizes from GWAS", y = "Non-zero effect sizes from SCT")
We can use this vector of variant weights to compute polygenic risk scores on the test set and evaluate the Area Under the Curve (AUC).
pred <- final_mod$intercept + big_prodVec(G, new_beta[ind], ind.row = ind.test, ind.col = ind) AUCBoot(pred, y[ind.test])
ggplot(data.frame( Phenotype = factor(y[ind.test], levels = 0:1, labels = c("Control", "Case")), Probability = 1 / (1 + exp(-pred)))) + theme_bigstatsr() + geom_density(aes(Probability, fill = Phenotype), alpha = 0.3)
Instead of stacking, an alternative is to choose the best C+T score based on the computed grid. This procedure is appealing when there are not enough individuals to learn the stacking weights.
library(dplyr) grid2 <- attr(all_keep, "grid") %>% mutate(thr.lp = list(attr(multi_PRS, "grid.lpS.thr")), id = row_number()) %>% tidyr::unnest(cols = "thr.lp") s <- nrow(grid2) grid2$auc <- big_apply(multi_PRS, a.FUN = function(X, ind, s, y.train) { # Sum over all chromosomes, for the same C+T parameters single_PRS <- rowSums(X[, ind + s * (0:2)]) ## replace by 0:21 in real data bigstatsr::AUC(single_PRS, y.train) }, ind = 1:s, s = s, y.train = y[ind.train], a.combine = 'c', block.size = 1, ncores = NCORES)
max_prs <- grid2 %>% arrange(desc(auc)) %>% slice(1:10) %>% print() %>% slice(1)
ind.keep <- unlist(purrr::map(all_keep, max_prs$id)) sum(lpval[ind.keep] > max_prs$thr.lp) AUCBoot( snp_PRS(G, beta[ind.keep], ind.test = ind.test, ind.keep = ind.keep, lpS.keep = lpval[ind.keep], thr.list = max_prs$thr.lp), y[ind.test] )
For this example, the best C+T predictions provides an AUC of 67% whereas stacking, which should be preferred, provides an AUC of 68.5%.
Privé, Florian, et al. "Making the most of Clumping and Thresholding for polygenic scores." Am J Hum Genet (2019).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.