In this document, I show how to use some of the features of packages {bigsnpr} and {bigstatsr}. Note that many functions used here come from {bigstatsr}, and can therefore be used on any matrix-like data, not only genotype data.
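
For instance, most {bigstatsr} functions operate on a Filebacked Big Matrix (FBM). Here is a minimal sketch on simulated (non-genotype) data; the dimensions and values are arbitrary:

# A Filebacked Big Matrix, stored on disk rather than in RAM
library(bigstatsr)
X <- FBM(1000, 500, init = rnorm(500e3))
# Sums and variances of each column
str(big_colstats(X))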

Get data


Download data and unzip files. I store those files in a directory called "tmp-data" here.

# Download the zip once (cached by {runonce}) and store it in "tmp-data"
zip <- runonce::download_file(
  "https://github.com/privefl/bigsnpr/raw/master/data-raw/public-data.zip",
  dir = "tmp-data")
unzip(zip)
# Remove any backing files from a previous run,
# so that snp_readBed() can recreate them below
unlink(paste0("tmp-data/public-data", c(".bk", ".rds")))

You can see there how I generated these data from the 1000 Genomes project.

Read from the PLINK files

# Load packages bigsnpr and bigstatsr
library(bigsnpr)
# Read from bed/bim/fam; this creates new backing files (.bk and .rds)
snp_readBed("tmp-data/public-data.bed")
# Attach the "bigSNP" object in R session
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
# See what it looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection - 1  # disease status: 0 (control) / 1 (case)
sex <- obj.bigSNP$fam$sex
pop <- obj.bigSNP$fam$family.ID
NCORES <- nb_cores()
# Check some counts for the first 10 SNPs
big_counts(G, ind.col = 1:10)

What you must do

You need to

- assess the population structure of the data,
- find variants associated with the disease (GWAS),
- predict the disease status from the genotypes (e.g. with PRS).

For this, you can use whatever tools you want because the data is quite small. In the following section, I give some (scalable) solutions using packages {bigstatsr} and {bigsnpr}.

Solution using {bigstatsr} and {bigsnpr}

Population structure: Principal Component Analysis

Let us compute the first principal components of the scaled genotype matrix:

# Compute partial SVD (10 PCs by default) using random projections
# big_scale() computes means and standard deviations for scaling
svd <- big_randomSVD(G, big_scale(), ncores = NCORES)
# Scree plot
plot(svd)
library(ggplot2)
# Scores plot + color for population
plot(svd, type = "scores") +
  aes(color = pop)
plot(svd, type = "scores", scores = 3:4) +
  aes(color = pop)
# Loadings (effects of each variable for each PC)
plot(svd, type = "loadings", loadings = 1:10, coeff = 0.4)

Looking at the loadings, we can see that the PCA captures some variation due to the large correlation between nearby variants (linkage disequilibrium, LD). To learn more about this possible pitfall, please look at this vignette.
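
To limit this issue, one option (a suggestion, not part of the original exercise) is snp_autoSVD() from {bigsnpr}, which computes the PCs after clumping and after iteratively removing long-range LD regions:

# PCA on a thinned set of variants, with long-range LD regions removed
svd2 <- snp_autoSVD(G, CHR, POS, ncores = NCORES)
plot(svd2, type = "loadings", loadings = 1:10, coeff = 0.4)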

Association: Genome-Wide Association Study (GWAS)

# Association of each variable of `G` with `y` (adjusting for 10 PCs)
gwas <- big_univLogReg(G, y, covar.train = svd$u, ncores = NCORES)
# Histogram of p-values
plot(gwas)
# Q-Q plot
plot(gwas, type = "Q-Q") + xlim(1, NA)  # snp_qq(gwas) + xlim(1, NA)
# Manhattan plot
snp_manhattan(gwas, CHR, POS, npoints = 20e3) +
  geom_hline(yintercept = -log10(5e-8), color = "red")

Polygenic Risk Score (PRS)

with Clumping and Thresholding (C+T)

$$\mathrm{PRS}_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}~,$$

where $\hat\beta_j$ ($p_j$) are the effect sizes (p-values) estimated from the GWAS and $G_{i,j}$ is the allele count (genotype) for individual $i$ and SNP $j$.

# External GWAS summary statistics (effect sizes and p-values)
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
lpval <- -log10(sumstats$p)
# Split individuals into training and test sets
# (a training size of 400 is an arbitrary choice for this example)
set.seed(1)
ind.train <- sample(nrow(G), 400)
ind.test <- setdiff(rows_along(G), ind.train)
# Clumping: keep only weakly-correlated variants,
# prioritizing the most significant ones
ind.keep <- snp_clumping(G, CHR, ind.row = ind.train, S = lpval,
                         infos.pos = POS, ncores = NCORES)
# Grid of p-value thresholds (on the -log10 scale)
THR <- seq_log(1, 8, length.out = 20)
prs <- snp_PRS(G, sumstats$beta[ind.keep], ind.keep = ind.keep,
               lpS.keep = lpval[ind.keep], thr.list = THR)
# Learn the optimal threshold on the training set
aucs <- apply(prs[ind.train, ], 2, AUC, target = y[ind.train])
plot(THR, aucs, xlab = "-log10(p-value)", ylab = "AUC", pch = 20)
# Evaluate on the test set
AUC(prs[ind.test, which.max(aucs)], y[ind.test])
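
If you also want a confidence interval for this AUC, AUCBoot() from {bigstatsr} reports a bootstrap one:

# AUC of the best C+T score on the test set, with a 95% bootstrap CI
AUCBoot(prs[ind.test, which.max(aucs)], y[ind.test])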

with Stacked Clumping and Thresholding (SCT)
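
SCT extends C+T by computing scores for a large grid of hyper-parameters and stacking them with penalized regression. Below is a minimal sketch using the snp_grid_*() functions from {bigsnpr}, assuming the same training/test split as above; the default grids and the backing file name are arbitrary choices that would likely benefit from tuning:

# C+T scores over a grid of clumping parameters and p-value thresholds
all_keep <- snp_grid_clumping(G, CHR, POS, lpS = lpval,
                              ind.row = ind.train, ncores = NCORES)
multi_PRS <- snp_grid_PRS(G, all_keep, betas = sumstats$beta, lpS = lpval,
                          ind.row = ind.train,
                          backingfile = "tmp-data/public-data-scores",
                          ncores = NCORES)
# Stack all these scores using penalized logistic regression
final_mod <- snp_grid_stacking(multi_PRS, y[ind.train], ncores = NCORES)
# Resulting single vector of per-variant effects, used to predict the test set
new_beta <- final_mod$beta.G
ind <- which(new_beta != 0)
pred_sct <- final_mod$intercept +
  big_prodVec(G, new_beta[ind], ind.row = ind.test, ind.col = ind)
AUC(pred_sct, y[ind.test])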

with Penalized Logistic Regression (PLR)

$$\arg\!\min_{\beta_0,~\beta}(\lambda, \alpha)\left\{ \underbrace{ -\sum_{i=1}^n \left( y_i \log\left(p_i\right) + (1 - y_i) \log\left(1 - p_i\right) \right) }_\text{Loss function} + \underbrace{ \lambda \left((1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1\right) }_\text{Penalization} \right\}$$

where $p_i = 1 / \left(1 + \exp\left(-(\beta_0 + x_i^T \beta)\right)\right)$, $x_i$ denotes the genotypes and covariates of individual $i$, $y_i$ is the disease status, and $\lambda$ and $\alpha$ are two regularization hyper-parameters.


If you want to learn more about our implementation of PLR, please look at this paper.


# Penalized logistic regression for many alphas and lambdas
mod <- big_spLogReg(G, y[ind.train], ind.train, covar.train = svd$u[ind.train, ],
                    K = 5, alphas = 10^(-(0:4)), ncores = NCORES)
# Plot regularization paths (from high lambda to low lambda) 
# for each validation set (color) and each alpha (facet)
plot(mod)
# Get summaries of models
summary(mod)
# Get the predictions for the test set
pred <- predict(mod, G, ind.test, covar.row = svd$u[ind.test, ])
# Assess the Area Under the ROC Curve
AUC(pred, y[ind.test])
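
You can also restrict summary() to the best model (the one minimizing the validation loss), e.g. to get its number of non-zero variables; a short sketch using the `best.only` argument:

# Number of variables used by the best model
summary(mod, best.only = TRUE)$nb_var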

