options(htmltools.dir.version = FALSE) knitr::opts_chunk$set(fig.align = "center", dev = "png", cache = TRUE, fig.asp = 0.7, out.width = "70%")
In this vignette, I show how to perform Principal Component Analysis (PCA) with packages bigstatsr and bigsnpr [@Prive2017]. I also show why pruning and removing long-range Linkage Disequilibrium (LD) are two important steps before computing PCs in order to capture population structure [@Abdellaoui2013].
I use data from a case/control cohort on celiac disease [@Dubois2010]. The data has already been QCed, genotyped SNPs have been imputed and binary PLINK files have been converted to the "bigSNP" format used in bigsnpr (see these preprocessing steps).
library(bigsnpr) library(ggplot2) celiac <- snp_attach("backingfiles/celiacQC.rds") G <- celiac$genotypes CHR <- celiac$map$chromosome POS <- celiac$map$physical.pos NCORES <- nb_cores() # "Verification" there is no missing value big_counts(G, ind.col = 1:12) # OK # Get population from external files pop.files <- list.files(path = "data", pattern = "cluster_*", full.names = TRUE) pop <- snp_getSampleInfos(celiac, pop.files)[[1]] pop.names <- c("Netherlands", "Italy", "UK1", "UK2", "Finland")
svd1 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES)
plot(svd1, type = "scores") + aes(color = pop.names[pop]) + labs(color = "Population")
First two PCs capture population structure.
plot(svd1, type = "scores", scores = 3:4) + aes(color = pop.names[pop]) + labs(color = "Population")
PC3 and PC4 don't capture population structure.
# The SNP with max loading for PC3 theone1.1 <- which.max(abs(svd1$v[, 3])) # The SNP with max loading for PC4 theone1.2 <- which.max(abs(svd1$v[, 4])) plot(svd1, type = "scores", scores = 3:4) + aes(color = as.factor(paste(G[, theone1.1], G[, theone1.2]))) + labs(color = "Genotypes") + guides(colour = guide_legend(override.aes = list(size = rel(4))))
So basically, PC3 is capturing a variation of one SNP, as well as PC4. These two SNPs are located in long-range LD regions and corresponds to peaks in the next figure.
plot(svd1, type = "loadings", loadings = 1:4, coeff = 0.7)
In fact, I'm using clumping on the Minor Allele Frequencies (MAF) instead of pruning. You can see this vignette to know why.
ind.keep2 <- snp_clumping(G, CHR, ncores = NCORES) svd2 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES, ind.col = ind.keep2)
plot(svd2, type = "scores", scores = 3:4) + aes(color = pop.names[pop]) + labs(color = "Population")
theone2 <- ind.keep2[which.max(abs(svd2$v[, 4]))] plot(svd2, type = "scores", scores = 3:4) + aes(color = as.factor(G[, theone2])) + labs(color = "Genotype")
PC4 again captures variation at one SNP, which is possibly in a long-range LD region.
As recommended by @Price2008a, it is possible to remove a list of predetermined long-range LD regions.
ind.keep3 <- cols_along(G)[-snp_indLRLDR(CHR, POS)] svd3 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES, ind.col = ind.keep3)
plot(svd3, type = "scores", scores = 3:4) + aes(color = pop.names[pop]) + labs(color = "Population")
This is quite better at capturing population structure. Yet..
plot(svd3, type = "loadings", loadings = 1:10, coeff = 0.6)
So, first 4 PCs are mostly capturing population structure, but the next PCs are likely to capture only LD structure.
plot(svd3, type = "scores", scores = 5:6) + aes(color = pop.names[pop]) + labs(color = "Population")
theone3 <- ind.keep3[which.max(abs(svd3$v[, 5]))] plot(svd3, type = "scores", scores = 5:6) + aes(color = as.factor(G[, theone3])) + labs(color = "Genotype")
As recommend by @Abdellaoui2013, I prune AND remove long-range LD-regions.
ind.keep4 <- snp_clumping(G, CHR, ncores = NCORES, exclude = snp_indLRLDR(CHR, POS)) svd4 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES, ind.col = ind.keep4)
plot(svd4, type = "scores", scores = 3:4) + aes(color = pop.names[pop]) + labs(color = "Population")
plot(svd4, type = "scores", scores = 5:6) + aes(color = pop.names[pop]) + labs(color = "Population")
Maybe, PC5 is now capturing some non-European ancestry, or something else.
plot(svd4, type = "loadings", loadings = 1:10, coeff = 0.6)
The loadings are all approximately normally distributed (no huge peak), we're good.
You can use either the previous method (for human data only) or try the following automatic procedure to prune and remove long-range LD regions [@Prive2017].
svd0 <- snp_autoSVD(G, CHR, POS, ncores = NCORES) attr(svd0, "lrldr")
plot(svd0, type = "scores", scores = 3:4) + aes(color = pop.names[pop]) + labs(color = "Population")
plot(svd0, type = "scores", scores = 5:6) + aes(color = pop.names[pop]) + labs(color = "Population")
plot(svd0, type = "loadings", loadings = 1:10, coeff = 0.6)
Always use both pruning and removing of long-range LD regions when computing the PCs, as recommended by @Abdellaoui2013. To check that the results are capturing population structure, you should plot PCA scores. To check that PCs are not capturing LD, you should check PCA loadings.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.