knitr::opts_chunk$set(echo = TRUE, error = TRUE, warning = FALSE)

In this vignette, I show why clumping (on MAF) should be preferred over pruning for single-nucleotide polymorphism (SNP) arrays.

Pruning and clumping are used to keep a subset of SNPs that are nearly uncorrelated with each other. For instance, pruning is used before Principal Component Analysis to avoid capturing too much variance of linkage disequilibrium (LD) regions. Clumping is used to keep only one representative SNP per region of LD.

Simulation

I generate a (toy) SNP array with 500 individuals and 10 SNPs, where each SNP has a squared correlation > 0.2 with their direct neighbors and only with them. Moreover, the SNPs have an increasing MAF.

gen <- function(n, m) {
  I <- 1:m
  p <- I / (2 * m + 1)

  mat <- outer(I, I, FUN = function(i, j) {
    1 / (abs(i - j) + 1)
  })

  bindata::rmvbin(n, p, bincorr = mat) + 
    bindata::rmvbin(n, p, bincorr = mat)
}

set.seed(1)
X <- gen(500, 10)
print(head(X, 20))
print(round(cor(X)^2, 2)) # squared correlation between SNPs
print(colMeans(X) / 2) # MAF of SNPs

Let's convert this to our format bigSNP in order to write corresponding PLINK files.

library(bigsnpr)
fake <- snp_fake(nrow(X), ncol(X))
fake$genotypes[] <- X
tmp <- tempfile()
snp_writeBed(fake, paste0(tmp, ".bed"))

Pruning

Pruning, as implemented in PLINK, sequentially scan the genomes for pairs of correlated SNPs, keeping only the one with the higher MAF (see this). Let's use PLINK pruning on this simulated toy dataset.

library(glue)
plink <- download_plink()
system(glue("{plink} --bfile {tmp} --indep-pairwise 50 1 0.2 --out {tmp}"))
read.table(glue("{tmp}.prune.out"))
read.table(glue("{tmp}.prune.in"))

The first SNP is pruned because of its correlation with the second. The second SNP is pruned because of its correlation with the third and so on. In the end, only the last SNP (10th) is kept with the LD pruning procedure of PLINK, which corresponds to less than 18% of the total variance.

You can also do the pruning directly in R with snp_pruning():

snp_pruning(fake$genotypes, infos.chr = fake$map$chromosome)

Clumping

A clumping approach would consider the SNPs sorted (in a decreasing order) by a statistic. This statistic is often a test statistic computed from a GWAS of a given phenotype. Yet, for example, for Principal Components Analysis (PCA), the thinning procedure should remain unsupervised (phenotype mustn't be used!). So, we propose to also use the MAF as a statistic of importance, so that SNPs are sorted by decreasing MAF.

The benefit of the clumping procedure is that the index SNP is always the SNP that is kept (because it has the highest MAF). Let's use PLINK again:

# Compute MAFs
write.table(data.frame(SNP = fake$map$marker.ID,
                       P = 1 - snp_MAF(fake$genotypes)),
            file = paste0(tmp, ".frq"), row.names = FALSE, quote = FALSE)
# Clumping
system(glue("{plink} --bfile {tmp} --out {tmp}",
              " --clump {tmp}.frq",
              " --clump-p1 1 --clump-p2 1 --clump-r2 0.2"))
read.table(glue("{tmp}.clumped"), header = TRUE)

So the last SNP (10th) would be considered first and the 9th SNP would be pruned. Then the 8th SNP would be considered and the 7th SNP would be pruned and so on. So, the even SNPs would be kept and the odd SNPs would be pruned, which corresponds to 56.7% of the total variance.

You can also do the clumping (on MAF by default) directly in R with snp_clumping():

snp_clumping(fake$genotypes, infos.chr = fake$map$chromosome)

Conclusion

Pruning removes SNPs in a way that may leave regions of the genome with no representative SNP at all. Clumping is similar to pruning when sorting by descending MAF, but does not have this issue, and should therefore be preferred.



privefl/mypack documentation built on April 20, 2024, 1:51 a.m.