In ThomasChln/snpclust: SNP Feature Summarization for Unsupervised Clustering

SNPClust is demonstrated here on a small subset of the human genome diversity project dataset: 157 European samples out of 940 and 5,000 SNPs out of 660,918.

library(snpclust)

The files are first converted to the snpgds format (cf. R package SNPRelate).

gds_path <- save_hgdp_as_gds()

SNPClust is then called on the GDS filepath.

snpclust_object <- snpclust(gds = gds_path, n_axes = 20)

file.remove(gds_path)

Quality control

Details about the quality control of the dataset are stored in a data frame.

knitr::kable(snpclust_object$qc, 'markdown')

Principal component analysis

The results of principal component analysis (PCA) applied to the quality controlled dataset are stored in a long data frame. Here we see that samples are grouped by country of origin.

ggplot_pca(snpclust_object$pca, group = 'population', ellipses = TRUE)
opticskxi::ggpairs(snpclust_object$pca, axes = 1:3, group = 'population') %>%
  plot
opticskxi::ggpairs(snpclust_object$pca, axes = 4:6, group = 'population') %>%
  plot

SNPs contributions to principal components

For each prinicipal component, the absolute SNP contributions are displayed. SNPs are displayed by chromosome and position.

ggplot_manhat(pca = snpclust_object$pca, gdata = snpclust_object$gdata)

SNPs selection by Gaussian mixture models

The Gaussian mixture models select SNPs above the background noise of other SNPs contributions. Here the selected SNPs are colored in red.

ggplot_selection(peaks = snpclust_object$peaks, pca = snpclust_object$pca)

PCA on the SNPClust selected dataset

When PCA is applied on the SNPClust selected dataset, samples are not grouped by geographic origin anymore.

ggplot_pca(pca = snpclust_object$features_pca, group = 'population',
  ellipses = TRUE)
opticskxi::ggpairs(snpclust_object$features_pca, axes = 1:3,
  group = 'population') %>% plot
opticskxi::ggpairs(snpclust_object$features_pca, axes = 4:6,
  group = 'population') %>% plot