plot_clusters: PCA, tSNE, and umap plots from snpRdata.
In hemstrow/snpR: Whole-Genome Analysis Tools for Use with Single Nucleotide Polymorphism Data

Description

Generate a ggplot cluster plot based on PCA, tSNE as implemented in Rtsne (using the Barnes-Hut approximation when theta > 0), the Uniform Manifold Approximation and Projection (UMAP) approach implemented in umap, or the Discriminant Analysis of Principal Components (DAPC) implemented in dapc.

Usage

plot_clusters(
  x,
  facets = NULL,
  plot_type = "pca",
  check_duplicates = FALSE,
  minimum_percent_coverage = FALSE,
  minimum_genotype_percentage = FALSE,
  smart_PCA = TRUE,
  interpolation_method = "bernoulli",
  dims = 2,
  initial_dims = 50,
  perplexity = FALSE,
  theta = 0,
  iter = 1000,
  viridis.option = "viridis",
  alt.palette = NULL,
  ncp = NULL,
  ncp.max = 5,
  dapc_clustering_max_n_clust = 20,
  dapc_clustering_npca = NULL,
  dapc_clustering_nclust = NULL,
  dapc_npca = NULL,
  dapc_ndisc = NULL,
  ellipse_size = 1.5,
  seg_lines = TRUE,
  shape_has_more_levels = TRUE,
  update_bib = FALSE,
  verbose = FALSE,
  simplify_output = FALSE,
  ...
)

Arguments

`x`	snpRdata object.
`facets`	character, default NULL. Categorical sample-level metadata variables by which to color points. Up to two different sample-specific facets may be provided. See `Facets_in_snpR` for more details.
`plot_type`	character, default "pca". Types of plots to be produced, one or more of c("pca", "tSNE", "umap", "dapc"). Options: pca: Principal Component Analysis, plotting the first two dimensions of variance. tSNE: t-Stochastic Neighbor Embedding, which collapses multiple dimensions of variance (see the `initial_dims` argument) into `dims` dimensions, two by default. umap: Uniform Manifold Approximation and Projection, which collapses multiple dimensions of variance into two. dapc: Discriminant Analysis of Principal Components, which clusters individuals into groups for plotting via PCA. See Details.
`check_duplicates`	logical, default FALSE. Checks for any duplicated individuals, which will cause errors. Since these rarely exist and drastically slow down function run-time, this defaults to FALSE.
`minimum_percent_coverage`	numeric, default FALSE. Proportion of samples a SNP must be sequenced in to be used in generating plots.
`minimum_genotype_percentage`	numeric, default FALSE. Proportion of SNPs a sample must be sequenced at in order to be used in plots.
`smart_PCA`	logical, default TRUE. If TRUE, uses the centering approach of Patterson et al. (2006) prior to plot construction. Note that this also avoids the need for interpolation, so interpolation is set to FALSE in this case.
`interpolation_method`	character, default "bernoulli". Interpolation method to use for missing data. Options: bernoulli: Interpolated via a binomial draw for each allele against the minor allele frequency. af: Interpolated by inserting the expected number of minor alleles at missing data points given locus minor allele frequencies. iPCA: An iterative PCA approach that interpolates based on SNP/SNP covariance via `imputePCA`. If the ncp argument is not defined, the number of components used for interpolation will be estimated using `estim_ncpPCA`. In that case, this method is much slower than the other methods, especially for large datasets. Setting an ncp of 2-5 generally results in reasonable interpolations without the time cost. Ignored if `smart_PCA` is TRUE.
`dims`	numeric, default 2. Output dimensionality.
`initial_dims`	numeric, default 50. The number of dimensions retained in the initial PCA step during tSNE.
`perplexity`	numeric, default FALSE. Perplexity parameter, by default found by `hbeta`, with beta = 1.
`theta`	numeric, default 0. Theta parameter from `Rtsne`. The default, 0, performs an exhaustive (exact) search.
`iter`	numeric, default 1000. Number of tSNE iterations/umap epochs to perform.
`viridis.option`	character, default "viridis". Viridis color scale option to use for significance lines and SNP labels. See `scale_gradient` for details.
`alt.palette`	character or NULL, default NULL. Optional palette of colors to use instead of the viridis palette.
`ncp`	numeric or NULL, default NULL. Number of components to consider for iPCA interpolation of missing data in sn format. If NULL, the optimum number will be estimated, with the maximum specified by ncp.max. This can be very slow.
`ncp.max`	numeric, default 5. Maximum number of components to check for when determining the optimum number of components to use when interpolating sn data using the iPCA approach.
`dapc_clustering_max_n_clust`	numeric or NULL, default 20. If not NULL, the clustering parameters for DAPC calculation will be selected interactively, with `dapc_clustering_max_n_clust` max clusters considered. If NULL, the parameters `dapc_clustering_npca`, `dapc_clustering_nclust`, `dapc_ndisc`, and `dapc_npca` must instead be set.
`dapc_clustering_npca`	numeric or NULL, default NULL. The number of PCs to use for assigning individuals to clusters with DAPC. Interactive decision is recommended using `dapc_clustering_max_n_clust`.
`dapc_clustering_nclust`	numeric or NULL, default NULL. The number of clusters to use for DAPC. Interactive decision is recommended using `dapc_clustering_max_n_clust`.
`dapc_npca`	numeric or NULL, default NULL. The number of PCs to use for conducting the DAPC itself after assigning individuals to clusters. Interactive decision is recommended using `dapc_clustering_max_n_clust`.
`dapc_ndisc`	numeric or NULL, default NULL. The number of discriminants to use for conducting the DAPC itself after assigning individuals to clusters. Interactive decision is recommended using `dapc_clustering_max_n_clust`.
`ellipse_size`	numeric or NULL, default 1.5. The scaled-size of the ellipse to use for DAPC. If NULL, no ellipses will be calculated or drawn.
`seg_lines`	logical, default TRUE. If TRUE, lines will be drawn between points and cluster centers when plotting with DAPC.
`shape_has_more_levels`	logical, default TRUE. If TRUE and two facets are requested, the facet with more levels will be plotted as shapes. If FALSE, the facet with fewer levels will be plotted as shapes. Ignored if the facet that would get shapes has more than 6 levels.
`update_bib`	character or FALSE, default FALSE. If a file path to an existing .bib library or a valid path for a new one, will update or create a .bib file including any new citations for methods used. Useful given that this function does not return a snpRdata object, so `citations` cannot be used to fetch references.
`verbose`	logical, default FALSE. If TRUE, some progress updates may be reported.
`simplify_output`	logical, default FALSE. If TRUE, only the ggplot object will be returned. This is optimal, since the data is already returned in that object, but is not the default due to backwards consistency with old code. Note, however, that PCA loadings will only be returned if this is true.
`...`	Other arguments, passed to `Rtsne` or `umap`.
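
The `smart_PCA` argument above refers to the centering approach of Patterson et al. (2006). As a rough sketch of that idea in base R (not the internal code of plot_clusters), each locus is centered on its mean genotype and scaled by sqrt(p(1 - p)), where p is the allele frequency, and missing genotypes are then set to zero, which is why no interpolation is needed. The toy matrix, its orientation (individuals in rows, loci in columns), and the helper name `patterson_scale` are assumptions made for illustration.

```r
# Toy genotype matrix: rows are individuals, columns are loci, entries are
# minor allele counts (0, 1, or 2). NA marks a missing genotype.
geno <- matrix(c(0,  1,  2, 0,
                 1, NA,  0, 0,
                 0,  0,  1, 0,
                 2,  1, NA, 0,
                 0,  1,  1, 0),
               nrow = 5, byrow = TRUE)

# Hypothetical helper: center and scale each locus, then zero-fill missing data.
patterson_scale <- function(g) {
  mu <- colMeans(g, na.rm = TRUE)   # mean genotype per locus
  p <- mu / 2                       # allele frequency per locus
  keep <- p > 0 & p < 1             # drop monomorphic loci (zero variance)
  z <- sweep(g[, keep, drop = FALSE], 2, mu[keep], "-")
  z <- sweep(z, 2, sqrt(p[keep] * (1 - p[keep])), "/")
  z[is.na(z)] <- 0                  # missing entries contribute nothing
  z
}

z <- patterson_scale(geno)

# The matrix is already centered, so prcomp should not center it again.
pca <- prcomp(z, center = FALSE)
```

Here the fourth locus is monomorphic and is dropped before scaling, so `z` has three columns.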

Details

Works by conversion to the "sn" format described in format_snps with interpolated missing genotypes for all methods other than DAPC.
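
To make the interpolation step more concrete, the following base R sketch fills missing genotypes using the "af" and "bernoulli" rules as they are described for the `interpolation_method` argument. It is only an illustration under stated assumptions: the toy matrix, its orientation (individuals in rows, loci in columns), and the helper name `interpolate_sn` are hypothetical and are not part of snpR.

```r
# Toy genotype matrix: rows are individuals, columns are loci, entries are
# minor allele counts (0, 1, or 2). NA marks a missing genotype.
geno <- matrix(c(0,  1,  2, NA,
                 1, NA,  0,  2,
                 0,  0,  1,  1,
                 2,  1, NA,  0,
                 0,  1,  1,  1),
               nrow = 5, byrow = TRUE)

# Minor allele frequency per locus, estimated from observed genotypes only.
maf <- colMeans(geno, na.rm = TRUE) / 2

# Hypothetical helper: fill each missing genotype from its locus frequency.
interpolate_sn <- function(g, maf, method = c("bernoulli", "af")) {
  method <- match.arg(method)
  miss <- which(is.na(g), arr.ind = TRUE)  # row/col index of each NA
  if (nrow(miss) == 0) return(g)
  p <- maf[miss[, "col"]]                  # frequency at each missing locus
  g[miss] <- if (method == "af") {
    2 * p                                  # expected minor allele count
  } else {
    rbinom(length(p), size = 2, prob = p)  # one binomial draw per allele
  }
  g
}

set.seed(1)
geno_af <- interpolate_sn(geno, maf, "af")
geno_bern <- interpolate_sn(geno, maf, "bernoulli")
```

The "af" fill is deterministic and can produce non-integer values, while the "bernoulli" fill always returns 0, 1, or 2 but changes between runs unless a seed is set.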

Cluster plots can be produced via PCA, tSNE, umap, or DAPC. The PCA point coordinates are calculated using prcomp. By default, the first two principal components are plotted. A PC matrix will also be returned for easy plotting of other PCs.

tSNE coordinates are calculated via Rtsne, which should be consulted for more details about this method. Stated simply, tSNE attempts to compress a multi-dimensional PCA (PCs 1:n) into fewer dimensions while retaining as much information as possible. As such, a tSNE plot can be seen as a representation of many different PC axes compressed into a single two-dimensional plot. This compression process is stochastic, so plots will vary somewhat between runs, and multiple runs are recommended.

Uniform Manifold Approximation and Projection (UMAP) coordinates are calculated via umap. UMAP similarly attempts to reduce multi-dimensional results to a two-dimensional visualization.

DAPC instead clusters individuals into n groups, a number which by default is chosen interactively (again using a PCA framework).
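
As a rough illustration of the PCA step described above, the following base R sketch runs prcomp on a simulated, complete genotype matrix and collects scores for PCs beyond the first two. The simulated data, the two-population setup, and all object names are assumptions made for illustration; this is not the internal code of plot_clusters.

```r
set.seed(42)

# Simulate 20 individuals from two hypothetical populations that differ in
# allele frequency, genotyped at 50 loci (rows: individuals, columns: loci).
n_ind <- 20
n_loci <- 50
pop <- rep(c("A", "B"), each = n_ind / 2)
freq <- ifelse(pop == "A", 0.2, 0.6)
geno <- matrix(rbinom(n_ind * n_loci, size = 2, prob = freq), nrow = n_ind)

# PCA on the centered genotype matrix.
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores for the first four PCs alongside the population labels, ready for
# plotting any pair of axes (for example PC3 against PC4).
scores <- data.frame(pop = pop, pca$x[, 1:4])
```

Any pair of columns in `scores` can then be passed to a plotting function, which is the kind of follow-up plot the returned PC matrix is meant to support.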

Note that clusters and relative positions of samples from both tSNE and UMAP may not reliably represent the relationships present in the higher PCA dimensions from which they are created. As such, it is probably not wise to use these methods to draw conclusions about relationships. They are useful exploratory tools, however, and so are kept available here.

For more details on tSNE arguments, Rtsne should be consulted.

Additional arguments to umap can also be provided. More information on these arguments can be found in umap.defaults.

Data points for individuals can be automatically colored by any sample-level facet categories. Facets should be provided as described in Facets_in_snpR. Up to two different sample-level facets can be plotted simultaneously. If two facets are supplied, one will be shown by point shape and the other by color, as long as one has fewer than 6 total levels. By default the facet with more levels is given shapes, which can be controlled using the shape_has_more_levels argument. If both have more than 6 levels, one will be shown by point fill and the other by point outline.

Value

A list containing:

data: Raw PCA, tSNE, umap, and/or DAPC plot data.
plots: ggplot PCA, tSNE, umap, and/or DAPC plots.

Each of these two lists may contain one to four objects, one for each PCA, tSNE, umap, or DAPC plot requested, named "pca", "tsne", "umap", and "dapc" respectively. If a PCA was run, the loadings will also be returned in the top-level list. If simplify_output is TRUE, only the ggplot list is returned.

Author(s)

William Hemstrom

Matt Thorstensen

References

Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne.

Van Der Maaten, L. & Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.

McInnes, L. & Healy, J. (2018). UMAP: uniform manifold approximation and projection. Preprint at URL: https://arxiv.org/abs/1802.03426.

Jombart, T., Devillard, S. & Balloux, F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11, 94 (2010). https://doi.org/10.1186

Examples

# plot colored by population
plot_clusters(stickSNPs, "pop")

# plot colored by population and family
plot_clusters(stickSNPs, "pop.fam")

hemstrow/snpR documentation built on July 5, 2025, 4:38 a.m.