plot_clusters: PCA, tSNE, and umap plots from snpRdata.

View source: R/plotting_functions.R

plot_clusters    R Documentation

Description

Generate a ggplot cluster plot based on PCA, the Barnes-Hut simulation at theta>0 implemented in Rtsne, the Uniform Manifold Approximation and Projection approach implemented in umap, or the Discriminant Analysis of Principal Components implemented in dapc.

Usage

plot_clusters(
  x,
  facets = NULL,
  plot_type = "pca",
  check_duplicates = FALSE,
  minimum_percent_coverage = FALSE,
  minimum_genotype_percentage = FALSE,
  smart_PCA = TRUE,
  interpolation_method = "bernoulli",
  dims = 2,
  initial_dims = 50,
  perplexity = FALSE,
  theta = 0,
  iter = 1000,
  viridis.option = "viridis",
  alt.palette = NULL,
  ncp = NULL,
  ncp.max = 5,
  dapc_clustering_max_n_clust = 20,
  dapc_clustering_npca = NULL,
  dapc_clustering_nclust = NULL,
  dapc_npca = NULL,
  dapc_ndisc = NULL,
  ellipse_size = 1.5,
  seg_lines = TRUE,
  shape_has_more_levels = TRUE,
  update_bib = FALSE,
  verbose = FALSE,
  simplify_output = FALSE,
  ...
)

Arguments

x

snpRdata object.

facets

character, default NULL. Categorical sample-level metadata variables by which to color points. Up to two different sample-specific facets may be provided. See Facets_in_snpR for more details.

plot_type

character, default "pca". c("pca", "tSNE", "umap", "dapc"). Types of plots to be produced. Options:

  • pca: Principal Component Analysis, first two dimensions of variance.

  • tSNE: t-Stochastic Neighbor Embedding, which collapses initial_dims (see argument) dimensions of variance into dims (see argument, default two) dimensions.

  • umap: Uniform Manifold Approximation and Projection, which collapses multiple dimensions of variance into two.

  • dapc: Discriminant analysis of principal components, clusters individuals into groups for plotting via PCA.

See description for details.

check_duplicates

logical, default FALSE. Checks for any duplicated individuals, which will cause errors. Since these rarely exist and drastically slow down function run-time, this defaults to FALSE.

minimum_percent_coverage

numeric, default FALSE. Proportion of samples a SNP must be sequenced in to be used in generating plots.

minimum_genotype_percentage

numeric, default FALSE. Proportion of SNPs a sample must be sequenced at in order to be used in plots.

smart_PCA

logical, default TRUE. If TRUE, uses the centering approach of Patterson et al. (2006) prior to plot construction. Note that this also avoids the need for interpolation, so interpolation is set to FALSE in this case.

interpolation_method

character, default "bernoulli". Interpolation method to use for missing data. Options:

  • bernoulli: Interpolated via binomial draw for each allele against minor allele frequency.

  • af: Interpolated by inserting the expected number of minor alleles at missing data points given loci minor allele frequencies.

  • iPCA: This is an iterative PCA approach that interpolates based on SNP/SNP covariance via imputePCA. If the ncp argument is not defined, the number of components used for interpolation will be estimated using estim_ncpPCA. In that case, this method is much slower than the other methods, especially for large datasets. Setting an ncp of 2-5 generally results in reasonable interpolations without the time cost.

Ignored if smart_PCA is TRUE.

dims

numeric, default 2. Output dimensionality.

initial_dims

numeric, default 50. The number of dimensions retained in the initial PCA step during tSNE.

perplexity

numeric, default FALSE. Perplexity parameter, by default found by hbeta, with beta = 1.

theta

numeric, default 0. Theta parameter from Rtsne. The default of 0 runs an exhaustive (exact) search.

iter

numeric, default 1000. Number of tSNE iterations/umap epochs to perform.

viridis.option

character, default "viridis". Viridis color scale option to use for coloring points. See scale_gradient for details.

alt.palette

character or NULL, default NULL. Optional palette of colors to use instead of the viridis palette.

ncp

numeric or NULL, default NULL. Number of components to consider for iPCA sn format interpolations of missing data. If NULL, the optimum number will be estimated, with the maximum specified by ncp.max. This can be very slow.

ncp.max

numeric, default 5. Maximum number of components to check for when determining the optimum number of components to use when interpolating sn data using the iPCA approach.

dapc_clustering_max_n_clust

numeric or NULL, default 20. If not NULL, the clustering parameters for DAPC calculation will be selected interactively, with dapc_clustering_max_n_clust max clusters considered. If NULL, the parameters dapc_clustering_npca, dapc_clustering_nclust, dapc_ndisc, and dapc_npca must instead be set.

dapc_clustering_npca

numeric or NULL, default NULL. The number of PCs to use for assigning individuals to clusters with DAPC. Interactive decision is recommended using dapc_clustering_max_n_clust.

dapc_clustering_nclust

numeric or NULL, default NULL. The number of clusters to use for DAPC. Interactive decision is recommended using dapc_clustering_max_n_clust.

dapc_npca

numeric or NULL, default NULL. The number of PCs to use for conducting the DAPC itself after assigning individuals to clusters. Interactive decision is recommended using dapc_clustering_max_n_clust.

dapc_ndisc

numeric or NULL, default NULL. The number of discriminants to use for conducting the DAPC itself after assigning individuals to clusters. Interactive decision is recommended using dapc_clustering_max_n_clust.

ellipse_size

numeric or NULL, default 1.5. The scaled-size of the ellipse to use for DAPC. If NULL, no ellipses will be calculated or drawn.

seg_lines

logical, default TRUE. If TRUE, lines will be drawn between points and cluster centers when plotting with DAPC.

shape_has_more_levels

logical, default TRUE. If TRUE and two facets are requested, the facet with more levels will be plotted as shapes. If FALSE, the facet with fewer levels will be plotted as shapes. Ignored if the facet that would get shapes has more than 6 levels.

update_bib

character or FALSE, default FALSE. If set to a file path for an existing .bib library, or to a valid path for a new one, a .bib file will be updated or created with any new citations for the methods used. Useful given that this function does not return a snpRdata object, so citations cannot be used to fetch references.

verbose

logical, default FALSE. If TRUE, some progress updates may be reported.

simplify_output

logical, default FALSE. If TRUE, only the ggplot object will be returned. This is optimal, since the data is already returned in that object, but is not the default in order to stay consistent with old code. Note, however, that PCA loadings will only be returned if this is true.

...

Other arguments, passed to Rtsne or umap.

Details

Works by conversion to the "sn" format described in format_snps with interpolated missing genotypes for all methods other than DAPC.
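The two simpler interpolation options listed under interpolation_method can be illustrated with a small base R sketch. This is not snpR's internal code: the toy matrix sn and the helper names interpolate_af and interpolate_bernoulli are hypothetical, and the sketch assumes an "sn"-style layout with samples in rows, SNPs in columns, and minor allele counts (0, 1, or 2) as entries.

```r
# Toy "sn"-style matrix (an assumption for illustration): rows are samples,
# columns are SNPs, entries are minor allele counts, NA marks missing data.
set.seed(42)
sn <- matrix(c(0, 1, NA, 2,
               1, NA, 0, 0,
               2, 1, 1, NA,
               0, 0, 1, 1),
             nrow = 4, byrow = TRUE)

# Minor allele frequency per SNP, estimated from observed genotypes only.
maf <- colMeans(sn, na.rm = TRUE) / 2

# "af"-style fill (hypothetical helper): insert the expected number of
# minor alleles, 2 * maf, at each missing genotype.
interpolate_af <- function(m, maf) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) m[miss, j] <- 2 * maf[j]
  }
  m
}

# "bernoulli"-style fill (hypothetical helper): draw each of the two
# alleles against the minor allele frequency, i.e. a binomial with size 2.
interpolate_bernoulli <- function(m, maf) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) m[miss, j] <- rbinom(sum(miss), size = 2, prob = maf[j])
  }
  m
}

sn_af   <- interpolate_af(sn, maf)
sn_bern <- interpolate_bernoulli(sn, maf)
```

The af fill is deterministic and can produce non-integer values, while the bernoulli fill keeps genotypes in 0, 1, or 2 but changes from run to run unless a seed is set.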

Cluster plots can be produced via PCA, tSNE, umap, or DAPC.

The PCA point coordinates are calculated using prcomp. By default, the first two principal components are plotted. A PC matrix will also be returned for easy plotting of other PCs.

tSNE coordinates are calculated via Rtsne, which should be consulted for more details about this method. Stated simply, tSNE attempts to compress a multi-dimensional PCA (PCs 1:n) into fewer dimensions while retaining as much information as possible. As such, a tSNE plot can be seen as a representation of many different PC axes compressed into a single two-dimensional plot. This compression process is stochastic, so plots will vary somewhat between runs, and multiple runs are recommended.

Uniform Manifold Approximation and Projection (UMAP) coordinates are calculated via umap. UMAP similarly attempts to reduce multi-dimensional results to a two-dimensional visualization.

DAPC instead clusters individuals into n groups, a number which by default is chosen interactively (again using a PCA framework).
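As a rough illustration of the PCA step, the sketch below normalizes a simulated genotype matrix in the style of Patterson et al. (2006) and passes it to prcomp. It is a minimal sketch under stated assumptions, not the code snpR runs: the simulated data and all object names (sn, norm, pcs, var_explained) are hypothetical, and the exact normalization used internally may differ. Each SNP is centered on its observed mean and scaled by sqrt(p * (1 - p)), and missing genotypes are then set to zero, which is why no interpolation is needed.

```r
# Simulated "sn"-style data (hypothetical): 20 samples by 50 SNPs,
# minor allele counts with a few genotypes set to missing.
set.seed(1)
n_samples <- 20
n_snps <- 50
p <- runif(n_snps, 0.1, 0.5)
sn <- sapply(p, function(pj) rbinom(n_samples, size = 2, prob = pj))
sn[sample(length(sn), 30)] <- NA

# Per-SNP mean and allele frequency from the observed genotypes.
mu <- colMeans(sn, na.rm = TRUE)
p_hat <- mu / 2

# Drop SNPs that are monomorphic (or entirely missing), since the
# scaling term would be zero or undefined for them.
keep <- !is.na(p_hat) & p_hat > 0 & p_hat < 1

# Center each SNP on its mean, scale by sqrt(p * (1 - p)),
# then set missing entries to zero (the post-centering mean).
norm <- sweep(sn[, keep], 2, mu[keep], "-")
norm <- sweep(norm, 2, sqrt(p_hat[keep] * (1 - p_hat[keep])), "/")
norm[is.na(norm)] <- 0

# The data are already centered and scaled, so prcomp is told not to redo it.
pca <- prcomp(norm, center = FALSE, scale. = FALSE)

# First two PCs per sample, and the proportion of variance per PC.
pcs <- as.data.frame(pca$x[, 1:2])
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The pcs data frame holds the coordinates that would be drawn on a two-dimensional PCA plot, and other columns of pca$x can be plotted the same way.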

Note that clusters and relative positions of samples from both tSNE and UMAP may not reliably represent the relationships present in the higher PCA dimensions from which they are created. As such, it is probably not wise to use these methods to draw conclusions about relationships. They are useful exploratory tools, however, and so are kept available here.

For more details on tSNE arguments, Rtsne should be consulted.

Additional arguments to umap can also be provided. More information on these arguments can be found in umap.defaults.

Data points for individuals can be automatically colored by any sample-level facet categories. Facets should be provided as described in Facets_in_snpR. Up to two different sample-level facets can be plotted simultaneously. If two facets are supplied, one facet will be noted by point shape and the other by color, as long as one has fewer than 6 total levels. By default the facet with more levels will be given shapes, which can be controlled using the shape_has_more_levels argument. If both have more than 6 levels, one will be noted by point fill and the other by point outline.

Value

A list containing:

  • data: Raw PCA, tSNE, umap, and/or DAPC plot data.

  • plots: ggplot PCA, tSNE, umap, and/or DAPC plots.

Each of these two lists may contain one to four objects, one for each PCA, tSNE, umap, or DAPC plot requested, named "pca", "tsne", "umap", and "dapc" respectively. If a PCA was run, the loadings will also be returned in the top-level list. If simplify_output is TRUE, only the ggplot list is returned.
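A short sketch of pulling the pieces out of the default return value, based only on the element names documented in this section. It assumes the snpR package and its bundled stickSNPs example data are installed. The requireNamespace guard and the object names res, pca_plot, and pca_data are additions for illustration, and the exact contents of the data element are not shown here.

```r
# Only run if snpR is available; the guard is an addition for illustration.
if (requireNamespace("snpR", quietly = TRUE)) {
  library(snpR)

  # Default output (simplify_output = FALSE): a list with data and plots.
  res <- plot_clusters(stickSNPs, facets = "pop", plot_type = "pca")

  # The ggplot for the PCA, stored under the documented name "pca".
  pca_plot <- res$plots$pca

  # The plot data behind it, useful for custom plotting of other PCs.
  pca_data <- res$data$pca

  # List the top-level elements that were returned.
  names(res)
}
```

Since pca_plot is a ggplot object, it can be extended with the usual ggplot2 layers and themes.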

Author(s)

William Hemstrom

Matt Thorstensen

References

Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne.

Van Der Maaten, L. & Hinton, G. (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.

McInnes, L. & Healy, J. (2018). UMAP: uniform manifold approximation and projection. Preprint at URL: https://arxiv.org/abs/1802.03426.

Jombart, T., Devillard, S. & Balloux, F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11, 94 (2010). https://doi.org/10.1186

See Also

mmtsne

umap

prcomp

dapc

Examples

# plot colored by population
plot_clusters(stickSNPs, "pop")

# plot colored by population and family
plot_clusters(stickSNPs, "pop.fam")

hemstrow/snpR documentation built on July 15, 2024, 7:14 p.m.