View source: R/plotting_functions.R
plot_clusters | R Documentation |
Generate a ggplot cluster plot based on PCA, the Barnes-Hut simulation at
theta>0 implemented in Rtsne, the Uniform Manifold Approximation and
Projection approach implemented in umap, or the Discriminant Analysis of
Principal Components implemented in dapc.
plot_clusters(
x,
facets = NULL,
plot_type = "pca",
check_duplicates = FALSE,
minimum_percent_coverage = FALSE,
minimum_genotype_percentage = FALSE,
smart_PCA = TRUE,
interpolation_method = "bernoulli",
dims = 2,
initial_dims = 50,
perplexity = FALSE,
theta = 0,
iter = 1000,
viridis.option = "viridis",
alt.palette = NULL,
ncp = NULL,
ncp.max = 5,
dapc_clustering_max_n_clust = 20,
dapc_clustering_npca = NULL,
dapc_clustering_nclust = NULL,
dapc_npca = NULL,
dapc_ndisc = NULL,
ellipse_size = 1.5,
seg_lines = TRUE,
shape_has_more_levels = TRUE,
update_bib = FALSE,
verbose = FALSE,
simplify_output = FALSE,
...
)
x |
snpRdata object. |
facets |
character, default NULL. Categorical sample-level metadata
variables by which to color points. Up to two different sample-specific
facets may be provided. See |
plot_type |
character, default "pca". Types of plots to be produced. Options: c("pca", "tSNE", "umap", "dapc").
See description for details. |
check_duplicates |
logical, default FALSE. Checks for any duplicated individuals, which will cause errors. Since these rarely exist and drastically slow down function run-time, this defaults to FALSE. |
minimum_percent_coverage |
numeric, default FALSE. Proportion of samples a SNP must be sequenced in to be used in generating plots. |
minimum_genotype_percentage |
numeric, default FALSE. Proportion of SNPs a sample must be sequenced at in order to be used in plots. |
smart_PCA |
logical, default TRUE. If TRUE, uses the centering approach of Patterson et al. (2006) prior to plot construction. Note that this also avoids the need for interpolation, so interpolation is set to FALSE in this case. |
interpolation_method |
character, default "bernoulli". Interpolation method to use for missing data. Options:
Ignored if |
dims |
numeric, default 2. Output dimensionality. |
initial_dims |
numeric, default 50. The number of dimensions retained in the initial PCA step during tSNE. |
perplexity |
numeric, default FALSE. Perplexity parameter, by default
found by |
theta |
numeric, default 0. Theta parameter from
|
iter |
numeric, default 1000. Number of tSNE iterations/umap epochs to perform. |
viridis.option |
character, default "viridis". Viridis color scale option
to use for significance lines and SNP labels. See
|
alt.palette |
character or NULL, default NULL. Optional palette of colors to use instead of the viridis palette. |
ncp |
numeric or NULL, default NULL. Number of components to consider for iPCA sn format interpolations of missing data. If NULL, the optimum number will be estimated, with the maximum specified by ncp.max. This can be very slow. |
ncp.max |
numeric, default 5. Maximum number of components to check for when determining the optimum number of components to use when interpolating sn data using the iPCA approach. |
dapc_clustering_max_n_clust |
numeric or NULL, default 20. If not NULL,
the clustering parameters for DAPC calculation will be selected
interactively, with |
dapc_clustering_npca |
numeric or NULL, default NULL. The number of PCs
to use for assigning individuals to clusters with DAPC. Interactive decision
is recommended using |
dapc_clustering_nclust |
numeric or NULL, default NULL. The number of
clusters to use for DAPC. Interactive decision is recommended using
|
dapc_npca |
numeric or NULL, default NULL. The number of PCs to use for
conducting the DAPC itself after assigning individuals to clusters.
Interactive decision is recommended using
|
dapc_ndisc |
numeric or NULL, default NULL. The number of discriminants
to use for conducting the DAPC itself after assigning individuals to
clusters. Interactive decision is recommended using
|
ellipse_size |
numeric or NULL, default 1.5. The scaled-size of the ellipse to use for DAPC. If NULL, no ellipses will be calculated or drawn. |
seg_lines |
logical, default TRUE. If TRUE, lines will be drawn between points and cluster centers when plotting with DAPC. |
shape_has_more_levels |
logical, default TRUE. If TRUE and two facets are requested, the facet with more levels will be plotted as shapes. If FALSE, the facet with fewer levels will be plotted as shapes. Ignored if the facet that would get shapes has more than 6 levels. |
update_bib |
character or FALSE, default FALSE. If a file path to an
existing .bib library or to a valid path for a new one, will update or
create a .bib file including any new citations for methods used. Useful
given that this function does not return a snpRdata object, so a
|
verbose |
logical, default FALSE. If TRUE, some progress updates may be reported. |
simplify_output |
logical, default FALSE. If TRUE, only the ggplot object will be returned. This is optimal, since the data is already returned in that object, but is not the default due to backwards compatibility with old code. Note, however, that PCA loadings will only be returned if this is true. |
... |
Other arguments, passed to |
Works by conversion to the "sn" format described in format_snps
with interpolated missing genotypes for all methods other than DAPC.
Cluster plots can be produced via PCA, tSNE, umap, or DAPC. The PCA point
coordinates are calculated using prcomp. By default, the first two principal
components are plotted. A PC matrix will also be returned for easy plotting
of other PCs. tSNE coordinates are calculated via Rtsne, which should be
consulted for more details about this method. Stated simply, tSNE attempts
to compress a multi-dimensional PCA (PCs 1:n) into fewer dimensions while
retaining as much information as possible. As such, a tSNE plot can be seen
as a representation of many different PC axes compressed into a single
two-dimensional plot. This compression process is stochastic, so plots will
vary somewhat between runs, and multiple runs are recommended. Uniform
Manifold Approximation and Projection (UMAP) coordinates are calculated via
umap. UMAP similarly attempts to reduce multi-dimensional results to a
two-dimensional visualization. DAPC instead clusters individuals into n
groups, a number which by default is chosen interactively (again using a
PCA framework).
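To illustrate the PCA step described above, the following is a minimal sketch using only base R. It is not the internal snpR code: the genotype matrix, population labels, and all object names (geno, pop, pc_matrix, plot_data) are simulated or hypothetical, and no missing-data interpolation or smart PCA centering is applied.

```r
# Simulated genotypes in an "sn"-like layout: rows are individuals,
# columns are SNPs, entries are allele counts (0, 1, or 2).
# Everything here is made up for illustration only.
set.seed(42)
n_per_pop <- 10
n_snp <- 50
pop <- rep(c("A", "B"), each = n_per_pop)

# Each hypothetical population gets its own allele frequency per SNP.
freq_A <- runif(n_snp, 0.1, 0.9)
freq_B <- runif(n_snp, 0.1, 0.9)
geno <- rbind(
  matrix(rbinom(n_per_pop * n_snp, 2, rep(freq_A, each = n_per_pop)),
         nrow = n_per_pop),
  matrix(rbinom(n_per_pop * n_snp, 2, rep(freq_B, each = n_per_pop)),
         nrow = n_per_pop)
)

# PCA via prcomp; pca$x holds one row of PC coordinates per individual.
pca <- prcomp(geno, center = TRUE)
pc_matrix <- pca$x

# The first two PCs plus a sample-level label are what a cluster plot
# colored by population would draw.
plot_data <- data.frame(PC1 = pc_matrix[, 1],
                        PC2 = pc_matrix[, 2],
                        pop = pop)
head(plot_data)
```

Other columns of pc_matrix can be plotted in the same way to inspect PCs beyond the first two.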
Note that clusters and relative positions of samples from both tSNE and UMAP may not reliably represent the relationships present in the higher PCA dimensions from which they are created. As such, it is probably not wise to use these methods to draw conclusions about relationships. They are useful exploratory tools, however, and so are kept available here.
For more details on tSNE arguments, Rtsne should be consulted.
Additional arguments to UMAP can also be provided. Additional information on
these arguments can be found in umap.defaults.
Data points for individuals can be automatically colored by any sample-level
facet categories. Facets should be provided as described in Facets_in_snpR.
Up to two different sample-level facets can be automatically plotted
simultaneously. If two facets are supplied, one will be noted by point shape
and the other by color (by default the facet with more levels will be given
shapes, behavior that can be controlled using the shape_has_more_levels
argument), as long as one has fewer than 6 total levels. If both have more
than 6 levels, one will be noted by point fill and the other by point outline.
A list containing:
data: Raw PCA, tSNE, umap, and/or DAPC plot data.
plots: ggplot PCA, tSNE, umap, and/or DAPC plots.
Each of these two lists may contain one to four objects, one for each PCA,
tSNE, umap, or DAPC plot requested, named "pca", "tsne", "umap", and "dapc",
respectively. If a PCA was run, the loadings will also be returned in the
top-level list. If simplify_output is TRUE, only the ggplot list is returned.
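As a hedged sketch of how the returned list might be inspected, the following assumes the snpR package and its stickSNPs example dataset are installed, and that the element names match the description above ("data" and "plots", each with a "pca" entry for the default plot type). The names has_snpR and res are hypothetical.

```r
# Only run if snpR is available; has_snpR is a hypothetical helper flag.
has_snpR <- requireNamespace("snpR", quietly = TRUE)

if (has_snpR) {
  library(snpR)

  # Default plot_type = "pca" and simplify_output = FALSE, so a list
  # with plot data and ggplot objects is expected per the text above.
  res <- plot_clusters(stickSNPs, "pop")

  print(names(res))          # top-level elements of the returned list
  print(res$plots$pca)       # the PCA ggplot, assuming it is named "pca"
  print(head(res$data$pca))  # the plot data behind it, same assumption
}
```

Because the plots are ggplot objects, they can be modified further with the usual ggplot2 layers before printing.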
William Hemstrom
Matt Thorstensen
Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne.
Van Der Maaten, L. & Hinton, G. (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.
McInnes, L. & Healy (2018). UMAP: uniform manifold approximation and projection. Preprint at URL: https://arxiv.org/abs/1802.03426.
Jombart, T., Devillard, S. & Balloux, F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11, 94 (2010). https://doi.org/10.1186
mmtsne
umap
prcomp
dapc
# plot colored by population
plot_clusters(stickSNPs, "pop")
# plot colored by population and family
plot_clusters(stickSNPs, "pop.fam")