plot_structure: Create STRUCTURE-like cluster plots

View source: R/plotting_functions.R

plot_structureR Documentation

Create STRUCTURE-like cluster plots

Description

Creates ggplot-based stacked bar charts of assignment probabilities (Q) into an arbitrary 'k' number of clusters like those produced by the program STRUCTURE. Files containing prior results can be parsed and plotted, or new results can be generated via a suite of different methods. Many k values and reps can be run at once, and results can be collapsed across reps using CLUMPP.

Usage

plot_structure(
  x,
  facet = NULL,
  facet.order = NULL,
  k = 2,
  method = "snmf",
  reps = 1,
  update_bib = FALSE,
  iterations = 1000,
  burnin = 100,
  I = NULL,
  alpha = 5,
  qsort = "last",
  qsort_K = "last",
  clumpp = TRUE,
  clumpp_path = "/usr/bin/CLUMPP.exe",
  clumpp.opt = "greedy",
  structure_path = "/usr/bin/structure",
  admixture_path = "/usr/bin/admixture",
  admixture_cv = 5,
  ID = NULL,
  viridis.option = "viridis",
  alt.palette = NULL,
  t.sizes = c(12, 12, 12),
  separator_thickness = 1,
  separator_color = "white",
  no_admix = FALSE,
  use_pop_info = FALSE,
  loc_prior = FALSE,
  correlated_frequencies = TRUE,
  infer_alpha = TRUE,
  separate_pop_alphas = FALSE,
  infer_lambda = FALSE,
  infer_pop_specific_lambda = FALSE,
  lambda = 1,
  f_prior_mean = 0.01,
  f_prior_sd = 0.05,
  uniform_alpha_prior = TRUE,
  alpha_max = 10,
  alpha_prior_a = 1,
  alpha_prior_b = 2,
  gens_back = 2,
  mig_prior = 0.01,
  locprior_init_r = 1,
  locprior_max_r = 20,
  alpha_prop_sd = 0.025,
  start_at_pop_info = FALSE,
  metro_update_freq = 10,
  seed = sample(1e+05, 1),
  strip_col_names = NULL,
  cleanup = TRUE,
  ...
)

Arguments

x

snpRdata object, list of Q matrices (sorted by K in the first level and run in the second), or a character string designating a pattern matching Q matrix files in the current working directories.

facet

character, default NULL. If provided, individuals will not be noted on the x axis. Instead, the levels of the facet will be noted. Only a single, simple, sample specific facet may be provided. Individuals must be sorted by this facet in x. If Q matrices are provided (either directly or via file path), this should instead be a vector of group identities for each individual (populations, etc.).

facet.order

character, default NULL. Optional order in which the levels of the provided facet should appear on the plot, left to right.

k

numeric vector, default 2. The k value (number of clusters) for which to run the clustering/assignment algorithm. Each provided k value will be run.

method

character, default "snmf". The clustering/assignment method to run. Options:

  • snmf: sNMF (sparse Non-Negative Matrix Factorization). See main_sNMF.

  • snapclust: Maximum-likelihood genetic clustering. See snapclust.choose.k.

  • admixture: The ADMIXTURE program. Requires a local admixture executable, and thus cannot run on a Windows platform.

  • structure: The STRUCTURE program. Requires a local STRUCTURE executable. many additional options are available for STRUCTURE via other arguments.

reps

numeric, default 1. The number of independent clustering repetitions to run.

update_bib

character or FALSE, default FALSE. If a file path to an existing .bib library or to a valid path for a new one, will update or create a .bib file including any new citations for methods used. Useful given that this function does not return a snpRdata object, so a citations cannot be used to fetch references.

iterations

numeric or Inf, default 1000. For snapclust, the maximum number of iterations to run. For STRUCTURE the number of MCMC steps, should be in the 10,000+ range.

burnin

numeric, default 100. For STRUCTURE, the number of burnin iterations to do prior to the main run. This should usually be in the 10,000+ range.

I

numeric or NULL, default NULL. For snmf, how many SNPs should be used to initialize the search? Initializing with a subset of the total SNPs can radically speed up computation time for large datasets.

alpha

numeric, default 10. If method = "sNMF", determines the regularization parameter. For small datasets, this can have a large effect, and should probably be larger than the default. See documentation for main_sNMF. If method = "structure", changes the ALPHA flag, which determines the degree of admixture. If infer_alpha = TRUE, instead sets the starting point for the alpha inference.

qsort

character, numeric, or FALSE, default "last". Determines if individuals should be sorted (possibly within facet levels) by cluster assignment proportion. If not FALSE, determines which cluster to use for sorting (1:k). If "last" or "first" sorts by those clusters.

qsort_K

numeric or character, default "last". If qsorting is performed, determines the reference k value by which individuals are sorted. If "first" or "last", sorts by k = 2 or k = k, respectively.

clumpp

logical, default T. Specifies if CUMPP should be run to collapse results across multiple reps. If FALSE, will use only the first rep for plotting.

clumpp_path

character, default "/usr/bin/CLUMPP.exe". Path to the clumpp executable, required if clumpp = T.

clumpp.opt

character, default "greedy". Designates the CLUMPP method to use. Options:

  • fullsearch: Search all possible configurations. Slow.

  • greedy: The standard approach. Slow for large datasets at high k values.

  • large.k.greedy: A fast but less accurate approach.

See CLUMPP documentation for details.

structure_path

character, default "/usr/bin/structure". Path to the STRUCTURE executable, required if method = "structure".

admixture_path

character, default "/usr/bin/admixture". Path to the admixture executable, required if method = "admixture".

admixture_cv

numeric, default 5. Fold to use for cross-validation for admixture, used to determine the optimum k.

ID

character or NULL, default NULL. Designates a column in the sample metadata containing sample IDs.

viridis.option

character, default "viridis". Viridis color scale option. See scale_gradient for details.

alt.palette

character or NULL, default NULL. Optional palette of colors to use instead of the viridis palette.

t.sizes

numeric, default c(12, 12, 12). Text sizes, given as c(strip.title, axis, axis.ticks).

separator_thickness

numeric, default 1. Thickness of facet level separator lines. If 0, no separators drawn. Since separators currently overlap with samples somewhat, this may be desirable.

separator_color

character, default "white". Color of facet level separator lines.

no_admix

logical, default FALSE. Used if method = "structure". If TRUE, the NOADMIX flag in STRUCTURE will be set to 1, meaning that no admixture will be assumed between clusters.

use_pop_info

logical, default FALSE. Used if method = "structure". If TRUE, the USEPOPINFO flag in STRUCTURE will be set to 1, meaning that individuals are assumed to come from the populations that they have been assigned in the facet provided, and the migrant status of individuals and their parents, grandparents, etc (going back n generations according to the gens_back argument) will be returned instead of ancestry proportions. The resulting plot will not be a typical structure plot.

loc_prior

logical, default FALSE. Used if method = "structure". If TRUE, the LOCPRIOR flag in STRUCTURE will be set to 1. This will place a strong population prior on samples according to the facet provided. Useful with weak data when the population info is known to be robust.

correlated_frequencies

logical, default TRUE. Used if method = "structure". If TRUE, the FREQSCORR flag in STRUCTURE will be set to 1. This assumes allele frequencies are correlated between populations. Usually true when populations have some degree of admixture.

infer_alpha

logical, default TRUE. Used if method = "structure". If TRUE, the INFERALPHA flag in STRUCTURE will be set to 1, allowing the optimum alpha to be inferred. Large alpha values imply that most individuals are admixed.

separate_pop_alphas

logical, default FALSE. Used if method = "structure". If TRUE, the POPALPHAS flag in STRUCTURE will be set to 1, allowing populations to have different alpha values. Usually not recommended.

infer_lambda

logical, default FALSE. Used if method = "structure". If TRUE, the INFERLAMBDA flag in STRUCTURE will be set to 1, allowing the optimum lambda to be inferred. Smaller values imply that most alleles have either very low or very high frequencies. Not usually recommended.

infer_pop_specific_lambda

logical, default FALSE. Used if method = "structure". If TRUE, the POPSPECIFICLAMBDA flag in STRUCTURE will be set to 1, allowing for different pops to have different lambda values.

lambda

numeric, default 1. Used if method = "structure". Changes the LAMBDA flag. Smaller values imply that most alleles have either very low or very high frequencies. Used if method = "structure". The default works well in most cases.

f_prior_mean

numeric, default 0.01. Used if method = "structure". Changes the FPRIORMEAN flag. F values for each k cluster are drawn from a gamma prior with this mean. The default value tends to perform well for detecting small amounts of structure, although it can slightly overestimate k.

f_prior_sd

numeric, default 0.05. Used if method = "structure". Changes the FPRIORSD flag. F values for each k cluster are drawn from a gamma prior with this sd. The default value tends to perform well for detecting small amounts of structure, although it can slightly overestimate k.

uniform_alpha_prior

logical, default TRUE. Used if method = "structure". If TRUE, the UNIFPRIORALPHA flag in STRUCTURE will be set to 1, thus using a uniform prior for alpha between 0 and alpha_max. Usually works well. If FALSE, uses a gamma prior with mean alpha_prior_a * alpha_prior_b and variance alpha_prior_a*alpha_prior_b^2.

alpha_max

numeric, default 10. Used if method = "structure". Changes the ALPHAMAX flag. If uniform_alpha_prior is TRUE, alpha will be drawn from a uniform prior for alpha between 0 and alpha_max.

alpha_prior_a

numeric, default 1. Used if method = "structure". Changes the ALPHAPRIORA flag. If uniform_alpha_prior is FALSE, alpha will be drawn from a gamma prior with mean alpha_prior_a * alpha_prior_b and variance alpha_prior_a*alpha_prior_b^2.

alpha_prior_b

numeric, default 2. Used if method = "structure". Changes the ALPHAPRIORA flag. If uniform_alpha_prior is FALSE, alpha will be drawn from a gamma prior with mean alpha_prior_a * alpha_prior_b and variance alpha_prior_a*alpha_prior_b^2.

gens_back

numeric, default 2. Used if method = "structure". Changes the GENSBACK flag. If use_pop_info is TRUE, migration probabilities for individuals will be determined for the individual themselves plus gens_back generations prior (parents, grandparents, etc).

mig_prior

numeric, default 0.01. Used if method = "structure". Changes the MIGRPRIOR flag. Changes the prior value for the migration rate hyperparameter. Values between 0.001 and 0.01 are usually reasonable.

locprior_init_r

numeric, default 1. Used if method = "structure". Changes the LOCPRIORINIT flag. Sets the initial strength of the location prior (r). The default is often reasonable.

locprior_max_r

numeric, default 20. Used if method = "structure". Changes the MAXLOCPRIOR flag. Sets the maximum value of the location prior (r). The minimum is always 0. The default is usually reasonable.

alpha_prop_sd

numeric, default 0.025. Used if method = "structure". Changes the ALPHAPROPSD flag. Changes the sd of the Metropolis-Hastings alpha drawn during update steps.

start_at_pop_info

logical, default FALSE. If TRUE, the STARTATPOPINFO flag in STRUCTURE will be set to 1, in which case the clusters will start at populations during the Metropolis-Hastings search rather than randomly. Requires a defined facet.

metro_update_freq

numeric, default 10. Used if method = "structure". Changes the METROFREQ flag. Sets the rate at which Metropolis-Hastings updates are used. If 0, updates are never used.

seed

integer, default sample(100000, 1). Used if method = "structure". Starting seed for analysis runs. Each additional run (k value or rep) will use a successive seed.

strip_col_names

string, default NULL. An optional regular expression indicating a way to process the column names prior to plotting. Parts of names matching the strings provided will be cut. Useful for when the facet argument is something like "meta$pop". A vector of strings will strip multiple patterns.

cleanup

logical, default TRUE. If TRUE, extra files created during assignment, clumpp, and plot construction will be removed. If FALSE, they will be left in the working directory.

...

additional arguments passed to either main_sNMF or snapclust.choose.k.

Details

Individual cluster assignment probabilities can be calculated using several different methods:

  • snmf: sNMF (sparse Non-Negative Matrix Factorization).

  • snapclust: Maximum-likelihood genetic clustering.

  • admixture: The ADMIXTURE program. Requires a local admixture executable, and thus cannot run on a Windows platform.

  • structure: The STRUCTURE program. Requires a local STRUCTURE executable. many additional options are available for STRUCTURE via other arguments.

These methods are not re-implemented in R, instead, this function calls the main_sNMF, snapclust.choose.k, or a local executable for the ADMIXTURE or STRUCTURE program instead. Please cite the references noted in those functions. For snapclust, the "ward" method is used to initialize clusters if one rep is requested, otherwise the clusters are started randomly each rep. Other methods can be used by providing pop.ini as an additional argument as long as only one rep is requested. Note that at this moment, the snapclust method /emphis not recommended for use by the adegenet package maintainers.

Multiple different runs can be conducted using the 'reps' argument, and the results can be combined for plotting across all of these reps using the clumpp option. This option calls the CLUMPP software package in order to combine proportion population membership across multiple runs via clumppExport from the pophelper package. Again, please cite both CLUMPP and pophelper if using this option.

Since CLUMPP is run independently for each value of K, cluster identities often "flip" between K values. For example individuals that are grouped into cluster 1 and K = 3 may be grouped into cluster 2 at K = 4. To adjust this, cluster IDs are iteratively adjusted across K values by flipping IDs such that the euclidean distances between clusters at K and K - 1 are minimized. This tends to produce consistent cluster IDs across multiple runs of K.

Individuals can be sorted into by membership proportion into different clusters within populations using the qsort option.

Since the clustering and CLUMPP processes can be time consuming and outside tools (such as NGSadmix or fastSTRUCTURE) may be preferred, a nested list of Q matrices, sorted by K and then rep or a character string giving a pattern matching saved Q matrix files in the current working directory may provided directly instead of a snpRdata object. Note that the output for this function, if run on a snpRdata object, will return a properly formatted list of Q files (named 'data') in addition to the plot and plot data. This allows for the plot to be quickly re-constructed using different sorting parameters or facets. In these cases, the facet argument should instead be a vector of group identifications per individuals.

Note that several files will be created in the working directory when using this function that are not automatically cleaned after use.

Value

A list containing:

  • plot: A ggplot object.

  • data: A nested list of the raw Q matrices, organized by K and then by run.

  • plot_data: The raw data used in constructing the ggplot.

  • K_plot: A data.frame containing the value suggested for use in K selection vs K value for the selected method.

Author(s)

William Hemstrom

References

Frichot E, Mathieu F, Trouillon T, Bouchard G, Francois O. (2014). Fast and Efficient Estimation of Individual Ancestry Coefficients. Genetics, 194(4): 973–983.

Frichot, Eric, and Olivier François (2015). LEA: an R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8): 925-929.

Beugin, M. P., Gayet, T., Pontier, D., Devillard, S., & Jombart, T. (2018). A fast likelihood solution to the genetic clustering problem. Methods in ecology and evolution, 9(4), 1006-1016.

Francis, R. M. (2017). pophelper: an R package and web app to analyze and visualize population structure. Molecular ecology resources, 17(1), 27-32.

Jakobsson, M., & Rosenberg, N. A. (2007). CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23(14), 1801-1806.

Examples

## Not run: 
# basic sNMF, k = 2 and 3
plot_structure(stickSNPs, "pop", k = 2:3, clumpp = FALSE)

# basic snapclust
plot_structure(stickSNPs, "pop", k = 2:3, clumpp = FALSE, method = "snapclust")

## End(Not run)

hemstrow/snpR documentation built on March 20, 2024, 7:03 a.m.