calc_spdfs: Run GenePlot directional analysis of the connectivity of two...
In lfmcmillan/geneplot: Genetic Assignment and Plotting

calc_spdfs

R Documentation

Run GenePlot directional analysis of the connectivity of two populations.

Description

This function calculates three measures of genetic differentiation between populations that relate to assignment probabilities. It also calculates the values for the auxiliary curves in the extended GenePlot (see also `extended_geneplot`).

Usage

calc_spdfs(
  dat = NULL,
  refpopnames,
  locnames = NULL,
  includepopnames = NULL,
  allele_freqs = NULL,
  display_names = refpopnames,
  plot_spdfs = FALSE,
  difference_threshold = 0,
  show_quantiles = TRUE,
  quantiles_vec = c(0.01, 1),
  prior = "Rannala",
  logten = T,
  leave_one_out = T,
  calc_positive_stats = TRUE,
  calc_details = FALSE,
  calc_vecs = TRUE,
  output_as_vectors = FALSE,
  rel_tol = NULL,
  abs_tol = NULL,
  npts = 1000,
  only_plot_baseline_pop = FALSE,
  show_statistics_on_plot = TRUE,
  line_cols = NULL,
  line_widths = NULL,
  title_text = NULL,
  title_text_difference = NULL,
  ISP_xlim = NULL,
  ISP_ylim = NULL,
  HAP_xlim = NULL,
  HAP_ylim = NULL,
  ISP_legend_xy = NULL,
  HAP_legend_xy = NULL,
  axis_labels = TRUE
)

Arguments

`dat`	The data, in a data frame, with two columns labelled as 'id' and 'pop', and with two additional columns per locus. Missing data at any locus should be marked as '0' for each allele. The locus columns must be labelled in the format Loc1.a1, Loc1.a2, Loc2.a1, Loc2.a2, etc. Missing data must be for BOTH alleles at any locus. See `read_genepop_format` for details of how to import Genepop format data files into the appropriate format. The user must supply either an input data frame `dat` and a vector of loci names `locnames`, or a list of allele frequencies at all relevant loci for all the populations, `allele_freqs`. The `allele_freqs` list can be obtained as an attribute of the output of GenePlot `logprob` calculations, and may cover more populations than the two used in this SPDF analysis (the SPDF analysis populations are indicated in the argument `refpopnames`).
`refpopnames`	Character vector of reference population names, that must match two values in the 'pop' column of `dat`. The SPDF methods currently only work for a pair of baseline populations, so `refpopnames` must be length 2.
`locnames`	Character vector, names of the loci, which must match the column names in the data. For example, if `dat` has columns id, pop, EV1.a1, EV1.a2, EV14.a1, EV14.a2, etc., then you could use `locnames = c("EV1","EV14")`, etc. The locnames do not need to be in any particular order, but all of them must be in `dat`.
`includepopnames`	Character vector (default NULL) of population names to be included in the calculations as comparison populations. The reference populations are automatically used as comparison populations for each other, but you can also add additional comparison populations using `includepopnames`. For example, if the reference pops are Pop1 and Pop2, and you have some new individuals which you have labelled as PopNew, then use `includepopnames=c("PopNew")` to compare those individuals to Pop1 and Pop2. You can specify the populations in any order, provided that they are all in `dat`.
`allele_freqs`	(default=NULL) Alternative input format, which you can supply instead of `dat` and `locnames`. You can calculate the allele frequencies object using the `calc_logprob` function.
`display_names`	(default=refpopnames) Use this to supply alternative display names for the populations. The refpopnames, as columns in the dataset, cannot have spaces, for example, whereas the display names can have spaces.
`plot_spdfs`	(default=FALSE) If true, display plots of the genetic distributions as part of the output. Not needed if calculating the values for an extended GenePlot. The first two plots have the first reference pop as the baseline, then the last two have the second reference pop as the baseline. Within each pair, one plot shows the baseline population genetic distribution and the distribution of the comparison population relative to it, and the second shows the distribution of the differences between fit to the baseline and the comparison, for all genotypes that could arise from the baseline population. In each "differences" plot, the values above 0 are from genotypes that have a better fit to their own population, the baseline, than to the comparison population, and the values below 0 are from genotypes that have a better fit to the comparison population than their own baseline population.
`difference_threshold`	(default=0) When this is zero, the Home Assignment Probability is the probability of a random individual from the baseline B having a better fit to B than A. If you want to instead calculate the probability that the individual from B has 10x better fit to B than A, then set `difference_threshold` to 1, because log10(10) = 1 (for `logten = TRUE`) and the probabilities are on a log scale. Positive values make the measure calculation more conservative, negative values make the measure calculation less conservative. Only applies to the Home Assignment Probability, not the other measures.
`show_quantiles`	(default=TRUE) If TRUE, show quantiles on the distribution plots as vertical lines. For example, on the distribution of the baseline population, if one of the quantiles is 0.01 (1%), the vertical line will show that 1% of individuals from the population will have a worse fit than the quantile value and 99% will have a better fit. Ignored if `plot_spdfs = FALSE`.
`quantiles_vec`	(default=c(0.01, 1)) Specify which quantiles to show on the distribution plots, as a vector of numbers between 0 and 1. They do not have to be ordered.
`prior`	(default="Rannala") String, either "Rannala" or "Baudouin", giving the choice of prior parameter for the Dirichlet priors for the allele frequency estimates. Both options define parameter values that depend on the number of alleles at each locus, k. "Baudouin" gives slightly more weight to rare alleles than "Rannala" does, or less weight to the data, so Baudouin may be more suitable for small reference samples, but there is no major difference between them. For more details, see McMillan and Fewster (2017), Biometrics. Additional options are "Half" or "Quarter" which specify parameters 1/2 or 1/4, respectively. These options have priors whose parameters do not depend on the number of alleles at each locus, and so may be more suitable for microsatellite data with varying numbers of alleles at each locus.
`logten`	(default TRUE) Boolean, indicates whether to use base 10 for the logarithms, or base e (i.e. natural logarithms). logten=TRUE is default because it's easier to recalculate the original non-log numbers in your head when looking at the plots. Use FALSE for natural logarithms.
`leave_one_out`	Boolean (default TRUE), indicates whether or not to calculate leave-one-out results for any individual from the reference pops. If TRUE, any individual from a reference population will have their Log-Genotype-Probability with respect to their own reference population calculated after temporarily removing the individual's genotype from the sample data for that reference population. The individual's Log-Genotype-Probabilities with respect to all populations they are not a member of will be calculated as normal. We STRONGLY RECOMMEND using `leave_one_out=TRUE` for any small reference samples (<30).
`calc_positive_stats`	(default=TRUE) Which form of the genetic measures to calculate. If TRUE, the Incumbent Selection Probability calculates the probability that for two random individuals from baseline B and comparison A, the one from B has the best fit to B. The Home Assignment Probability calculates the probability that a random individual from baseline B has a better fit to B than A. If FALSE, the Incumbent Selection Probability becomes instead the Interloper Selection Probability, i.e. the probability that the individual from A has a better fit to B than the individual from B. And the Home Assignment Probability becomes the Away Assignment Probability, i.e. the probability that the individual from B has a better fit to A than to B.
`calc_details`	(default=FALSE) If TRUE, the function displays additional statistics relating to the genetic distribution of each of the reference populations.
`calc_vecs`	(default=TRUE) If TRUE, the function calculates the three measures in both directions, and calculates the full distribution curves required for the extended GenePlot. Therefore leave this as the default (TRUE) if using this function prior to plotting the extended GenePlot.
`output_as_vectors`	(default=FALSE) By default, the output measures for the two reference populations and any additional included populations are collated into a matrix of values, where each column represents one of the reference populations as the baseline, and the comparison populations are in the rows. If `output_as_vectors` is TRUE, the outputs are instead in vector format, where the columns indicate the baseline and comparison populations for each value (e.g. PopB.PopA is the value for baseline B and comparison A).
`rel_tol`	(default=NULL) Specify the relative tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. The default value corresponds to the `integrate` default i.e. `.Machine$double.eps^0.25`.
`abs_tol`	(default=NULL) Specify the absolute tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. The default value corresponds to the `integrate` default i.e. `.Machine$double.eps^0.25`.
`npts`	(default=1000) Number of values to use when calculating numerical integrals (for the overlap area measures, and for the normalization constants of the distributions). Increasing this value will increase the precision of the numerical integrals but will also increase the computational cost. Reducing this below 1000 may save some computation time if you are not too concerned with the precision of the results.
`only_plot_baseline_pop`	(default=FALSE) Ignored if `plot_spdfs=FALSE`. By default, the first and third plots in the output show the genetic distribution of the baseline pop, and also the genetic distribution of the comparison pop against the baseline pop. If TRUE, only plot the distribution of the baseline pop.
`show_statistics_on_plot`	(default=TRUE) Ignored if `plot_spdfs=FALSE`. If TRUE, state the incumbent selection probabilities and home assignment probabilities beneath their respective plots.
`line_cols`	(default=NULL) Ignored if `plot_spdfs=FALSE`. Vector of line colours for the populations, starting with the two reference populations in the same order as in `refpopnames`. You can use any R colour specification, including named colours.
`line_widths`	(default=NULL) Ignored if `plot_spdfs=FALSE`. Vector of line widths for the populations, starting with the two reference populations in the same order as in `refpopnames`. Standard R line widths.
`title_text`	(default=NULL) Ignored if `plot_spdfs=FALSE`. Title text for the Incumbent Selection Probability plots.
`title_text_difference`	(default=NULL) Ignored if `plot_spdfs=FALSE`. Title text for the Home Assignment Probability plots.
`ISP_xlim`	(default=NULL) Ignored if `plot_spdfs=FALSE`. x-axis limits for the Incumbent Selection Probability plots.
`ISP_ylim`	(default=NULL) Ignored if `plot_spdfs=FALSE`. y-axis limits for the Incumbent Selection Probability plots.
`HAP_xlim`	(default=NULL) Ignored if `plot_spdfs=FALSE`. x-axis limits for the Home Assignment Probability plots.
`HAP_ylim`	(default=NULL) Ignored if `plot_spdfs=FALSE`. y-axis limits for the Home Assignment Probability plots.
`ISP_legend_xy`	(default=NULL) Ignored if `plot_spdfs=FALSE`. x-y position for the legend in Incumbent Selection Probability plots.
`HAP_legend_xy`	(default=NULL) Ignored if `plot_spdfs=FALSE`. x-y position for the legend in Home Assignment Probability plots.
`axis_labels`	(default=TRUE) Ignored if `plot_spdfs=FALSE`. If TRUE, include axis labels on all the SPDF plots.
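
As a rough illustration of the `dat` layout described in the arguments above, the following base R sketch builds a toy data frame and checks the two documented constraints (two columns per locus, and missing data marked as 0 for both alleles). The population names, locus names, allele codes and the helper `check_dat` are invented for this example and are not part of the package. The final commented line only mirrors the Usage signature and is not run here.

```r
# Hypothetical toy data in the layout described for `dat`:
# columns 'id' and 'pop', then two columns per locus (<locus>.a1, <locus>.a2).
set.seed(1)
n_per_pop <- 5
locnames <- c("Loc1", "Loc2")  # assumed locus names

dat <- data.frame(
  id  = paste0("ind", seq_len(2 * n_per_pop)),
  pop = rep(c("Pop1", "Pop2"), each = n_per_pop),
  stringsAsFactors = FALSE
)
for (loc in locnames) {
  # Allele codes 101..104 are arbitrary; 0 is reserved for missing data.
  dat[[paste0(loc, ".a1")]] <- sample(101:104, nrow(dat), replace = TRUE)
  dat[[paste0(loc, ".a2")]] <- sample(101:104, nrow(dat), replace = TRUE)
}

# Missing data must be marked as 0 for BOTH alleles at a locus.
dat[3, c("Loc2.a1", "Loc2.a2")] <- 0

# Illustrative helper (not a package function): checks that every locus has
# both allele columns and that missingness is never recorded for one allele only.
check_dat <- function(dat, locnames) {
  cols <- as.vector(rbind(paste0(locnames, ".a1"), paste0(locnames, ".a2")))
  has_cols <- all(c("id", "pop", cols) %in% names(dat))
  paired_missing <- all(vapply(locnames, function(loc) {
    a1 <- dat[[paste0(loc, ".a1")]]
    a2 <- dat[[paste0(loc, ".a2")]]
    all((a1 == 0) == (a2 == 0))
  }, logical(1)))
  has_cols && paired_missing
}
dat_ok <- check_dat(dat, locnames)

# With the package installed, the call would then look like (not run):
# res <- calc_spdfs(dat, refpopnames = c("Pop1", "Pop2"), locnames = locnames)
```

Running a check like this before the analysis is one way to catch half-missing loci early, since the documentation requires missing data to cover both alleles.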

Details

All three measures are directional, meaning that for two populations, A and B, there is an A to B value and a separate B to A value. The function calculates all three measures in both directions, i.e. with each of the reference populations as the baseline in turn.

1) Overlap Area (OA): the area of overlap between the two auxiliary curves in one side of the extended GenePlot.

2) Incumbent Selection Probability (ISP): what is the probability that if you take a random individual from baseline population B and another from comparison population A, the individual from B has a better fit to B than the individual from A has to population B? As an analogy, if you asked a random Dutch child and a random English child to take an English language test, what is the probability that the English child gets a higher mark than the Dutch child? (England is the baseline population B in this analogy.)

3) Home Assignment Probability (HAP): what is the probability that if you take a random individual from baseline population B, the individual has a better fit to population B than to population A? As an analogy, if you pick a random English child and give them an English test and a Dutch test, what is the probability that they will do better at the English test? (England is again the baseline population in this analogy.)
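
The ISP and HAP definitions above can be illustrated with a small Monte Carlo simulation. This is only a sketch under simplifying assumptions (known allele frequencies, Hardy-Weinberg proportions, independent loci, invented frequencies). It is not how `calc_spdfs` computes the measures, which uses saddlepoint approximations rather than simulation. The function names `sim_genotypes`, `log_geno_prob` and `hap` are hypothetical.

```r
# Monte Carlo sketch of ISP and HAP for baseline B and comparison A.
# Assumptions: known allele frequencies, Hardy-Weinberg proportions,
# independent loci. All frequencies below are invented.
set.seed(42)
freqs_B <- list(c(0.7, 0.2, 0.1), c(0.6, 0.4), c(0.5, 0.3, 0.2))
freqs_A <- list(c(0.1, 0.3, 0.6), c(0.3, 0.7), c(0.2, 0.2, 0.6))

# Draw n diploid genotypes: one n x 2 matrix of allele indices per locus.
sim_genotypes <- function(freqs, n) {
  lapply(freqs, function(p) {
    matrix(sample(seq_along(p), 2 * n, replace = TRUE, prob = p), ncol = 2)
  })
}

# log10 genotype probability of each simulated individual under `freqs`.
log_geno_prob <- function(genos, freqs) {
  per_locus <- mapply(function(g, p) {
    het <- g[, 1] != g[, 2]
    log10(p[g[, 1]] * p[g[, 2]] * ifelse(het, 2, 1))
  }, genos, freqs)
  rowSums(per_locus)
}

n <- 20000
from_B <- sim_genotypes(freqs_B, n)
from_A <- sim_genotypes(freqs_A, n)

fit_B_to_B <- log_geno_prob(from_B, freqs_B)  # individuals from B, fit to B
fit_B_to_A <- log_geno_prob(from_B, freqs_A)  # individuals from B, fit to A
fit_A_to_B <- log_geno_prob(from_A, freqs_B)  # individuals from A, fit to B

# ISP: a random individual from B fits B better than a random individual from A does.
isp <- mean(fit_B_to_B > fit_A_to_B)

# HAP: a random individual from B fits B better than it fits A, by more than
# `difference_threshold` on the log10 scale (1 means "10x better").
hap <- function(difference_threshold = 0) {
  mean(fit_B_to_B - fit_B_to_A > difference_threshold)
}
hap0 <- hap(0)
hap1 <- hap(1)
```

Because a positive threshold demands a larger margin in favour of the baseline, `hap(1)` can never exceed `hap(0)`, which matches the description of positive `difference_threshold` values as more conservative.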

The three measures are based on saddlepoint approximations to various distributions relating to the fit of potential individual genotypes to two candidate source populations. "spdf" stands for Saddlepoint Probability Density Function, because the values calculated are based on probability density functions.
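
To give a feel for what a saddlepoint probability density function is, here is a minimal sketch of the textbook saddlepoint density approximation applied to the sum over loci of log10 genotype probabilities. It assumes known (invented) allele frequencies, Hardy-Weinberg proportions and independent loci, and the names `locus_dist`, `cgf` and `spdf` are hypothetical. The package's own calculation may differ in its details.

```r
# Saddlepoint density sketch for the sum over loci of log10 genotype probabilities.
# Assumptions: known allele frequencies (invented below), Hardy-Weinberg
# proportions, independent loci.
freqs <- list(c(0.7, 0.2, 0.1), c(0.6, 0.4), c(0.5, 0.3, 0.2), c(0.4, 0.35, 0.25))

# Per locus: probabilities of the unordered genotypes and their log10 values.
locus_dist <- function(p) {
  G <- outer(p, p)
  prob <- c(diag(G), 2 * G[upper.tri(G)])
  list(prob = prob, x = log10(prob))
}
dists <- lapply(freqs, locus_dist)

# Cumulant generating function K(s) of the sum, with its first two derivatives.
cgf <- function(s) {
  parts <- vapply(dists, function(d) {
    w  <- d$prob * exp(s * d$x)   # exponentially tilted weights
    m0 <- sum(w)
    m1 <- sum(w * d$x) / m0       # tilted mean
    m2 <- sum(w * d$x^2) / m0     # tilted second moment
    c(K = log(m0), K1 = m1, K2 = m2 - m1^2)
  }, c(K = 0, K1 = 0, K2 = 0))
  rowSums(parts)
}
k1 <- function(s) cgf(s)[["K1"]]

# Saddlepoint approximation to the density at x:
# solve K'(s) = x, then f(x) ~ exp(K(s) - s * x) / sqrt(2 * pi * K''(s)).
spdf <- function(x, s_range = c(-20, 20)) {
  s_hat <- uniroot(function(s) k1(s) - x, s_range, tol = 1e-10)$root
  k <- cgf(s_hat)
  exp(k[["K"]] - s_hat * x) / sqrt(2 * pi * k[["K2"]])
}

mean_x <- cgf(0)[["K1"]]  # mean of the distribution
var_x  <- cgf(0)[["K2"]]  # variance of the distribution

# Evaluate on a grid that is guaranteed to lie inside the support,
# because K'(s) is increasing in s.
xs <- seq(k1(-3), k1(3), length.out = 41)
dens <- vapply(xs, spdf, numeric(1))
```

At the mean of the distribution the saddlepoint is `s = 0`, so the approximation reduces to the height of a normal density with the same variance. Away from the mean, the tilting lets the curve follow the skewness of the exact distribution.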

By default the three directional measures are shown as matrices with the baseline populations as the columns and the comparison populations as the rows.

By default, the outputs also include the full genetic distributions of the two reference populations, approximated using the saddlepoint approximation. The genetic distribution of a population B is the distribution of log-genotype probabilities for all possible genotypes that could arise from the population, given its estimated allele frequencies. It shows the range of fits to the population that are possible for individuals that could arise from it.
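
For a very small example, the genetic distribution described above can be enumerated exactly, which may help to show what the saddlepoint curve approximates. The sketch below assumes known (invented) allele frequencies, Hardy-Weinberg proportions and independent loci. `locus_probs` and `lgp_quantile` are hypothetical helper names, and brute-force enumeration is only feasible for a handful of loci.

```r
# Exact "genetic distribution" of a toy population by brute-force enumeration.
# Assumptions: known allele frequencies (invented), Hardy-Weinberg, independent loci.
freqs <- list(c(0.7, 0.2, 0.1), c(0.6, 0.4), c(0.5, 0.3, 0.2))

# Probabilities of all unordered genotypes at one locus.
locus_probs <- function(p) {
  G <- outer(p, p)
  c(diag(G), 2 * G[upper.tri(G)])
}
per_locus <- lapply(freqs, locus_probs)

# Every multilocus genotype is one combination of single-locus genotypes.
combos <- expand.grid(per_locus)
geno_prob <- apply(combos, 1, prod)  # probability that the genotype arises
lgp <- log10(geno_prob)              # its log-genotype probability (fit)

# The distribution: each value in `lgp` occurs with probability `geno_prob`.
ord <- order(lgp)
cdf <- cumsum(geno_prob[ord])

# Smallest LGP value whose cumulative probability reaches q.
lgp_quantile <- function(q) lgp[ord][which(cdf >= q)[1]]
q01 <- lgp_quantile(0.01)
q50 <- lgp_quantile(0.5)
```

Here `q01` plays the role of the 1% quantile line on the distribution plots: roughly 1% of individuals that could arise from the population would have a fit at or below it.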

Value

A list with components:

overlap_results: Overlap Area values calculated using each of the two reference populations as the baseline in turn. Comparison populations are the other reference population and any additional included populations.

incumbent_results: Incumbent Selection Probability values calculated using each of the two reference populations as the baseline in turn. Comparison populations are the other reference population and any additional included populations.

home_assignment_results: Home Assignment Probability values calculated using each of the two reference populations as the baseline in turn. Comparison populations are the other reference population and any additional included populations.

spdf_results: List with two sub-lists, one for each reference population as the baseline. Within a sub-list, the entries xvals, yvals, wvals and zvals are blank unless argument plot_spdfs = TRUE or calc_vecs = TRUE. They are the raw distribution curves for plotting the ISP and HAP values. Each sub-list also contains oavals, probvals and diffvals, which are the Overlap Area, Incumbent Selection Probability and Home Assignment Probability values for the given baseline population with all the other populations as comparisons. The sub-list also records the quantile values requested, the name of the given baseline pop for this sub-list as refpopA, the indices of the baseline pop and comparison pop in the reference pops, the name of the other reference population and the number of other reference populations (always equal to one).

Author(s)

Saddlepoint approximation to distributions of genetic fit to populations developed by McMillan and Fewster, based on calculations of Log-Genotype-Probability from the method of Rannala and Mountain (1997) as implemented in GeneClass2, updated to allow for individuals with missing data and to enable accurate calculations of quantiles of the Log-Genotype-Probability distributions of the reference populations. See McMillan and Fewster (2017) for details.

References

McMillan, L. F., "Concepts of statistical analysis, visualization, and communication in population genetics" (2019). Doctoral thesis, https://researchspace.auckland.ac.nz/handle/2292/47358.

McMillan, L. and Fewster, R. "Visualizations for genetic assignment analyses using the saddlepoint approximation method" (2017) Biometrics.

Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197–9201.

Piry, S., Alapetite, A., Cornuet, J.-M., Paetkau, D., Baudouin, L., and Estoup, A. (2004). GENECLASS2: A software for genetic assignment and first-generation migrant detection. Journal of Heredity 95, 536–539.

lfmcmillan/geneplot documentation built on Nov. 27, 2024, 1:35 a.m.