extended_geneplot    R Documentation

Plot GenePlot with two additional side plots showing directional analyses of the connectivity of the two reference populations.

Description

The side plots show the genetic distribution of each reference population and the other population compared with it. The overlap area within each of the side plots corresponds to the Overlap Area measures obtained by calc_spdfs.

Usage

extended_geneplot(
  dat,
  refpopnames,
  locnames,
  includepopnames = NULL,
  quantiles_vec,
  colvec = NULL,
  shapevec = NULL,
  line_widths = NULL,
  orderpop = NULL,
  axispop = NULL,
  display_names = NULL,
  xyrange = NULL,
  ylim_input = NULL,
  mark_impute = T,
  geneplot_multiplier = 3,
  show_overlap_areas = F,
  show_legend = T,
  show_legend_below = T,
  legend_width = NULL,
  show_title = T,
  title_text = NULL,
  grayscale_quantiles = F,
  show_include_ids = F,
  prior = "Rannala",
  leave_one_out = F,
  logten = T,
  saddlepoint = T,
  rel_tol = .Machine$double.eps^0.25,
  abs_tol = rel_tol,
  npts = 1000
)

Arguments

dat

The data, in a data frame, with two columns labelled 'id' and 'pop', and with two additional columns per locus. The locus columns must be labelled in the format Loc1.a1, Loc1.a2, Loc2.a1, Loc2.a2, etc. Missing data at a locus should be marked as '0', and must be marked for BOTH alleles at that locus (an individual cannot have one allele present and the other missing). See read_genepop_format for details of how to import Genepop format data files into the appropriate format.
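As a minimal sketch of this layout, the following builds a small data frame by hand and checks the missing-data rule. The IDs, population names, locus names (EV1, EV14) and allele values are made up for illustration; only the column naming scheme and the use of 0 for missing alleles come from the description above.

```r
# Hypothetical toy data: 4 individuals, 2 loci (EV1 and EV14).
dat <- data.frame(
  id      = c("ind1", "ind2", "ind3", "ind4"),
  pop     = c("Pop1", "Pop1", "Pop2", "Pop2"),
  EV1.a1  = c(120, 122, 0, 124),   # 0 marks a missing allele
  EV1.a2  = c(122, 122, 0, 126),   # ind3 is missing BOTH alleles at EV1
  EV14.a1 = c(201, 203, 203, 205),
  EV14.a2 = c(203, 203, 205, 205),
  stringsAsFactors = FALSE
)

# Check that every locus has both allele columns and that missing data
# is never recorded for only one allele of a locus.
locnames <- c("EV1", "EV14")
for (loc in locnames) {
  a1 <- dat[[paste0(loc, ".a1")]]
  a2 <- dat[[paste0(loc, ".a2")]]
  stopifnot(!is.null(a1), !is.null(a2), all((a1 == 0) == (a2 == 0)))
}
```

A check like this is cheap to run before plotting, since a half-missing locus is easy to introduce when editing data by hand.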

refpopnames

Character vector of reference population names, that must match two values in the 'pop' column of dat. The SPDF methods currently only work for a pair of baseline populations, so refpopnames must be length 2.

locnames

Character vector, names of the loci, which must match the column names in the data. For example, if dat has columns id, pop, EV1.a1, EV1.a2, EV14.a1, EV14.a2, etc. then you could use locnames = c("EV1","EV14"). The locnames do not need to be in any particular order but all of them must be in dat.

includepopnames

Character vector (default NULL) of population names to be included in the calculations as comparison populations. The reference populations are automatically used as comparison populations for each other, but you can also add additional comparison populations using includepopnames. For example, if the reference pops are Pop1 and Pop2, and you have some new individuals which you have labelled as PopNew, then use includepopnames=c("PopNew") to compare those individuals to Pop1 and Pop2. You can specify the populations in any order, provided that they are all in dat.
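The following sketch shows what a call with an additional comparison population might look like. The data are simulated purely for illustration: sim_pop is a hypothetical helper, and the allele values, sample sizes, locus names and quantiles are arbitrary choices. Only the argument names are taken from the Usage section, and the call is skipped if the geneplot package is not installed.

```r
set.seed(1)

# Hypothetical helper: simulate n diploid genotypes at each locus, drawing
# alleles 1..4 with the given probabilities.
sim_pop <- function(popname, n, probs, locnames) {
  out <- data.frame(id = paste0(popname, "_", seq_len(n)), pop = popname,
                    stringsAsFactors = FALSE)
  for (loc in locnames) {
    out[[paste0(loc, ".a1")]] <- sample(1:4, n, replace = TRUE, prob = probs)
    out[[paste0(loc, ".a2")]] <- sample(1:4, n, replace = TRUE, prob = probs)
  }
  out
}

locnames <- paste0("Loc", 1:8)
dat <- rbind(
  sim_pop("Pop1",   30, c(0.6, 0.2, 0.1, 0.1), locnames),
  sim_pop("Pop2",   30, c(0.1, 0.1, 0.2, 0.6), locnames),
  sim_pop("PopNew",  5, c(0.25, 0.25, 0.25, 0.25), locnames)
)

# Only attempt the plot if the geneplot package is available.
if (requireNamespace("geneplot", quietly = TRUE)) {
  res <- geneplot::extended_geneplot(
    dat,
    refpopnames     = c("Pop1", "Pop2"),
    locnames        = locnames,
    includepopnames = c("PopNew"),
    quantiles_vec   = c(0.01, 0.99)
  )
}
```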

quantiles_vec

Specify which quantiles to show on the plots, as a vector of numbers between 0 and 1. They do not have to be ordered. If NULL, quantiles will not be plotted.

colvec

(Optional) Vector of colours for the populations, starting with the reference populations in the order refpopnames and followed by any included populations in the order includepopnames. The same colours are used in the GenePlot and the side plots. Colours can be specified using rgb objects, hexadecimal codes, or any of the R colour names (see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for a PDF of R colours).

shapevec

(Optional) Vector of shapes for the populations plotted on the central GenePlot. These are named shapes from the following list: "Circle", "Square", "Diamond", "TriangleUp", "TriangleDown", "OpenSquare", "OpenCircle", "OpenTriangleUp", "Plus", "Cross", "OpenDiamond", "OpenTriangleDown", "Asterisk" which correspond to the following pch values for R plots: 21, 22, 23, 24, 25, 0, 1, 2, 3, 4, 5, 6, 8. Do not use the numbers, use the words, which will be automatically converted within plot_logprob into the appropriate codes.
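For reference, the correspondence listed above can be written as a lookup table. This is only a sketch built from that list, not the package's internal conversion code, and shape_to_pch is a hypothetical name.

```r
# Shape names and the pch codes they correspond to, as listed above.
shape_to_pch <- c(
  Circle = 21, Square = 22, Diamond = 23, TriangleUp = 24, TriangleDown = 25,
  OpenSquare = 0, OpenCircle = 1, OpenTriangleUp = 2, Plus = 3, Cross = 4,
  OpenDiamond = 5, OpenTriangleDown = 6, Asterisk = 8
)

# Pass the names, not the numbers, in shapevec.
shapevec <- c("Circle", "Square", "OpenTriangleUp")
stopifnot(all(shapevec %in% names(shape_to_pch)))

# The pch codes these names stand for.
unname(shape_to_pch[shapevec])
```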

line_widths

(Optional) Vector of line widths for all the populations for the genetic distribution side plots. Default widths are 4. Length must be the number of reference populations plus the number of additional included populations.

orderpop

(Optional) Vector of names of the reference populations and include populations, to indicate the order their points should be plotted within the GenePlot. The first will be plotted first and so will appear to be "beneath" all the other populations, and the last will be plotted last and so will appear to be "above" all the other populations. Use this if you have a particular population whose individuals you are interested in and need to see on top of the rest: put this population last in orderpop. Default is NULL, in which case populations are plotted in order of size, so the population with the largest number of individuals/points is plotted at the bottom, and the population with the smallest number of individuals/points is plotted over the top, so as not to be obscured.
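The default ordering described above can be sketched in base R as follows. This illustrates the rule only and is not the package's own code; the population labels are hypothetical.

```r
# Hypothetical population labels for 10 individuals.
pop <- c(rep("Pop1", 5), rep("Pop2", 3), rep("PopNew", 2))

# Largest population first (plotted underneath), smallest last (plotted on top).
default_order <- names(sort(table(pop), decreasing = TRUE))

# To force Pop2 to be drawn on top instead, list it last.
orderpop <- c("Pop1", "PopNew", "Pop2")
```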

axispop

(Optional) Vector of reference populations, indicating which to plot as the baseline on the x-axis and which on the y-axis. The default is that the first entry in refpopnames is plotted on the x-axis.

display_names

(default refpopnames) Use this to supply alternative display names for the populations. The refpopnames, as columns in the dataset, cannot have spaces, for example, whereas the display names can have spaces.

xyrange

(Optional) Numerical limits for the GenePlot axes, which also form the x-axes of the two side plots. The plot is always symmetrical, so the same limits will be used for the x and the y axes. This is to ensure that comparisons between the two reference populations are fair. The values are a range of Log-Genotype Probability values so the maximum is 0. Default is slightly wider than the range of the calculated Log-Genotype Probabilities for all individuals in the plot.

ylim_input

(Optional) Numerical limits for the y-axes of the two side plots. Has no effect on the central GenePlot. Run with the defaults first to see what the range of values is for the given populations.

mark_impute

(default TRUE) Boolean, indicates whether to mark individuals with missing data using asterisks on the GenePlot.

geneplot_multiplier

(default 3) Ratio of width of GenePlot to sideplots. Default is 3, i.e. the GenePlot height will be three times the height of the bottom plot and the GenePlot width will be three times the width of the left-hand plot.

show_overlap_areas

(default FALSE) If TRUE, print the numerical Overlap Area values beneath the extended plot.

show_legend

(default TRUE) If TRUE, plot the legend.

show_legend_below

(default TRUE) If TRUE, show the legend as a row of coloured shapes and labels below the extended plot. If FALSE, show the legend within the GenePlot. Default is TRUE because putting the legend inside the central GenePlot can hide points from individual genotypes.

legend_width

(default NULL, which gives a width of 4cm) Change the width of the displayed legend. Units are cm.

show_title

(default TRUE) Include a title for the plot.

title_text

(Optional) Provide alternative title text for the plot.

grayscale_quantiles

(default FALSE) Used for plots with 2 reference pops. If FALSE, the quantile lines are plotted using the colvec colours. If TRUE, the quantile lines are plotted in gray: the default colours can be quite pale, so grayscale quantile lines can be easier to see than the default coloured ones.

show_include_ids

(default FALSE) Ignored unless includepopnames is supplied. If TRUE, plots individual genotypes from the additional included pops as points and also displays the ID for each genotype. This can be useful if one or more genotypes show unusual patterns of fit to the reference populations and you want to identify those genotypes.

prior

(default "Rannala") String, one of "Rannala", "Baudouin", "Half" or "Quarter", giving the choice of prior parameter for the Dirichlet priors for the allele frequency estimates. "Rannala" and "Baudouin" both define parameter values that depend on the number of alleles at each locus, k. "Baudouin" gives slightly more weight to rare alleles than "Rannala" does, or less weight to the data, so "Baudouin" may be more suitable for small reference samples, but there is no major difference between them. For more details, see McMillan and Fewster (2017), Biometrics. "Half" and "Quarter" specify parameters 1/2 and 1/4, respectively. These two options have priors whose parameters do not depend on the number of alleles at each locus, and so may be more suitable for microsatellite data with varying numbers of alleles at each locus.
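To give a feel for how the prior parameter enters the calculation, here is a single-locus sketch of the posterior genotype probability under a symmetric Dirichlet prior. It is an illustration under stated assumptions, not the package's implementation: locus_genotype_prob is a hypothetical helper, the allele counts are made up, and the value 1/k for "Rannala" is an assumption based on Rannala and Mountain (1997). The "Baudouin" parameter is not shown.

```r
# Posterior probability of a single-locus diploid genotype (a1, a2) under a
# symmetric Dirichlet(alpha, ..., alpha) prior on the allele frequencies.
# counts: named vector of allele counts in the reference sample.
locus_genotype_prob <- function(counts, a1, a2, alpha) {
  k <- length(counts)
  n <- sum(counts)
  denom <- (n + k * alpha) * (n + k * alpha + 1)
  if (a1 == a2) {
    (counts[[a1]] + alpha) * (counts[[a1]] + alpha + 1) / denom
  } else {
    2 * (counts[[a1]] + alpha) * (counts[[a2]] + alpha) / denom
  }
}

counts <- c(A = 12, B = 6, C = 2)   # hypothetical allele counts, k = 3
k <- length(counts)

alphas <- c(Rannala = 1 / k,        # assumption: 1/k per allele
            Half = 1 / 2, Quarter = 1 / 4)

# Log10 probability of the rare homozygote C/C under each prior choice.
sapply(alphas, function(a) log10(locus_genotype_prob(counts, "C", "C", a)))
```

Larger prior parameters pull the estimate for a rare allele further away from its raw sample frequency, which is why the choice matters most for small reference samples.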

leave_one_out

(default FALSE) Boolean, indicates whether or not to calculate leave-one-out results for any individual from the reference pops. If TRUE, any individual from a reference population will have their Log-Genotype-Probability with respect to their own reference population calculated after temporarily removing the individual's genotype from the sample data for that reference population. The individual's Log-Genotype-Probabilities with respect to all populations they are not a member of will be calculated as normal. We STRONGLY RECOMMEND using leave_one_out=TRUE for any small reference samples (<30).
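The idea of temporarily removing an individual's genotype can be sketched at the level of allele counts for one locus. The genotypes below are hypothetical, and this shows only the count adjustment, not the package's full calculation.

```r
# Hypothetical genotypes at one locus for a small reference sample.
a1 <- c("A", "A", "B", "C")
a2 <- c("A", "B", "B", "C")
allele_levels <- c("A", "B", "C")

# Allele counts from the full sample.
full_counts <- table(factor(c(a1, a2), levels = allele_levels))

# Leave-one-out counts for individual 4: drop that individual's two alleles
# before its fit to its own population is evaluated.
i <- 4
loo_counts <- table(factor(c(a1[-i], a2[-i]), levels = allele_levels))
```

Here individual 4 is the only carrier of allele C, so the full-sample counts would overstate how well that individual fits its own population. With a small reference sample, each individual contributes a large share of the counts, which is why leave-one-out matters most in that case.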

logten

(default TRUE) Boolean, indicates whether to use base 10 for the logarithms, or base e (i.e. natural logarithms). logten=TRUE is default because it's easier to recalculate the original non-log numbers in your head when looking at the plots. Use FALSE for natural logarithms.
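The relationship between the two scales is a constant factor, as this short sketch with made-up values shows.

```r
# Hypothetical Log-Genotype Probabilities on the log10 scale.
lgp_log10 <- c(-12.4, -8.1, -15.0)

# Back to raw probabilities: a value of -8.1 means roughly 10^-8.
raw_prob <- 10^lgp_log10

# The same values on the natural-log scale (as with logten = FALSE).
lgp_ln <- lgp_log10 * log(10)
```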

saddlepoint

(default TRUE) If TRUE, use saddlepoint approximation to impute Log-Genotype Probability for individual genotypes with missing data. If not, use an empirical approximation to impute the LGPs. Defaults to TRUE because the side plots in the extended GenePlot use the saddlepoint approximation process.

rel_tol

(default .Machine$double.eps^0.25) Specify the relative tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. The default value corresponds to the integrate default.

abs_tol

(default rel_tol) Specify the absolute tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. By default this takes the same value as rel_tol, which corresponds to the integrate default i.e. .Machine$double.eps^0.25.
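As a rough illustration of how these tolerances are passed to base R's integrate, the sketch below computes the overlap between two curves. The two normal densities are stand-ins for the curves in a side plot, and taking the overlap to be the area under the pointwise minimum of the two curves is an assumption made for this example, not a statement of the package's definition.

```r
# Two hypothetical densities standing in for the curves in a side plot.
f_self  <- function(x) dnorm(x, mean = 0, sd = 1)
f_other <- function(x) dnorm(x, mean = 1, sd = 1)

rel_tol <- .Machine$double.eps^0.25   # the integrate() default
abs_tol <- rel_tol

# Assumed definition: area under the pointwise minimum of the two curves.
overlap <- integrate(function(x) pmin(f_self(x), f_other(x)),
                     lower = -8, upper = 9,
                     rel.tol = rel_tol, abs.tol = abs_tol)
overlap$value
```

Tightening rel_tol and abs_tol makes integrate work harder for a more precise answer, and loosening them trades precision for speed.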

npts

(default 1000) Number of values to use when calculating numerical integrals (for the overlap area measures, and for the normalization constants of the distributions). Increasing this value will increase the precision of the numerical integrals but will also increase the computational cost. Reducing this below 1000 may save some computation time if you are not too concerned with the precision of the results.

Details

Suppose populations A and B are plotted on the GenePlot with the x-axis showing fit to B (i.e. Log-Genotype Probabilities with respect to B) and the y-axis showing fit to A (i.e. Log-Genotype Probabilities with respect to A).

Then the auxiliary plot at the bottom of the GenePlot will show the saddlepoint distribution plots for baseline B. The solid curve shows the genetic distribution of baseline B with itself: it is the distribution of Log-Genotype Probabilities for all genotypes that could arise from B. The dashed curve matching the colour of the other reference population, A, shows the comparison distribution of A into B. For all the genotypes that could arise from B, this shows how often those would arise in A. If B and A are very different genetically then a lot of the genotypes that would have a very good fit to B may only occur rarely in A, and the genotypes that occur commonly in A may have a poor fit to B, and this would be shown by a low amount of overlap between these two curves. On the other hand, if B and A are genetically similar then genotypes that commonly occur in A also have a good fit to B, and so there will be a high amount of overlap between the two curves.

The auxiliary plot on the left of the GenePlot will show similar curves, but for baseline A, so the solid curve is the genetic distribution of A with itself and the dashed curve is the genetic distribution of B into A. A low overlap here means that genotypes that commonly occur in B would have a poor fit to A. A high overlap here means that genotypes that commonly occur in B would have a good fit to A as well as B.

The user can also specify includepopnames, which are additional populations or groups of individuals in the dataset to be plotted on the central GenePlot, to show their fit to reference populations A and B.

This function runs the GenePlot and saddlepoint distribution calculations and produces the plot. If you want to split these up into separate code steps, then run calc_geneplot_spdfs to perform the calculations, prepare_plot_params to set up the plotting details and plot_geneplot_spdfs to produce the extended plot.

Value

Returns the output of calc_geneplot_spdfs.

Author(s)

Saddlepoint approximation to distributions of genetic fit to populations developed by McMillan and Fewster, based on calculations of Log-Genotype-Probability from the method of Rannala and Mountain (1997) as implemented in GeneClass2, updated to allow for individuals with missing data and to enable accurate calculations of quantiles of the Log-Genotype-Probability distributions of the reference populations. See McMillan and Fewster (2017) for details.

References

McMillan, L. F. (2019). Concepts of statistical analysis, visualization, and communication in population genetics. Doctoral thesis, https://researchspace.auckland.ac.nz/handle/2292/47358.

McMillan, L. and Fewster, R. (2017). Visualizations for genetic assignment analyses using the saddlepoint approximation method. Biometrics.

Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197–9201.

Piry, S., Alapetite, A., Cornuet, J.-M., Paetkau, D., Baudouin, L., and Estoup, A. (2004). GENECLASS2: A software for genetic assignment and first-generation migrant detection. Journal of Heredity 95, 536–539.

See Also

calc_spdfs, calc_geneplot_spdfs, plot_geneplot_spdfs, prepare_plot_params


lfmcmillan/geneplot documentation built on Nov. 27, 2024, 1:35 a.m.