calc_geneplot_spdfs: Lower-level function for running the calculation step of...

View source: R/extended_geneplot.R

calc_geneplot_spdfs R Documentation

Lower-level function for running the calculation step of extended_geneplot. Carries out the GenePlot calculations and the saddlepoint distribution calculations.

Description

Use before prepare_plot_params and plot_geneplot_spdfs.

Usage

calc_geneplot_spdfs(
  dat,
  refpopnames,
  locnames,
  includepopnames = NULL,
  quantiles_vec,
  prior,
  logten = TRUE,
  saddlepoint = TRUE,
  leave_one_out = FALSE,
  rel_tol = NULL,
  abs_tol = NULL,
  npts = 1000
)

Arguments

dat

The data, in a data frame, with two columns labelled as 'id' and 'pop', and with two additional columns per locus. Missing data at any locus should be marked as '0' for each allele. The locus columns must be labelled in the format Loc1.a1, Loc1.a2, Loc2.a1, Loc2.a2, etc. Missing data must be for BOTH alleles at any locus. See read_genepop_format for details of how to import Genepop format data files into the appropriate format.
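As a minimal sketch of the layout described above, the following base R code builds a small data frame in this format. The individual IDs, population names, locus names and allele values are invented for illustration, and half_missing is a hypothetical helper, not a function from the geneplot package.

```r
# Toy data in the required layout: 'id', 'pop', then two columns per locus.
# All values here are invented for illustration.
dat <- data.frame(
  id      = c("ind1", "ind2", "ind3", "ind4"),
  pop     = c("Pop1", "Pop1", "Pop2", "Pop2"),
  Loc1.a1 = c(120, 122, 0, 124),
  Loc1.a2 = c(122, 122, 0, 126),
  Loc2.a1 = c(200, 204, 202, 202),
  Loc2.a2 = c(204, 204, 206, 202),
  stringsAsFactors = FALSE
)

# Hypothetical check: flag individuals with exactly one allele marked as
# missing ('0') at a locus, which the format does not allow.
half_missing <- function(dat, locus) {
  a1 <- dat[[paste0(locus, ".a1")]]
  a2 <- dat[[paste0(locus, ".a2")]]
  xor(a1 == 0, a2 == 0)
}

any(half_missing(dat, "Loc1"))  # FALSE: ind3 is missing both alleles
```

Running a check like this before the calculation step catches rows where only one allele of a pair was coded as missing.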

refpopnames

Character vector of reference population names, which must match two values in the 'pop' column of dat. The SPDF methods currently only work for a pair of baseline populations, so refpopnames must have length 2.

locnames

Character vector, names of the loci, which must match the column names in the data. For example, if dat has columns id, pop, EV1.a1, EV1.a2, EV14.a1, EV14.a2, etc., then you could use locnames = c("EV1","EV14"). The locnames do not need to be in any particular order, but all of them must be in dat.
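If the locus columns follow the naming convention exactly, the locus names can be derived from the column names rather than typed by hand. The sketch below uses only base R; the column names are the ones from the example in this argument's description, and the variable names are made up for illustration.

```r
# Column names of a hypothetical data set in the required format.
cols <- c("id", "pop", "EV1.a1", "EV1.a2", "EV14.a1", "EV14.a2")

# Strip the ".a1"/".a2" suffix from the locus columns and keep unique names.
locus_cols <- setdiff(cols, c("id", "pop"))
locnames <- unique(sub("\\.a[12]$", "", locus_cols))
locnames

# Confirm that every locus has both of its allele columns present.
expected <- as.vector(outer(locnames, c(".a1", ".a2"), paste0))
all(expected %in% cols)
```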

includepopnames

Character vector (default NULL) of population names to be included in the calculations as comparison populations. The reference populations are automatically used as comparison populations for each other, but you can also add additional comparison populations using includepopnames. For example, if the reference pops are Pop1 and Pop2, and you have some new individuals which you have labelled as PopNew, then use includepopnames=c("PopNew") to compare those individuals to Pop1 and Pop2. You can specify the populations in any order, provided that they are all in dat.

quantiles_vec

Specify which quantiles to show on the plots, as a vector of numbers between 0 and 1. They do not have to be ordered. If NULL, quantiles will not be plotted.

prior

(default "Rannala") String, either "Rannala" or "Baudouin", giving the choice of prior parameter for the Dirichlet priors for the allele frequency estimates. Both options define parameter values that depend on the number of alleles at each locus, k. "Baudouin" gives slightly more weight to rare alleles than "Rannala" does, or less weight to the data, so Baudouin may be more suitable for small reference samples, but there is no major difference between them. For more details, see McMillan and Fewster (2017), Biometrics. Additional options are "Half" or "Quarter" which specify parameters 1/2 or 1/4, respectively. These options have priors whose parameters do not depend on the number of alleles at each locus, and so may be more suitable for microsatellite data with varying numbers of alleles at each locus.
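To illustrate how a Dirichlet prior parameter shifts weight between the observed data and rare alleles, the sketch below computes posterior mean allele frequencies, (n_j + alpha) / (n + k * alpha), for the fixed-parameter options "Half" and "Quarter". The allele counts are invented, and this is the standard Dirichlet-multinomial posterior mean, shown as an illustration rather than code taken from the package.

```r
# Invented allele counts at one locus in a reference sample (k = 3 alleles).
counts <- c(A = 18, B = 2, C = 0)

# Posterior mean allele frequencies under a symmetric Dirichlet(alpha) prior.
posterior_mean <- function(counts, alpha) {
  (counts + alpha) / (sum(counts) + length(counts) * alpha)
}

freq_half    <- posterior_mean(counts, alpha = 1/2)  # "Half"
freq_quarter <- posterior_mean(counts, alpha = 1/4)  # "Quarter"

# The larger prior parameter gives the unobserved allele C more weight.
round(rbind(Half = freq_half, Quarter = freq_quarter), 4)
```

In this toy example the unobserved allele C receives a frequency of about 0.023 under alpha = 1/2 and about 0.012 under alpha = 1/4, so a larger parameter pulls the estimates further from the raw sample proportions.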

logten

(default TRUE) Boolean, indicates whether to use base 10 for the logarithms, or base e (i.e. natural logarithms). logten=TRUE is default because it's easier to recalculate the original non-log numbers in your head when looking at the plots. Use FALSE for natural logarithms.
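The two log scales differ only by a constant factor, so results can be converted after the fact. The short base R sketch below uses an invented genotype probability to show the conversion in both directions.

```r
# Invented genotype probability, for illustration only.
p <- 1e-12

lgp_base10  <- log10(p)  # about -12, easy to read back as 10^-12
lgp_natural <- log(p)    # natural logarithm of the same probability

# Convert a natural-log value to base 10 by dividing by log(10).
converted <- lgp_natural / log(10)
all.equal(converted, lgp_base10)

# Recover the original probability from the base-10 value.
all.equal(10^lgp_base10, p)
```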

saddlepoint

(default TRUE) If TRUE, use the saddlepoint approximation to impute Log-Genotype Probabilities (LGPs) for individual genotypes with missing data. If FALSE, use an empirical approximation to impute the LGPs. Defaults to TRUE because the side plots in the extended GenePlot use the saddlepoint approximation process.

leave_one_out

(default FALSE) Boolean, indicates whether or not to calculate leave-one-out results for any individual from the reference pops. If TRUE, any individual from a reference population will have their Log-Genotype-Probability with respect to their own reference population calculated after temporarily removing the individual's genotype from the sample data for that reference population. The individual's Log-Genotype-Probabilities with respect to all populations they are not a member of will be calculated as normal. We STRONGLY RECOMMEND using leave_one_out=TRUE for any small reference samples (<30).
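The effect of leaving an individual out is easiest to see in the allele counts of a small reference sample. The sketch below is an illustration in base R with invented genotypes at a single locus; count_alleles is a hypothetical helper and does not reproduce the package's internal calculations.

```r
# Invented genotypes at one locus for a reference sample of 4 individuals.
ref <- data.frame(
  a1 = c(120, 120, 122, 124),
  a2 = c(122, 120, 122, 124)
)
alleles <- c(120, 122, 124)

# Hypothetical helper: count allele copies across both allele columns.
count_alleles <- function(g) {
  table(factor(c(g$a1, g$a2), levels = alleles))
}

full_counts <- count_alleles(ref)         # individual 4 included
loo_counts  <- count_alleles(ref[-4, ])   # individual 4 left out

full_counts
loo_counts
```

Individual 4 is the only carrier of allele 124, so with the individual included the sample appears to contain two copies of that allele, and with the individual left out it contains none. In a small sample this difference can noticeably change how well the individual appears to fit their own population.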

rel_tol

(default NULL) Specify the relative tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. The default value corresponds to the integrate default i.e. .Machine$double.eps^0.25.

abs_tol

(default NULL) Specify the absolute tolerance for the numerical integration function that is used to calculate the overlap area and also the normalization constants for the various distributions. The default value corresponds to the integrate default i.e. .Machine$double.eps^0.25.
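For reference, the sketch below shows how the two tolerances behave in base R's integrate, where they are named rel.tol and abs.tol. It assumes that rel_tol and abs_tol are passed through to that function, which the descriptions above suggest but which is not confirmed here; the integrand is a standard normal density chosen purely for illustration.

```r
# Default tolerance used by integrate() for both rel.tol and abs.tol.
default_tol <- .Machine$double.eps^0.25
default_tol

# Integrate a standard normal density over a finite range with the defaults.
res_default <- integrate(dnorm, lower = -5, upper = 5)

# Same integral with tighter tolerances (values chosen for illustration).
res_tight <- integrate(dnorm, lower = -5, upper = 5,
                       rel.tol = 1e-8, abs.tol = 1e-8)

# Compare the estimates and their reported error bounds.
c(default = res_default$value, tight = res_tight$value)
c(default = res_default$abs.error, tight = res_tight$abs.error)
```

Tighter tolerances generally cost more function evaluations, so leaving both arguments as NULL is a reasonable starting point unless the overlap areas need extra precision.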

npts

(default 1000) Number of values to use when calculating numerical integrals (for the overlap area measures, and for the normalization constants of the distributions). Increasing this value will increase the precision of the numerical integrals but will also increase the computational cost. Reducing this below 1000 may save some computation time if you are not too concerned with the precision of the results.

Value

A list with the following components:

logprob GenePlot calculation results: Log-Genotype Probability (LGP) values for all individuals with respect to all of the reference populations. If there are individuals with missing values, the output includes both their raw LGPs, which are based on the loci that are present, and their imputed LGPs for the full set of loci. This output is the same as the output from calc_logprob.

spdf_vals1 Saddlepoint distribution approximations with the first reference population as the baseline. List contains xvals, yvals, wvals and zvals, the raw distribution curves for plotting the distributions. This list also contains oavals, probvals and diffvals, which are the Overlap Area, Incumbent Selection Probability and Home Assignment Probability values for the given baseline population with all the other populations as comparisons. The sub-list also records the quantile values requested, the name of the given baseline pop for this sub-list as refpopA, the indices of the baseline pop and comparison pop in the reference pops, the name of the other reference population and the number of other reference populations (always equal to one).

spdf_vals2 As for spdf_vals1, but with the second reference population as the baseline.
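The following sketch shows one way the call and the returned components might fit together. It only defines a wrapper and does not run the calculation. It assumes that the geneplot package is installed and exports calc_geneplot_spdfs, and that dat is in the format described under Arguments; the wrapper name, population names, locus names and quantiles are hypothetical.

```r
# Sketch only: assumes the geneplot package is installed and exports
# calc_geneplot_spdfs. Population and locus names below are hypothetical.
run_geneplot_calc <- function(dat) {
  res <- geneplot::calc_geneplot_spdfs(
    dat,
    refpopnames     = c("Pop1", "Pop2"),   # exactly two reference pops
    locnames        = c("Loc1", "Loc2"),
    includepopnames = c("PopNew"),
    quantiles_vec   = c(0.01, 0.99),
    prior           = "Rannala",
    leave_one_out   = TRUE                 # recommended for small samples
  )

  # Pull out the components described in the Value section.
  list(
    logprob      = res$logprob,
    overlap_pop1 = res$spdf_vals1$oavals,  # first reference pop as baseline
    overlap_pop2 = res$spdf_vals2$oavals   # second reference pop as baseline
  )
}
```

The full result, rather than this reduced list, is what would normally be kept for the later prepare_plot_params and plot_geneplot_spdfs steps.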


lfmcmillan/geneplot documentation built on Nov. 27, 2024, 1:35 a.m.