doRun: Run Approximate Bayesian Computation for Phylogenetic...
In bomeara/treevo: Using ABC to Understand Trait Evolution

doRun

R Documentation

Run Approximate Bayesian Computation for Phylogenetic Comparative Methods

Description

The doRun functions are the main interface for TreEvo users to do Approximate Bayesian Computation (ABC) analysis, effectively wrapping the simulateWithPriors functions to perform simulations, which themselves are wrappers for the doSimulation function. The two current doRun functions are doRun_prc which applies a partial rejection control (PRC) ABC analysis over multiple generations of simulations, and doRun_rej which performs a full rejection ('rej') ABC analysis.

Usage

doRun_prc(
  phy,
  traits,
  intrinsicFn,
  extrinsicFn,
  startingPriorsValues,
  startingPriorsFns,
  intrinsicPriorsValues,
  intrinsicPriorsFns,
  extrinsicPriorsValues,
  extrinsicPriorsFns,
  numParticles = 300,
  nStepsPRC = 5,
  nRuns = 2,
  nInitialSims = NULL,
  nInitialSimsPerParam = 100,
  generation.time = 1000,
  TreeYears = max(branching.times(phy)) * 1e+06,
  multicore = FALSE,
  coreLimit = NA,
  multicoreSuppress = FALSE,
  standardDevFactor = 0.2,
  epsilonProportion = 0.7,
  epsilonMultiplier = 0.7,
  validation = "CV",
  scale = TRUE,
  variance.cutoff = 95,
  stopRule = FALSE,
  stopValue = 0.05,
  maxAttempts = Inf,
  diagnosticPRCmode = FALSE,
  jobName = NA,
  saveData = FALSE,
  verboseParticles = TRUE
)

doRun_rej(
  phy,
  traits,
  intrinsicFn,
  extrinsicFn,
  startingPriorsValues,
  startingPriorsFns,
  intrinsicPriorsValues,
  intrinsicPriorsFns,
  extrinsicPriorsValues,
  extrinsicPriorsFns,
  generation.time = 1000,
  TreeYears = max(branching.times(phy)) * 1e+06,
  multicore = FALSE,
  coreLimit = NA,
  validation = "CV",
  scale = TRUE,
  nInitialSims = NULL,
  nInitialSimsPerParam = 100,
  variance.cutoff = 95,
  standardDevFactor = 0.2,
  jobName = NA,
  abcTolerance = 0.1,
  checkpointFile = NULL,
  checkpointFreq = 24,
  savesims = FALSE
)

Arguments

`phy`	A phylogenetic tree, in package `ape`'s `phylo` format.
`traits`	Data matrix with rownames identical to `phy@tip.label`. If a vector, `traits` will be coerced to a matrix, with element names as rownames.
`intrinsicFn`	Name of (previously-defined) function that governs how traits evolve within a lineage, regardless of the trait values of other taxa.
`extrinsicFn`	Name of (previously-defined) function that governs how traits evolve within a lineage, based on their own ('internal') trait vlaue and the trait values of other taxa.
`startingPriorsValues`	A list of the same length as the number of prior distributions specified in `startingPriorsFns` (for starting values, this should be one prior function specified for each trait - thus one for most univariate trait analyses), with each element of the list a vector the same length as the appropriate number of parameters for that prior distribution (1 for `"fixed"`, 2 for `"uniform"`, 2 for `"normal"`, 2 for `"lognormal"`, 2 for `"gamma"`, 1 for `"exponential"`).
`startingPriorsFns`	Vector containing names of prior distributions to use for root states: can be one of `"fixed"`, `"uniform"`, `"normal"`, `"lognormal"`, `"gamma"`, `"exponential"`.
`intrinsicPriorsValues`	A list of the same length as the number of prior distributions specified in `intrinsicPriorsFns` (one prior function specified for each parameter in the intrinsic model), with each element of the list a vector the same length# as the appropriate number of parameters for that prior distribution (1 for `"fixed"`, 2 for `"uniform"`, 2 for `"normal"`, 2 for `"lognormal"`, 2 for `"gamma"`, 1 for `"exponential"`).
`intrinsicPriorsFns`	Vector containing names of prior distributions to use for intrinsic function parameters: can be one of `"fixed"`, `"uniform"`, `"normal"`, `"lognormal"`, `"gamma"`, `"exponential"`.
`extrinsicPriorsValues`	A list of the same length as the number of prior distributions specified in `extrinsicPriorsFns` (one prior function specified for each parameter in the extrinsic model), with each element of the list a vector the same length as the appropriate number of parameters for that prior distribution (1 for `"fixed"`, 2 for `"uniform"`, 2 for `"normal"`, 2 for `"lognormal"`, 2 for `"gamma"`, 1 for `"exponential"`).
`extrinsicPriorsFns`	Vector containing names of prior distributions to use for extrinsic function parameters: can be one of `"fixed"`, `"uniform"`, `"normal"`, `"lognormal"`, `"gamma"`, `"exponential"`.
`numParticles`	Number of accepted particles per PRC generation.
`nStepsPRC`	Number of PRC generations to run.
`nRuns`	Number of independent PRC runs to be performed, each consisting of independent sets of initial simulations and PRC generations. Note that runs are run sequentially, and not in parallel, as the generation of particles within each run makes use of the multicore functionality. If `nRuns` is greater than 1, the output from `doRun_prc` will be a list object composed of multiple output lists, as described.
`nInitialSims`	Number of initial simulations used to calibrate particle rejection control algorithm. If not given, this will be a function of the number of freely varying parameters.
`nInitialSimsPerParam`	If `nInitialSims` is not given by the user, the number of initial simulations performed to calibrate the particle rejection algorithm will instead be the number of free parameters in the model multiplied by `nInitialSimsPerParam`.
`generation.time`	The number of years per generation. This sets the coarseness of the simulation; if it's set to 1000, for example, the population's trait values change every 1000 calendar years. Note that this is in calendar years (see description for argument `TreeYears`), and not in millions of years (as is typical for dated trees in macroevolutionary studies). Thus, if a branch is 1 million-year time-unit long, and a user applies the default `generation.time = 1000`, then 1000 evolutionary changes will be simulated along that branch. See documentation for `doSimulation` for further details.
`TreeYears`	The amount of calendar time from the root to the furthest tip. Most trees in macroevolutionary studies are dated with branch lengths in units of millions of years, and thus the default for this argument is `max(branching.times(phy)) * 1e6`. If your tree has the most recent tip at time zero (i.e., the modern day), this would be the same as the root age of the tree. If your branch lengths are not in millions of years, you should alter this argument. Otherwise, leave this argument alone. See documentation for `doSimulation` for further details.
`multicore`	Whether to use multicore, default is `FALSE`. If `TRUE`, one of two suggested packages must be installed, either `doMC` (for UNIX systems) or `doParallel` (for Windows), which are used to activate multithreading. If neither package is installed, this function will fail if `multicore = TRUE`.
`coreLimit`	Maximum number of cores to be used.
`multicoreSuppress`	This argument suppresses the multicore code and will apply a plain vanilla serial `for()` loop instead of `doPar`. Mainly to be used for developer diagnostic purposes.
`standardDevFactor`	Standard deviation for mutating states each time a new particle is generated in a PRC generation.
`epsilonProportion`	Sets tolerance for initial simulations.
`epsilonMultiplier`	Sets tolerance on subsequent PRC generations.
`validation`	Character argument controlling what validation procedure is used by `plsr`. Default is `"CV"` for cross-validation.
`scale`	This argument is passed to `plsr`. It may be a numeric vector, or logical. If numeric vector, the input is scaled by dividing each variable with the corresponding element of scale. If `scale = TRUE`, the inpus is scaled by dividing each variable by its sample standard deviation. If cross-validation is selected (the default for `returnPLSModel`), scaling by the standard deviation is done for every segment.
`variance.cutoff`	Minimum threshold percentage of variance explained for the number of components included in the final PLS model fit. This value is a percentage and must be between 0 and 100. Default is 95 percent.
`stopRule`	If `TRUE`, an analysis will be terminated prior to set number of generations if the ratio of all parameters from one generation to the next falls below the `stopValue`.
`stopValue`	Threshold value for terminating an analysis prior to `nStepesPRC`.
`maxAttempts`	Maximum number attempts made in `while` loop within the PRC algorithm. If reached, the algorithm will terminate with an error message. By default, this is infinite, and thus there is no effective limit without user intervention.
`diagnosticPRCmode`	If `TRUE` (not the default), the function will be very noisy about characteristics of the PRC algorithm as it runs.
`jobName`	Optional job name.
`saveData`	Option to save various run information during the analysis, including summary statistics from analyses, output to external .Rdata and .txt files.
`verboseParticles`	If `TRUE` (the default), a large amount of information about parameter estimates and acceptance of particles is output to console via `message` as `doRun_prc` runs.
`abcTolerance`	Proportion of accepted simulations.
`checkpointFile`	Optional file name for checkpointing simulations.
`checkpointFreq`	Saving frequency for checkpointing.
`savesims`	Option to save individual simulations, output to a .Rdata file.

Details

Both doRun functions take an input phylogeny (phy), observed trait data set (traits), models (intrinsicFn, extrinsicFn), and priors (startingPriorsValues, startingPriorsFns, intrinsicPriorsValues, intrinsicPriorsFns, extrinsicPriorsValues, extrinsicPriorsFns). Pulling from the priors, it simulates an initial set of simulations (nInitialSims). This set of simulations is boxcox transformed , and a PLS regression (see methodsPLS) is performed for each free parameter to determine the most informative summary statistics (using variance.cutoff). The euclidean distance is calculated between each each initial simulation's most informative summary statistics and the input observed data.

Following that step, the approach of the two functions diverge drastically.

For function doRun_rej, those simulations whose most informative summary statistics fall below abcTolerance are kept as accepted 'particles' (simulations runs), describing the posterior distribution of parameters. No additional generations of simulations are performed.

Coversely, function doRun_prc performs an ABC-PRC analysis, which follows a much more complicated algorithm with additional generations. In PRC, tolerance is set based on epsilonProportion and epsilonMultiplier, and the analysis begins with generation 1 and continues through a number of generations equal to nStepsPRC. Single simulation runs ('particles') are accepted if the distance of the most informative summary stats to the original data is less than this tolerance. A generation is complete when enough particles (numParticles) have been accepted. These particles make up the distribution for the next generation. The accepted particles from the final generation describe the posterior distributions of parameters.

Value

The output of these two functions are lists, composed of multiple objects, which differ slightly in their content among the two functions. For doRun_prc, the output is:

input.data: Input variables: jobName, nTaxa (number of taxa), nInitialSims, generation.time, TreeYears, timeStep, totalGenerations, epsilonProportion, epsilonMultiplier, nRuns (number of runs), nStepsPRC, numParticles, standardDevFactor.
priorList: List of prior distributions. This is used for doing post-analysis comparisons between prior and posterior distributions, such as with function plotPosteriors.
particleDataFrame: DataFrame with information from each simulation, including generation, attempt, id, parentid, distance, weight, and parameter states.
toleranceVector: Final tolerance vector.
phy: Input phylogeny.
traits: Input traits.
simTime: Processor time for initial simulations.
time.per.gen: Processor time for subsequent generations.
postSummary: Summarizes the posterior distribution from the final generation for all free parameters, giving the mean, standard deviation and Highest Posterior Density (at a 0.8 alpha) for each parameter.

If nRuns is greater than 1, the output from doRun_prc will be a list object composed of multiple output lists, as described.

For doRun_rej, the output is:

input.data: Input variables: jobName, nTaxa (number of taxa), nInitialSims, generation.time, TreeYears, timeStep, totalGenerations, epsilonProportion, epsilonMultiplier, nRuns (number of runs), nStepsPRC, numParticles, standardDevFactor.
priorList: List of prior distributions. This is used for doing post-analysis comparisons between prior and posterior distributions, such as with function plotPosteriors.
phy: Input phylogeny.
traits: Input traits.
trueFreeValuesANDSummaryValues: Parameter estimates and summary stats from all sims.
simTime: Processor time for simulations.
abcDistancesRaw: Euclidean distances for each simulation and free parameter.
particleDataFrame: DataFrame with information from each simulation, including generation, attempt, id, parentid, distance, weight, and parameter states.
postSummary: Summarizes the posterior distribution from the final generation for all free parameters, giving the mean, standard deviation and Highest Posterior Density (at a 0.8 alpha) for each parameter.
parMeansList: A list of parameter means for both fixed and unfixed parameters, sorted into a list of three vectors (starting, intrinsic, extrinsic).

Author(s)

Brian O'Meara and Barb Banbury

References

Sisson et al. 2007, Wegmann et al. 2009

Examples



set.seed(1)
data(simRunExample)

# NOTE: the example analyses below sample too few particles, 
    # over too few steps, with too few starting simulations
    # - all for the sake of examples that demonstrate these
    # within a reasonable time-frame

## Please set these values to more realistic levels for your analyses!

resultsPRC <- doRun_prc(
  phy = simPhyExample, 
  traits = simCharExample, 
  intrinsicFn = brownianIntrinsic, 
  extrinsicFn = nullExtrinsic, 
  startingPriorsFns = "normal", 
  startingPriorsValues = list(c(mean(simCharExample[, 1]), 
         sd(simCharExample[, 1]))), 
  intrinsicPriorsFns = c("exponential"), 
  intrinsicPriorsValues = list(10), 
  extrinsicPriorsFns = c("fixed"), 
  extrinsicPriorsValues = list(0), 
  generation.time = 10000, 
  nRuns = 2, 
  nStepsPRC = 3, 
  numParticles = 20, 
  nInitialSimsPerParam = 10, 
  jobName = "examplerun_prc", 
  stopRule = FALSE, 
  multicore = FALSE, 
  coreLimit = 1
  )

resultsPRC

#one should make sure priors are uniform with doRun_rej!

resultsRej <- doRun_rej(
    phy = simPhyExample, 
    traits = simCharExample, 
    intrinsicFn = brownianIntrinsic, 
    extrinsicFn = nullExtrinsic, 
    startingPriorsFns = "normal", 
    startingPriorsValues = list(c(mean(simCharExample[, 1]),
         sd(simCharExample[, 1]))), 
    intrinsicPriorsFns = c("exponential"), 
    intrinsicPriorsValues = list(10), 
    extrinsicPriorsFns = c("fixed"), 
    extrinsicPriorsValues = list(0), 
    generation.time = 10000, 
    jobName = "examplerun_rej", 
    abcTolerance = 0.05, 
    multicore = FALSE, 
    coreLimit = 1
    )

resultsRej

bomeara/treevo documentation built on Aug. 19, 2023, 6:52 p.m.