doRun: Run Approximate Bayesian Computation for Phylogenetic...

doRunR Documentation

Run Approximate Bayesian Computation for Phylogenetic Comparative Methods

Description

The doRun functions are the main interface for TreEvo users to do Approximate Bayesian Computation (ABC) analysis, effectively wrapping the simulateWithPriors functions to perform simulations, which themselves are wrappers for the doSimulation function. The two current doRun functions are doRun_prc which applies a partial rejection control (PRC) ABC analysis over multiple generations of simulations, and doRun_rej which performs a full rejection ('rej') ABC analysis.

Usage

doRun_prc(
  phy,
  traits,
  intrinsicFn,
  extrinsicFn,
  startingPriorsValues,
  startingPriorsFns,
  intrinsicPriorsValues,
  intrinsicPriorsFns,
  extrinsicPriorsValues,
  extrinsicPriorsFns,
  numParticles = 300,
  nStepsPRC = 5,
  nRuns = 2,
  nInitialSims = NULL,
  nInitialSimsPerParam = 100,
  generation.time = 1000,
  TreeYears = max(branching.times(phy)) * 1e+06,
  multicore = FALSE,
  coreLimit = NA,
  multicoreSuppress = FALSE,
  standardDevFactor = 0.2,
  epsilonProportion = 0.7,
  epsilonMultiplier = 0.7,
  validation = "CV",
  scale = TRUE,
  variance.cutoff = 95,
  stopRule = FALSE,
  stopValue = 0.05,
  maxAttempts = Inf,
  diagnosticPRCmode = FALSE,
  jobName = NA,
  saveData = FALSE,
  verboseParticles = TRUE
)

doRun_rej(
  phy,
  traits,
  intrinsicFn,
  extrinsicFn,
  startingPriorsValues,
  startingPriorsFns,
  intrinsicPriorsValues,
  intrinsicPriorsFns,
  extrinsicPriorsValues,
  extrinsicPriorsFns,
  generation.time = 1000,
  TreeYears = max(branching.times(phy)) * 1e+06,
  multicore = FALSE,
  coreLimit = NA,
  validation = "CV",
  scale = TRUE,
  nInitialSims = NULL,
  nInitialSimsPerParam = 100,
  variance.cutoff = 95,
  standardDevFactor = 0.2,
  jobName = NA,
  abcTolerance = 0.1,
  checkpointFile = NULL,
  checkpointFreq = 24,
  savesims = FALSE
)

Arguments

phy

A phylogenetic tree, in package ape's phylo format.

traits

Data matrix with rownames identical to phy@tip.label. If a vector, traits will be coerced to a matrix, with element names as rownames.

intrinsicFn

Name of (previously-defined) function that governs how traits evolve within a lineage, regardless of the trait values of other taxa.

extrinsicFn

Name of (previously-defined) function that governs how traits evolve within a lineage, based on their own ('internal') trait vlaue and the trait values of other taxa.

startingPriorsValues

A list of the same length as the number of prior distributions specified in startingPriorsFns (for starting values, this should be one prior function specified for each trait - thus one for most univariate trait analyses), with each element of the list a vector the same length as the appropriate number of parameters for that prior distribution (1 for "fixed", 2 for "uniform", 2 for "normal", 2 for "lognormal", 2 for "gamma", 1 for "exponential").

startingPriorsFns

Vector containing names of prior distributions to use for root states: can be one of "fixed", "uniform", "normal", "lognormal", "gamma", "exponential".

intrinsicPriorsValues

A list of the same length as the number of prior distributions specified in intrinsicPriorsFns (one prior function specified for each parameter in the intrinsic model), with each element of the list a vector the same length# as the appropriate number of parameters for that prior distribution (1 for "fixed", 2 for "uniform", 2 for "normal", 2 for "lognormal", 2 for "gamma", 1 for "exponential").

intrinsicPriorsFns

Vector containing names of prior distributions to use for intrinsic function parameters: can be one of "fixed", "uniform", "normal", "lognormal", "gamma", "exponential".

extrinsicPriorsValues

A list of the same length as the number of prior distributions specified in extrinsicPriorsFns (one prior function specified for each parameter in the extrinsic model), with each element of the list a vector the same length as the appropriate number of parameters for that prior distribution (1 for "fixed", 2 for "uniform", 2 for "normal", 2 for "lognormal", 2 for "gamma", 1 for "exponential").

extrinsicPriorsFns

Vector containing names of prior distributions to use for extrinsic function parameters: can be one of "fixed", "uniform", "normal", "lognormal", "gamma", "exponential".

numParticles

Number of accepted particles per PRC generation.

nStepsPRC

Number of PRC generations to run.

nRuns

Number of independent PRC runs to be performed, each consisting of independent sets of initial simulations and PRC generations. Note that runs are run sequentially, and not in parallel, as the generation of particles within each run makes use of the multicore functionality. If nRuns is greater than 1, the output from doRun_prc will be a list object composed of multiple output lists, as described.

nInitialSims

Number of initial simulations used to calibrate particle rejection control algorithm. If not given, this will be a function of the number of freely varying parameters.

nInitialSimsPerParam

If nInitialSims is not given by the user, the number of initial simulations performed to calibrate the particle rejection algorithm will instead be the number of free parameters in the model multiplied by nInitialSimsPerParam.

generation.time

The number of years per generation. This sets the coarseness of the simulation; if it's set to 1000, for example, the population's trait values change every 1000 calendar years. Note that this is in calendar years (see description for argument TreeYears), and not in millions of years (as is typical for dated trees in macroevolutionary studies). Thus, if a branch is 1 million-year time-unit long, and a user applies the default generation.time = 1000, then 1000 evolutionary changes will be simulated along that branch. See documentation for doSimulation for further details.

TreeYears

The amount of calendar time from the root to the furthest tip. Most trees in macroevolutionary studies are dated with branch lengths in units of millions of years, and thus the default for this argument is max(branching.times(phy)) * 1e6. If your tree has the most recent tip at time zero (i.e., the modern day), this would be the same as the root age of the tree. If your branch lengths are not in millions of years, you should alter this argument. Otherwise, leave this argument alone. See documentation for doSimulation for further details.

multicore

Whether to use multicore, default is FALSE. If TRUE, one of two suggested packages must be installed, either doMC (for UNIX systems) or doParallel (for Windows), which are used to activate multithreading. If neither package is installed, this function will fail if multicore = TRUE.

coreLimit

Maximum number of cores to be used.

multicoreSuppress

This argument suppresses the multicore code and will apply a plain vanilla serial for() loop instead of doPar. Mainly to be used for developer diagnostic purposes.

standardDevFactor

Standard deviation for mutating states each time a new particle is generated in a PRC generation.

epsilonProportion

Sets tolerance for initial simulations.

epsilonMultiplier

Sets tolerance on subsequent PRC generations.

validation

Character argument controlling what validation procedure is used by plsr. Default is "CV" for cross-validation.

scale

This argument is passed to plsr. It may be a numeric vector, or logical. If numeric vector, the input is scaled by dividing each variable with the corresponding element of scale. If scale = TRUE, the inpus is scaled by dividing each variable by its sample standard deviation. If cross-validation is selected (the default for returnPLSModel), scaling by the standard deviation is done for every segment.

variance.cutoff

Minimum threshold percentage of variance explained for the number of components included in the final PLS model fit. This value is a percentage and must be between 0 and 100. Default is 95 percent.

stopRule

If TRUE, an analysis will be terminated prior to set number of generations if the ratio of all parameters from one generation to the next falls below the stopValue.

stopValue

Threshold value for terminating an analysis prior to nStepesPRC.

maxAttempts

Maximum number attempts made in while loop within the PRC algorithm. If reached, the algorithm will terminate with an error message. By default, this is infinite, and thus there is no effective limit without user intervention.

diagnosticPRCmode

If TRUE (not the default), the function will be very noisy about characteristics of the PRC algorithm as it runs.

jobName

Optional job name.

saveData

Option to save various run information during the analysis, including summary statistics from analyses, output to external .Rdata and .txt files.

verboseParticles

If TRUE (the default), a large amount of information about parameter estimates and acceptance of particles is output to console via message as doRun_prc runs.

abcTolerance

Proportion of accepted simulations.

checkpointFile

Optional file name for checkpointing simulations.

checkpointFreq

Saving frequency for checkpointing.

savesims

Option to save individual simulations, output to a .Rdata file.

Details

Both doRun functions take an input phylogeny (phy), observed trait data set (traits), models (intrinsicFn, extrinsicFn), and priors (startingPriorsValues, startingPriorsFns, intrinsicPriorsValues, intrinsicPriorsFns, extrinsicPriorsValues, extrinsicPriorsFns). Pulling from the priors, it simulates an initial set of simulations (nInitialSims). This set of simulations is boxcox transformed , and a PLS regression (see methodsPLS) is performed for each free parameter to determine the most informative summary statistics (using variance.cutoff). The euclidean distance is calculated between each each initial simulation's most informative summary statistics and the input observed data.

Following that step, the approach of the two functions diverge drastically.

For function doRun_rej, those simulations whose most informative summary statistics fall below abcTolerance are kept as accepted 'particles' (simulations runs), describing the posterior distribution of parameters. No additional generations of simulations are performed.

Coversely, function doRun_prc performs an ABC-PRC analysis, which follows a much more complicated algorithm with additional generations. In PRC, tolerance is set based on epsilonProportion and epsilonMultiplier, and the analysis begins with generation 1 and continues through a number of generations equal to nStepsPRC. Single simulation runs ('particles') are accepted if the distance of the most informative summary stats to the original data is less than this tolerance. A generation is complete when enough particles (numParticles) have been accepted. These particles make up the distribution for the next generation. The accepted particles from the final generation describe the posterior distributions of parameters.

Value

The output of these two functions are lists, composed of multiple objects, which differ slightly in their content among the two functions. For doRun_prc, the output is:

input.data

Input variables: jobName, nTaxa (number of taxa), nInitialSims, generation.time, TreeYears, timeStep, totalGenerations, epsilonProportion, epsilonMultiplier, nRuns (number of runs), nStepsPRC, numParticles, standardDevFactor.

priorList

List of prior distributions. This is used for doing post-analysis comparisons between prior and posterior distributions, such as with function plotPosteriors.

particleDataFrame

DataFrame with information from each simulation, including generation, attempt, id, parentid, distance, weight, and parameter states.

toleranceVector

Final tolerance vector.

phy

Input phylogeny.

traits

Input traits.

simTime

Processor time for initial simulations.

time.per.gen

Processor time for subsequent generations.

postSummary

Summarizes the posterior distribution from the final generation for all free parameters, giving the mean, standard deviation and Highest Posterior Density (at a 0.8 alpha) for each parameter.

If nRuns is greater than 1, the output from doRun_prc will be a list object composed of multiple output lists, as described.

For doRun_rej, the output is:

input.data

Input variables: jobName, nTaxa (number of taxa), nInitialSims, generation.time, TreeYears, timeStep, totalGenerations, epsilonProportion, epsilonMultiplier, nRuns (number of runs), nStepsPRC, numParticles, standardDevFactor.

priorList

List of prior distributions. This is used for doing post-analysis comparisons between prior and posterior distributions, such as with function plotPosteriors.

phy

Input phylogeny.

traits

Input traits.

trueFreeValuesANDSummaryValues

Parameter estimates and summary stats from all sims.

simTime

Processor time for simulations.

abcDistancesRaw

Euclidean distances for each simulation and free parameter.

particleDataFrame

DataFrame with information from each simulation, including generation, attempt, id, parentid, distance, weight, and parameter states.

postSummary

Summarizes the posterior distribution from the final generation for all free parameters, giving the mean, standard deviation and Highest Posterior Density (at a 0.8 alpha) for each parameter.

parMeansList

A list of parameter means for both fixed and unfixed parameters, sorted into a list of three vectors (starting, intrinsic, extrinsic).

Author(s)

Brian O'Meara and Barb Banbury

References

Sisson et al. 2007, Wegmann et al. 2009

Examples



set.seed(1)
data(simRunExample)

# NOTE: the example analyses below sample too few particles, 
    # over too few steps, with too few starting simulations
    # - all for the sake of examples that demonstrate these
    # within a reasonable time-frame

## Please set these values to more realistic levels for your analyses!

resultsPRC <- doRun_prc(
  phy = simPhyExample, 
  traits = simCharExample, 
  intrinsicFn = brownianIntrinsic, 
  extrinsicFn = nullExtrinsic, 
  startingPriorsFns = "normal", 
  startingPriorsValues = list(c(mean(simCharExample[, 1]), 
         sd(simCharExample[, 1]))), 
  intrinsicPriorsFns = c("exponential"), 
  intrinsicPriorsValues = list(10), 
  extrinsicPriorsFns = c("fixed"), 
  extrinsicPriorsValues = list(0), 
  generation.time = 10000, 
  nRuns = 2, 
  nStepsPRC = 3, 
  numParticles = 20, 
  nInitialSimsPerParam = 10, 
  jobName = "examplerun_prc", 
  stopRule = FALSE, 
  multicore = FALSE, 
  coreLimit = 1
  )

resultsPRC

#one should make sure priors are uniform with doRun_rej!

resultsRej <- doRun_rej(
    phy = simPhyExample, 
    traits = simCharExample, 
    intrinsicFn = brownianIntrinsic, 
    extrinsicFn = nullExtrinsic, 
    startingPriorsFns = "normal", 
    startingPriorsValues = list(c(mean(simCharExample[, 1]),
         sd(simCharExample[, 1]))), 
    intrinsicPriorsFns = c("exponential"), 
    intrinsicPriorsValues = list(10), 
    extrinsicPriorsFns = c("fixed"), 
    extrinsicPriorsValues = list(0), 
    generation.time = 10000, 
    jobName = "examplerun_rej", 
    abcTolerance = 0.05, 
    multicore = FALSE, 
    coreLimit = 1
    )

resultsRej



bomeara/treevo documentation built on Aug. 19, 2023, 6:52 p.m.