input.evorates: Prepare input data for an Evolving Rates model

View source: R/FIT_main.R

input.evoratesR Documentation

Prepare input data for an Evolving Rates model

Description

This function processes tree and trait data in preparation for fitting an evorates model. Also takes additional model inputs (e.g., priors), though these may be overwritten later by run.evorates().

Usage

input.evorates(
  tree,
  trait.data,
  trait.se = NULL,
  constrain.Rsig2 = FALSE,
  trend = FALSE,
  lik.power = 1,
  sampling.scale = FALSE,
  ...
)

Arguments

tree

An object of class "phylo"

trait.data

Three options:

  • A named vector of trait values.

  • A rownamed matrix of trait values. For the moment, must be 1 column since multivariate models are not yet supported.

  • A data.frame with 2 columns: 1 with numeric data which is interpreted as trait values and 1 with string/factor data interpreted as names. If more than 1 of either kind of column is found, returns error. Will also use rownames if no string/factor column is found, but this limits data to consist of only 0-1 observations per tip.

Both multiple observations for a single tip and missing observations are allowed. In all cases, the associated names must generally match the tip labels found in tree (tree$tip.label) exactly. The only exception are numbers corresponding to internal node indices (these can be viewed by running: plot(tree); nodelabels()), which are used to assign trait values to internal nodes. As of now, the function actually completely ignores tree$node.label–I should probably fix that and plan to in the future. Be aware that this feature slightly alters tree because it works by grafting extra tips of length 0 to the appropriate nodes.

trait.se

A vector, matrix, or data.frame of trait value standard errors which must be unambiguously labeled (see trait.data). Alternatively, a single, unlabeled number that will be applied to all tips in tree (e.g., you could set it to 0 to specify all trait values are known without error). If NULL (the default), standard error is estimated for all tips in the tree while fitting the model. There are several things to note here:

  • Unlike trait.data, there can only be 1 trait standard error per tip.

  • You can use NA to specify tips that you want to estimate standard error for and Inf to specify tips that have missing trait values.

  • Generally, tips with multiple observations and no observations are automatically assigned standard errors of NA and Inf, respectively.

  • Any tips with unspecified standard error default to NA.

Any conflicting/impossible standard error specifications are corrected and return warnings.

constrain.Rsig2

TRUE or FALSE: should the rate varinace (R_sig2) parameter be constrained to 0, resulting in a simple Brownian Motion or early/late burst model? Defaults to FALSE. See Details for a definition of model parameters.

trend

TRUE or FALSE: should a trend in rates over time (R_mu) be estimated, resulting in an early/late burst or trended evorates model? Defaults to FALSE.

lik.power

A single number between 0 and 1 specifying what power to raise the likelihood function to. This is useful for sampling from "power posteriors" that shift the posterior to look more like the prior (indeed, you can set this to 0 to sample from the prior itself). Useful for model diagnostics and calculating things like Bayes Factors. Technically, you can set lik.power above 1 (a technique called "data cloning") but this is not beneficial for evorates models and we don't recommend it.

sampling.scale

TRUE or FALSE: should provided prior parameters (see ...) be interpreted on the raw scale of the data or on the transformed scale? All data passed to the Stan-based HMC sampler are transformed such that tree's total height is 1 and the variance of the trait data is 1. Defaults to FALSE, such that prior parameters are interpreted on the untransformed scale.

...

Prior arguments (see details for further information on what parameters mean):

  • In general, prior standard deviations and, in some cases, means, can be tweaked by passing arguments named "<parameter_name>_sd" and "<parameter_name>_mean", respectively. For example: R_mu_mean = 1 or R_sig2_sd = 1.

  • Prior on rate variance (R_sig2): Follows a half-Cauchy distribution (basically a half-normal distribution with extremely fat tails) with adjustable standard deviation. By default, standard deviation is set to 5 divided by the maximum height of tree (i.e., 5 on the sampling scale).

  • Prior on trend (R_mu): Follows a normal distribution with adjustable mean and standard deviation. By default, mean is set to 0 (also 0 on sampling scale), and standard deviation is set to 10 divided by the maximum height of tree (i.e., 10 on the sampling scale).

  • Prior on rate at the root (R_0): Follows a normal distribution with adjustable mean and standard deviation. By default, mean is set to variance of trait.data divided by the maximum height of tree on the natural log scale (i.e., 0 on sampling scale), and standard deviation is set to 10 (also 10 on sampling scale).

  • Prior on tip error variance (Y_sig2): Follows a truncated t distribution (basically a normal distribution with potentially fatter tails) with adjustable mean, standard deviation, and degrees of freedom. The distribution is truncated at 0 to the left such that Y_sig2 estimates are always positive (note that setting the mean to 0 results in a half-t distribution). By default, mean is set to 0, standard deviation is set to 2 times the variance of trait.data (i.e., 2 on the sampling scale), and degrees of freedom is set to 1. The degrees of freedom can be tweaked with the argument "Ysig2_df" and may be any positive number, including Inf, with lower numbers leading to a less informative priors with fatter tails. Notably, setting "Ysig2_df" to Inf makes this prior a truncated normal and much more informative!

Value

A list of call, mainly containing information on the inputted tree and trait data, trans.const, containing transformation constants used to scale data (see sampling.scale), and dat, containing the processed data that will be passed to directly to the Stan-based Hamiltonian Monte Carlo sampler.

#' @details Parameter definitions:

  • Rate variance (R_sig2, also denoted \sigma^2_{\sigma^2}): Determines how much random variation accumulates in rates over time. Specifically, independent lineages evolving for a length of time t would exhibit log-normally distributed rates with standard deviation sqrt(t * R_sig2). Another way to think about this is that the 95% credible interval for fold-changes in rates over 1 unit of time is given by exp(c(-1,1) * 1.96 * sqrt(R_sig2)), assuming R_mu = 0.

  • Trend (R_mu, also denoted \mu_{\sigma^2}): Determines whether median rates tend to decrease (if negative) or increase (if positive) over time. Specifically, the median fold-change in rate for a given lineage is exp(t * R_mu) over a length of time t. This is distinct from changes in average rates, which depends on both R_mu and R_sig2.

  • Note on combining rate variance and trend parameters: The 95% credible interval for fold-changes in rates over 1 unit of time, given a trend, is simply exp(R_mu) * exp(c(-1,1) * 1.96 * sqrt(R_sig2)) or equivalently exp(R_mu + c(-1,1) * 1.96 * sqrt(R_sig2)). This distribution of rate change is right-skewed due to the exp() function. Because of this, median changes in rates over time will always be lower than average changes. The "average change" parameter, denoted R_del or \delta_{\sigma^2}, is given by R_mu + R_sig2 / 2. Accordingly, the average fold-change in rate for a given lineage is exp(t * R_del) over a length of time t.

  • Tip error variance (Y_sig2, also denoted \sigma^2_y): Determines the among of "error" around observed trait values for tips without fixed standard errors. Specifically, raw observations for a given tip are sampled from a normal distribution centered at that tip's "true trait value" with standard deviation sqrt(Y_sig2).

  • Rate at the root (R_0, also denoted \sigma^2_0): The natural log of the rate at the root of the entire phylogeny.

  • Branchwise rates (R_i, also denoted ln \bar{\sigma^2_i}, where i is the index of an edge in tree): The natural log of the average rate along branches of the phylogeny. Note that rates are always shifting over time under this model, and these quantities are thus averages. The true rate value at any particular time point along a branch is a related, but separate, quantity. These will be NA for branches of length 0.

  • Background rate (bg_rate): The natural log of the average trait evolution rate for the entire phylogeny, given by the average of exp(R_i), with each entry weighted by its respective branch length. NAs are ignored in this calculation since they correspond to branches of length 0.

  • Branchwise rate deviations (Rdev_i, also denoted ln \bar{\sigma^2_{dev,i}}): Determines whether a branches exhibit relatively "fast" (if positive) or "slow" rates (if negative). Generally, these are the differences between the branchwise rates and geometric background rate on the natural log scale, though this depends on remove.trend). Here, the geometric background rate is defined as the weighted average of R_i, with weights corresponding to branch lengths. This helps prevent some issues with comparing highly right-skewed distributions. If remove.trend = TRUE, then branchwise rates are first "detrended" prior to calculating background rates and deviations (R_i - (-log(abs(R_mu)) - log(l_i) + log(abs(exp(R_mu * t1_i) - exp(R_mu * t0_i)))), where l is a vector of branch lengths and t0/t1 are vectors of start and end times of each branch). This basically make branchwise rate deviations determine whether branches exhibit slow/fast rates given the overall trend in rates through time. Otherwise, branchwise rate deviations will simmply indicate slow/fast branches occur at the root/tips of a tree in the presence of a strong trend.

See Also

fit.evorates is a convenient wrapper for input.evorates, run.evorates, and output.evorates

Examples

#get whale/dolphin tree/trait data
data("cet_fit")
tree <- cet_fit$call$tree
X <- cet_fit$call$trait.data

#prepare data for evorates model
input <- input.evorates(tree = tree, trait.data = trait.data)
#run the evorates model sampler (takes a couple minutes)
run <- run.evorates(input.evorates.obj = input, out.file = "test_fit", chains = 1)
#process the output from the object
fit <- output.evorates(run.evorates.obj = run)
#process the output from the files
fit <- output.evorates(run.evorates.obj = "test_fit")

#see fit.evorates() documentation for further examples



bstaggmartin/evorates documentation built on May 31, 2024, 5:56 a.m.