ratematrixJointMCMC: Estimate the evolutionary rate matrix together with the...

View source: R/ratematrixJointMCMC.R

ratematrixJointMCMCR Documentation

Estimate the evolutionary rate matrix together with the regimes using Markov-chain Monte Carlo

Description

Function runs a MCMC chain to estimate the posterior distribution of the evolutionary rate matrix (R) and the root value (phylogenetic mean). Prior distribution and starting state for the chain can be chosen among pre-defined options or manually set by the user using accompanying functions (see function 'makePrior' for more information). User NEED to provide a directory to write the files (See 'Details'). Use dir="." option to write files to the current directory or provide a name of a folder to be created.

Usage

ratematrixJointMCMC(
  data_BM,
  data_Mk,
  phy,
  prior_BM = "uniform_scaled",
  prior_Mk = "uniform",
  par_prior_Mk = c(0, 100),
  Mk_model = "SYM",
  root_Mk = "madfitz",
  smap_limit = 1e+06,
  start = "prior_sample",
  start_Q = NULL,
  gen = 1e+06,
  burn = 0.25,
  thin = 100,
  v = 50,
  w_sd = 0.2,
  w_q = 0.2,
  w_mu = 0.5,
  prop = c(0.05, 0.3, 0.3, 0.175, 0.175),
  dir = NULL,
  outname = "ratematrixJointMCMC",
  IDlen = 5,
  save.handle = TRUE
)

Arguments

data_BM

a matrix with the data. Species names need to be provided as rownames (rownames(data) == phy$tip.label). Each column is a different trait. Names for the columns are used as trait labels. If labels are not provided, the function will use default labels.

data_Mk

a named vector with the discrete data for the tips. Species names need to be provided as names for the vector. States are used to estimate the Markov model for the predictor regimes and perform stochastic mapping simulations.

phy

a phylogeny of the class "phylo". Here the stochastic maps will be simulated together with the MCMC estimation. The regimes will follow the data provided as 'data_Mk'.

prior_BM

the prior densities for the multivariate Brownian motion model. Must be one of "uniform", "uniform_scaled" (the default, see 'Details'), "empirical_mean", or the output of the "makePrior" function. See more information on 'makePrior' and in the examples below.

prior_Mk

the prior density for the Markov model for the predictor regimes. Must be one of "uniform" or "exponential" (default is "uniform").

par_prior_Mk

either a numeric vector with length 2 with the min and max for the uniform prior when 'prior_Mk = "uniform"' or a single value for the rate of the exponential distribution when 'prior_Mk = "exponential"'.

Mk_model

the Markov model fitted to the predictor regimes and to make the stochastic map simulations. One of "SYM" for symmetrical rates (default value), "ARD" for all rates different, or "ER" for equal reates.

root_Mk

the root type for the Mk model for the predictor regimes. Can be one of "madfitz" (default) or "equal".

smap_limit

the maximum number of times that a stochastic map for a given branch can be attempted. If the simulation is not finished by this number of trials then the stochastic map is rejected. If set to '0' then there is no limit. The default value is 1e6.

start

the starting state for the MCMC chain. Must be one of "prior_sample" (the default), "mle", or a sample from the prior generated with the "samplePrior" functions.

start_Q

A matrix with the starting state for the Q matrix. Default is 'NULL' and the Q matrix is sampled from its prior distribution.

gen

number of generations for the chain.

burn

the fraction of the chain for the burnin (not written to file). A numeric value between 0 and 1 (i.e., 0 means no burnin). The final number of posterior samples is equal to "(gen*burn)/thin".

thin

the number of generations to be skipped between each sample of the posterior. The final number of posterior samples is equal to "(gen*burn)/thin".

v

value for the degrees of freedom parameter of the inverse-Wishart proposal distribution for the correlation matrix. Smaller values provide larger steps and larger values provide smaller steps. (Yes, it is counterintuitive.) This needs to be a single value applied to all regimes or a vector with the same length as the number of regimes.

w_sd

the multiplying factor for the multiplier proposal on the vector of standard deviations. This can be a single value to be used for the sd of all traits for all regimes or a matrix with number of columns equal to the number of regimes and number of rows equal to the number of traits. If a matrix, then each element will be used to control the correspondent width of the standard deviation.

w_q

the multiplying factor for the multiplier proposal on the transition matrix for the Markov model fitted to the predictor traits. Need to be a single value.

w_mu

value for the width of the sliding window proposal for the vector of root values (phylogenetic mean). This can be a single value to be used for the root value of all traits or a vector of length equal to the number of traits. If a vector, then each element will be used as the width of the proposal distribution for each trait in the same order as the columns in 'data'. When 'prior="uniform_scaled"' (the default) this parameter is computed from the data.

prop

a numeric vector of length 5 with the proposal frequencies for each parameter of the model. The vector need to sum to 1. Values are i this order: phylogenetic mean (prop[1]), standard deviations (prop[2]), correlation matrix (prop[3]), transition matrix (Mk) (prop[4]), and stochastic maps (prop[5]). Default value is 'c(0.050, 0.300, 0.300, 0.175, 0.175)'.

dir

path of the directory to write the files. Has no default value (due to RCran policy). The path can be provided both as relative or absolute. It should accept Linux, Mac and Windows path formats.

outname

name for the MCMC chain (default is 'ratematrixMCMC'). Name will be used in all the files alongside a unique ID of numbers with length of 'IDlen'.

IDlen

length of digits of the numeric identifier used to name output files (default is 5).

save.handle

whether the handle for the MCMC should be saved to the directory in addition to the output files.

Details

The MCMC chain works by proposing values for the evolutionary rate matrices (R) fitted to the tree and the vector of root values (or phylogenetic mean). The proposal for the R matrices works by separating the variance-covariance matrix into a correlation matrix and a vector of standard deviations and making independent proposals for each. This scheme is called the 'separation strategy' and significantly improves the mix of the chain and also provide a intuitive distinction between the evolutionary correlation among the traits (correlation matrix) and the rates of evolution (standard deviation vector). The proposal for the root values are made all in a single step.

The function will print a series of messages to the screen. Those provide details of the setup of the chain, the unique identifier for the files and the log-likelihood of the starting value of the chain. Up to now these messages cannot be disabled.

DIRECTORY TO WRITE FILES: User need to specify the directory to write the files. Use "." to write to the current directoy. Or, for example, use "MCMC_files" to create the folder "MCMC_files" in the current directory. RCran policy prohibits the package to automaticaly write to the current directory.

DEFAULT PRIOR: The default prior distribution ('uniform_scaled') is composed by a uniform distribution on the root values with range equal to the range observed at the tip data. The size of the window used at each proposal step for the root values is equal to the width of the prior divided by 10 units. For the evolutionary rate matrix, this prior sets a uniform distribution on the correlations (spanning all possible correlation structures) and also a uniform distribution on the vector of standard deviations. The limits of the prior on the standard deviation is computed by first doing a quick Maximum Likelihood estimate of each trait under a single rate BM model and using the results to inform the magnitude of the rates. This default prior distribution might not be the best for your dataset. Keep in mind that the default behavior of the MCMC is to draw a starting point from the prior distribution. Please check the 'makePrior' function for more information on priors and how to make a custom prior distribution.

SAMPLE OF TREES: The MCMC chain can integrate the phylogenetic uncertainty or the uncertainty in the rate regimes by randomly sampling a phylogenetic tree from a list of trees. To activate this option, provide a list of 'simmap' or 'phylo' trees as the 'phy' argument. The MCMC will randomly sample a tree each proposal step. Check the 'logAnalyzer' function for more information.

MCMC DOES NOT START: It is possible that the starting point shows a very low likelihood value, resulting in the collapse of the chain. This might be a result of a random sample from a very unlikely region of the prior. We suggest that another sample of the prior is taken or that the user make a more suitable prior using the function 'makePrior'.

MCMC DOES NOT CONVERGE OR MIX: If the MCMC is taking too long to converge then the parameters of the chain might not be good for your data. First check the 'logAnalyzer' function as well as the 'computeESS'. The recommended acceptance ratio is ~ 0.24, if it is too high, then the step size of the proposals might be too small, try increasing the step size. In contrast, low acceptance ratio might be due to step sizes too large. Try to decrease the size of the steps. If the effective sample size (ESS) for the chain (see 'checkConvergence' and 'computeESS' functions) is to low for some parameter, then try to increase the proportion of times that the parameter is proposed in the MCMC.

CANNOT FIND THE POSTERIOR: The function writes the posterior into two files: The '.log' file has the log-likelihood and information about which phylogeny was used, which parameter was proposed and whether the step was accepted or not. The '.mcmc' file has the posterior for the parameters of the model. Those are identified by a name for the chain set by "outname" and an unique series of numbers set by "IDlen". Note that you will need the handle object provided as the output for the function (or saved to the directory if 'save.handle' is TRUE) to be able to load, plot and analyze the posterior distribution.

Value

Function returns the 'handle' object and writes the posterior distribution and log as files in the directory (see 'dir'). The handle is a list with the details of the MCMC chain. It is composed by: *k* the number of traits; *p* the number of R regimes fitted to the tree; *ID* the unique identifier of the run; *dir* the directory where the posterior and log files were saved; *outname* the name for the chain; *trait.names* a vector with the label for the traits; *regime.names* a vector with the label for the rate regimes; *data* the data used in the analysis; *phy* a single phylogeny or the list of phylogenies; *prior* a list with the prior functions; *start* a list with the starting parameters for the chain; *gen* the number of generations for the chain; *mcmc.par* a list with the tunning parameters for the MCMC.

Author(s)

Daniel S. Caetano and Luke J. Harmon

References

Revell, L. J., and L. J. Harmon. 2008. Testing quantitative genetic hypotheses about the evolutionary rate matrix for continuous characters. Evolutionary Ecology Research 10:311.

Revell, L. J., and D. C. Collar. 2009. Phylogenetic Analysis of the Evolutionary Correlation Using Likelihood. Evolution 63:1090–1100.

Caetano, D. S., and L. J. Harmon. 2017. ratematrix: An R package for studying evolutionary integration among several traits on phylogenetic trees. Methods in Ecology and Evolution 8:1920–1927.

Caetano, D. S., and L. J. Harmon. 2018. Estimating Correlated Rates of Trait Evolution with Uncertainty. Systematic Biology, doi: 10.1093/sysbio/syy067.


Caetanods/ratematrix documentation built on Jan. 4, 2023, 3:12 p.m.