fit_musse | R Documentation |
The Binary State Speciation and Extinction (BiSSE) model (Maddison et al. 2007) and its extension to Multiple State Speciation Extinction (MuSSE) models (FitzJohn et al. 2009, 2012), Hidden State Speciation Extinction (HiSSE) models (Beaulieu and O'meara, 2016) or Several Examined and Concealed States-dependent Speciation and Extinction (SecSSE) models (van Els et al. 2018), describe a Poissonian cladogenic process whose birth/death (speciation/extinction) rates depend on the states of an evolving discrete trait. Specifically, extant tips either go extinct or split continuously in time at Poissonian rates, and birth/death rates at each extant tip depend on the current state of the tip; lineages tansition stochastically between states acccording to a continuous-time Markov process with fixed transition rates. At the end of the simulation (i.e., at "present-day"), extant lineages are sampled according to some state-dependent probability ("sampling_fraction"), which may depend on proxy state. Optionally, tips may also be sampled continuously over time according to some Poissonian rate (which may depend on proxy state), in which case the resulting tree may not be ultrametric.
This function takes as main input a phylogenetic tree (ultrametric unless Poissonian sampling is included) and a list of tip proxy states, and fits the parameters of a BiSSE/MuSSE/HiSSE/SecSSE model to the data via maximum-likelihood. Tips can have missing (unknown) proxy states, and the function can account for biases in species sampling and biases in the identification of proxy states. The likelihood is calculated using a mathematically equivalent, but computationally more efficient variant, of the classical postorder-traversal BiSSE/MuSSE/HiSSE/SecSSE algorithm, as described by Louca (2019). This function has been optimized for large phylogenetic trees, with a relatively small number of states (i.e. Nstates<<Ntips); its time complexity scales roughly linearly with Ntips.
If you use this function for your research please cite Louca and Pennell (2020), DOI:10.1093/sysbio/syz055.
fit_musse(tree,
Nstates,
NPstates = NULL,
proxy_map = NULL,
state_names = NULL,
tip_pstates = NULL,
tip_priors = NULL,
sampling_fractions = 1,
reveal_fractions = 1,
sampling_rates = 0,
transition_rate_model = "ARD",
birth_rate_model = "ARD",
death_rate_model = "ARD",
transition_matrix = NULL,
birth_rates = NULL,
death_rates = NULL,
first_guess = NULL,
lower = NULL,
upper = NULL,
root_prior = "auto",
root_conditioning = "auto",
oldest_age = NULL,
Ntrials = 1,
Nscouts = NULL,
optim_algorithm = "nlminb",
optim_max_iterations = 10000,
optim_max_evaluations = NULL,
optim_rel_tol = 1e-6,
check_input = TRUE,
include_ancestral_likelihoods = FALSE,
Nthreads = 1,
Nbootstraps = 0,
Ntrials_per_bootstrap = NULL,
max_condition_number = 1e4,
relative_ODE_step = 0.1,
E_value_step = 1e-4,
D_temporal_resolution = 100,
max_model_runtime = NULL,
verbose = TRUE,
diagnostics = FALSE,
verbose_prefix = "")
tree |
Phylogenetic tree of class "phylo", representing the evolutionary relationships between sampled species/lineages. Unless Poissonian sampling is included in the model (option |
Nstates |
Integer, specifying the number of possible discrete states a tip can have, influencing speciation/extinction rates. For example, if |
NPstates |
Integer, optionally specifying a number of "proxy-states" that are observed instead of the underlying speciation/extinction-modulating states. To fit a HiSSE/SecSSE model, |
proxy_map |
Integer vector of size |
state_names |
Optional character vector of size |
tip_pstates |
Integer vector of size Ntips, listing the proxy state at each tip, in the same order as tips are indexed in the tree. The vector may (but need not) include names; if it does, these are checked for consistency with the tree (if |
tip_priors |
Numeric matrix of size Ntips x |
sampling_fractions |
Numeric vector of size |
reveal_fractions |
Numeric vector of size |
sampling_rates |
Numeric vector of size |
transition_rate_model |
Either a character or a 2D integer matrix of size Nstates x Nstates, specifying the model for the transition rates between states. This option controls the parametric complexity of the state transition model, i.e. the number of independent rates and the correspondence between independent and dependent rates. If a character, then it must be one of "ER", "SYM", "ARD", "SUEDE" or "SRD", as used for Mk models (see the function If an integer matrix, then it defines a custom parametric structure for the transition rates, by mapping entries of the transition matrix to a set of independent transition-rate parameters (numbered 1,2, and so on), similarly to the option |
birth_rate_model |
Either a character or an integer vector of length Nstates, specifying the model for the various birth (speciation) rates. This option controls the parametric complexity of the possible birth rates, i.e. the number of independent birth rates and the correspondence between independent and dependent birth rates. If a character, then it must be either "ER" (equal rates) or "ARD" (all rates different). If an integer vector, it must map each state to an indepedent birth-rate parameter (indexed 1,2,..). For example, the vector |
death_rate_model |
Either a character or an integer vector of length Nstates, specifying the model for the various death (extinction) rates. Similar to |
transition_matrix |
Either |
birth_rates |
Either |
death_rates |
Either |
first_guess |
Either
Start values are only relevant for fitted (i.e., non-fixed) parameters. |
lower |
Either |
upper |
Either |
root_prior |
Either a character or a numeric vector of size Nstates, specifying the prior probabilities of states for the root, i.e. the weights for obtaining a single model likelihood by averaging the root's state likelihoods. If a character, then it must be one of "flat", "empirical", "likelihoods", "max_likelihood" or "auto". "empirical" means the root's prior is set to the proportions of (estimated) extant species in each state (correcting for sampling fractions and reveal fractions, if applicable). "likelihoods" means that the computed state-likelihoods of the root are used, after normalizing to obtain a probability distribution; this is the approach used in the package |
root_conditioning |
Character, specifying an optional modification to be applied to the root's state likelihoods prior to averaging. Can be "none" (no modification), "madfitz", "herr_als", "crown" or "stem". "madfitz" and "herr_als" (after van Els, Etiene and Herrera-Alsina 2018) are the options implemented in the package |
oldest_age |
Strictly positive numeric, specifying the oldest age (time before present) to consider for fitting. If this is smaller than the tree's root age, then the tree is split into multiple subtrees at |
Ntrials |
Non-negative integer, specifying the number of trials for fitting the model, using alternative (randomized) starting parameters at each trial. A larger |
Nscouts |
Optional positive integer, number of randomly chosen starting points to consider for all fitting trials except the first one. Among all "scouted" starting points, the |
optim_algorithm |
Character, specifying the optimization algorithm for fitting. Must be one of either "optim", "nlminb" or "subplex" (requires the |
optim_max_iterations |
Integer, maximum number of iterations allowed for fitting. Only relevant for "optim" and "nlminb". |
optim_max_evaluations |
Integer, maximum number of function evaluations allowed for fitting. Only relevant for "nlminb" and "subplex" (requires the |
optim_rel_tol |
Numeric, relative tolerance for the fitted log-likelihood. |
check_input |
Logical, specifying whether to check the validity of input variables. If you are certain that all input variables are valid, you can set this to |
include_ancestral_likelihoods |
Logical, specifying whether to include the state likelihoods for each node, in the returned variables. These are the āDā variables calculated as part of the likelihood based on the subtree descending from each node, and may be used for "local" ancestral state reconstructions. |
Nthreads |
Integer, specifying the number of threads for running multiple fitting trials in parallel. Only relevant if |
Nbootstraps |
Integer, specifying an optional number of bootstrap samplings to perform, for estimating standard errors and confidence intervals of maximum-likelihood fitted parameters. If 0, no bootstrapping is performed. Typical values are 10-100. At each bootstrap sampling, a simulation of the fitted MuSSE/HiSSE model is performed, the parameters are estimated anew based on the simulation, and subsequently compared to the original fitted parameters. Each bootstrap sampling will thus use roughly as many computational resources as the original maximum-likelihood fit (e.g., same number of trials, same optimization parameters etc). |
Ntrials_per_bootstrap |
Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If |
max_condition_number |
Positive unitless number, specifying the maximum permissible condition number for the "G" matrix computed for the log-likelihood. A higher condition number leads to faster computation (roughly on a log-scale) especially for large trees, at the potential expense of lower accuracy. Typical values are 1e2-1e5. See Louca (2019) for further details on the condition number of the G matrix. |
relative_ODE_step |
Positive unitless number, specifying the default relative time step for the ordinary differential equation solvers. |
E_value_step |
Positive unitless number, specifying the relative difference between subsequent recorded and interpolated E-values, in the ODE solver for the extinction probabilities E (Louca 2019). Typical values are 1e-2 to 1e-5. A smaller |
D_temporal_resolution |
Positive unitless number, specifying the relative resolution for interpolating G-map over time (Louca 2019). This is relative to the typical time scales at which G-map varies. For example, a resolution of 10 means that within a typical time scale there will be 10 interpolation points. Typical values are 1-1000. A greater resolution increases interpolation accuracy, but also increases memory requirements and adds runtime (scaling with the tree's age span, not Ntips). |
max_model_runtime |
Numeric, optional maximum number of seconds for evaluating the likelihood of a model, prior to cancelling the calculation and returning Inf. This may be useful if extreme model parameters (e.g., reached transiently during fitting) require excessive calculation time. Parameters for which the calculation of the likelihood exceed this threshold, will be considered invalid and thus avoided during fitting. For example, for trees with 1000 tips a time limit of 10 seconds may be reasonable. If 0, no time limit is imposed. |
verbose |
Logical, specifying whether to print progress reports and warnings to the screen. In any case, fatal errors are always reported. |
diagnostics |
Logical, specifying whether to print detailed information (such as model likelihoods) at every iteration of the fitting routine. For debugging purposes mainly. |
verbose_prefix |
Character, specifying the line prefix for printing progress reports, warnings and errors to the screen. |
HiSSE/SecSSE models include two discrete traits, one trait that defines the rate categories of diversification rates (as in BiSSE/MuSSE), and one trait that does not itself influence diversification but whose states (here called "proxy states") each represent one or more of the diversity-modulating states. HiSSE models (Beaulieu and O'meara, 2016) and SecSSE models (van Els et al., 2018) are closely related to BiSSE/MuSSE models, the main difference being the fact that the actual diversification-modulating states are not directly observed. In essence, a HiSSE/SecSSE model is a BiSSE/MuSSE model, where the final tip states are replaced by their proxy states, thus "masking" the underlying diversity-modulating trait. This function is able to fit HiSSE/SecSSE models with appropriate choice of the input variables Nstates
, NPstates
and proxy_map
.
Note that the terminology and setup of HiSSE/SecSSE models followed here differs from their description in the original papers by Beaulieu and O'meara (2016) and van Els et al. (2018), in order to achieve what we think is a more intuitive unification of BiSSE/MuSSE/HiSSE/SecSSE. For ease of terminology, when considering a BiSSE/MuSSE model, here we use the terms "states" and "proxy-states" interchangeably, since under BiSSE/MuSSE the proxy trait can be considered identical to the diversification-modulating trait. A distinction between "states" and "proxy-states" is only relevant for HiSSE/SecSSE models.
As an example of a HiSSE model, Nstates=4
, NPstates=2
and proxy_map=c(1,2,1,2)
specifies that states 1 and 3 are represented by proxy-state 1, and states 2 and 4 are represented by proxy-state 2. This is the original case described by Beaulieu and O'Meara (2016); in their terminology, there would be 2 "hidden"" states ("0" and "1") and 2 "observed" states ("A" and "B"), and the 4 diversification rate categories (Nstates=4
) would be called "0A", "1A", "0B" and "1B". The somewhat different terminology used here allows for easier generalization to an arbitrary number of diversification-modulating states and an arbitrary number of proxy states. For example, if there are 6 diversification modulating states, represented by 3 proxy-states as 1->A, 2->A, 3->B, 4->C, 5->C, 6->C, then one would set Nstates=6
, NPstates=3
and proxy_map=c(1,1,2,3,3,3)
.
The run time of this function scales asymptotically linearly with tree size (Ntips), although run times can vary substantially depending on model parameters. As a rule of thumb, the higher the birth/death/transition rates are compared to the tree's overall time span, the slower the calculation becomes.
The following arguments control the tradeoff between accuracy and computational efficiency:
max_condition_number
: A smaller value means greater accuracy, at longer runtime and more memory.
relative_ODE_step
: A smaller value means greater accuracy, at longer runtime.
E_value_step
: A smaller value means greater accuracy, at longer runtime and more memory.
D_temporal_resolution
: A greater value means greater accuracy, at longer runtime and more memory.
Typically, the default values for these arguments should be fine. For smaller trees, where cladogenic and sampling stochasticity is the main source of uncertainty, these parameters can probably be made less stringent (i.e., leading to lower accuracy and faster computation), but then again for small trees computational efficiency may not be an issue anyway.
Note that the old option max_start_attempts
has been removed. Consider using Nscouts
instead.
A named list with the following elements:
success |
Logical, indicating whether the fitting was successful. If |
Nstates |
Integer, the number of states assumed for the model. |
NPstates |
Integer, the number of proxy states assumed for the model. Note that in the case of a BiSSE/MuSSE model, this will be the same as |
root_prior |
Character, or numeric vector of length Nstates, specifying the root prior used. |
parameters |
Named list containing the final maximum-likelihood fitted model parameters. If
|
start_parameters |
Named list containing the default start parameter values for the fitting. Structured similarly to |
loglikelihood |
Numeric, the maximized log-likelihood of the model, if fitting succeeded. |
AIC |
Numeric, the Akaike Information Criterion for the fitted model, defined as |
Niterations |
The number of iterations needed for the best fit. Only relevant if the optimization method was "optim" or "nlminb". |
Nevaluations |
Integer, the number of function evaluations needed for the best fit. Only relevant if the optimization method was "nlminb" or "subplex". |
converged |
Logical, indicating whether convergence was successful during fitting. If convergence was not achieved, and the fitting was stopped due to one of the stopping criteria |
warnings |
Character vector, listing any warnings encountered during evaluation of the likelihood function at the fitted parameter values. For example, this vector may contain warnings regarding the differential equation solvers or regarding the rank of the G-matrix (Louca, 2019). |
subroots |
Integer vector, listing indices of tips/nodes in the tree that were considered as starting points of independent MuSSE processes. If |
ML_subroot_states |
Integer vector, with values between 1 and Nstates, giving the maximum-likelihood estimate of each subroot's state. |
ML_substem_states |
Integer vector, with values between 1 and Nstates, giving the maximum-likelihood estimate of the state at each subroot's stem (i.e., exactly at |
trial_start_loglikelihoods |
Numeric vector of length |
trial_loglikelihoods |
Numeric vector of length |
trial_Niterations |
Integer vector of length |
trial_Nevaluations |
Integer vector of length |
standard_errors |
Named list containing the elements "transition_matrix" (numeric matrix of size Nstates x Nstates), "birth_rates" (numeric vector of size Nstates) and "death_rates" (numeric vector of size Nstates), listing standard errors of all model parameters estimated using parametric bootstrapping. Only included if |
CI50lower |
Named list containing the elements "transition_matrix" (numeric matrix of size Nstates x Nstates), "birth_rates" (numeric vector of size Nstates) and "death_rates" (numeric vector of size Nstates), listing the lower end of the 50% confidence interval (i.e. the 25% quantile) for each model parameter, estimated using parametric bootstrapping. Only included if |
CI50upper |
Similar to |
CI95lower |
Similar to |
CI95upper |
Similar to |
CI |
2D numeric matrix, listing maximum-likelihood estimates, standard errors and confidence intervals for all model parameters (one row per parameter, one column for ML-estimates, one column for standard errors, two columns per confidence interval). Standard errors and confidence intervals are as estimated using parametric bootstrapping. This matrix contains the same information as |
ancestral_likelihoods |
2D matrix of size Nnodes x Nstates, listing the computed state-likelihoods for each node in the tree. These may be used for "local" ancestral state reconstructions, based on the information contained in the subtree descending from each node. Note that for each node the ancestral likelihoods have been normalized for numerical reasons, however they should not be interpreted as actual probabilities. For each node n and state s, |
Stilianos Louca
W. P. Maddison, P. E. Midford, S. P. Otto (2007). Estimating a binary character's effect on speciation and extinction. Systematic Biology. 56:701-710.
R. G. FitzJohn, W. P. Maddison, S. P. Otto (2009). Estimating trait-dependent speciation and extinction rates from incompletely resolved phylogenies. Systematic Biology. 58:595-611
R. G. FitzJohn (2012). Diversitree: comparative phylogenetic analyses of diversification in R. Methods in Ecology and Evolution. 3:1084-1092
J. M. Beaulieu and B. C. O'Meara (2016). Detecting hidden diversification shifts in models of trait-dependent speciation and extinction. Systematic Biology. 65:583-601.
D. Kuehnert, T. Stadler, T. G. Vaughan, A. J. Drummond (2016). Phylodynamics with migration: A computational framework to quantify population structure from genomic data. Molecular Biology and Evolution. 33:2102-2116.
P. van Els, R. S. Etiene, L. Herrera-Alsina (2018). Detecting the dependence of diversification on multiple traits from phylogenetic trees and trait data. Systematic Biology. syy057.
S. Louca and M. W. Pennell (2020). A general and efficient algorithm for the likelihood of diversification and discrete-trait evolutionary models. Systematic Biology. 69:545-556. DOI:10.1093/sysbio/syz055
simulate_dsse
,
asr_mk_model
,
fit_tree_model
# EXAMPLE 1: BiSSE model
# - - - - - - - - - - - - - -
# Choose random BiSSE model parameters
Nstates = 2
Q = get_random_mk_transition_matrix(Nstates, rate_model="ARD", max_rate=0.1)
parameters = list(birth_rates = runif(Nstates,5,10),
death_rates = runif(Nstates,0,5),
transition_matrix = Q)
rarefaction = 0.5 # randomly omit half of the tips
# Simulate a tree under the BiSSE model
simulation = simulate_musse(Nstates,
parameters = parameters,
max_tips = 1000,
sampling_fractions = rarefaction)
tree = simulation$tree
tip_states = simulation$tip_states
## Not run:
# fit BiSSE model to tree & tip data
fit = fit_musse(tree,
Nstates = Nstates,
tip_pstates = tip_states,
sampling_fractions = rarefaction)
if(!fit$success){
cat(sprintf("ERROR: Fitting failed"))
}else{
# compare fitted birth rates to true values
errors = (fit$parameters$birth_rates - parameters$birth_rates)
relative_errors = errors/parameters$birth_rates
cat(sprintf("BiSSE relative birth-rate errors:\n"))
print(relative_errors)
}
## End(Not run)
# EXAMPLE 2: HiSSE model, with bootstrapping
# - - - - - - - - - - - - - -
# Choose random HiSSE model parameters
Nstates = 4
NPstates = 2
Q = get_random_mk_transition_matrix(Nstates, rate_model="ARD", max_rate=0.1)
rarefaction = 0.5 # randomly omit half of the tips
parameters = list(birth_rates = runif(Nstates,5,10),
death_rates = runif(Nstates,0,5),
transition_matrix = Q)
# reveal the state of 30% & 60% of tips (in state 1 & 2, respectively)
reveal_fractions = c(0.3,0.6)
# use proxy map corresponding to Beaulieu and O'Meara (2016)
proxy_map = c(1,2,1,2)
# Simulate a tree under the HiSSE model
simulation = simulate_musse(Nstates,
NPstates = NPstates,
proxy_map = proxy_map,
parameters = parameters,
max_tips = 1000,
sampling_fractions = rarefaction,
reveal_fractions = reveal_fractions)
tree = simulation$tree
tip_states = simulation$tip_proxy_states
## Not run:
# fit HiSSE model to tree & tip data
# run multiple trials to ensure global optimum
# also estimate confidence intervals via bootstrapping
fit = fit_musse(tree,
Nstates = Nstates,
NPstates = NPstates,
proxy_map = proxy_map,
tip_pstates = tip_states,
sampling_fractions = rarefaction,
reveal_fractions = reveal_fractions,
Ntrials = 5,
Nbootstraps = 10,
max_model_runtime = 0.1)
if(!fit$success){
cat(sprintf("ERROR: Fitting failed"))
}else{
# compare fitted birth rates to true values
errors = (fit$parameters$birth_rates - parameters$birth_rates)
relative_errors = errors/parameters$birth_rates
cat(sprintf("HiSSE relative birth-rate errors:\n"))
print(relative_errors)
# print 95%-confidence interval for first birth rate
cat(sprintf("CI95 for lambda1: %g-%g",
fit$CI95lower$birth_rates[1],
fit$CI95upper$birth_rates[1]))
}
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.