| prepare_trait_data | R Documentation |
Map trait evolution on a time-calibrated phylogeny in several steps:
Step 1: Fit evolutionary models to trait data using Maximum Likelihood.
Step 2: Select the best fitting model comparing AICc.
Step 3: Infer ancestral characters estimates (ACE) at nodes.
Step 4: Run stochastic mapping simulations to generate evolutionary histories compatible with the best model and inferred ACE. (Only for categorical and biogeographic data)
Step 5: Infer ancestral states along branches.
For continuous traits: use interpolation to produce a contMap.
For categorical and biogeographic data: compute posterior frequencies of each state/range
to produce a densityMap for each state/range.
prepare_trait_data(
tip_data,
trait_data_type,
phylo,
seed = NULL,
evolutionary_models = NULL,
Q_matrix = NULL,
BioGeoBEARS_directory_path = NULL,
keep_BioGeoBEARS_files = TRUE,
prefix_for_files = NULL,
nb_cores = 1,
max_range_size = 2,
split_multi_area_ranges = FALSE,
...,
res = 100,
nb_simulations = 1000,
color_scale = NULL,
colors_per_levels = NULL,
plot_map = TRUE,
plot_overlay = TRUE,
add_ACE_pies = TRUE,
PDF_file_path = NULL,
return_ace = TRUE,
return_BSM = FALSE,
return_simmaps = FALSE,
return_best_model_fit = FALSE,
return_model_selection_df = FALSE,
verbose = TRUE
)
tip_data |
Named numerical or character string vector of trait values/states/ranges at tips.
Names should be ordered as the tip labels in the phylogeny found in |
trait_data_type |
Character string. Type of trait data. Either: "continuous", "categorical" or "biogeographic". |
phylo |
Time-calibrated phylogeny. Object of class |
seed |
Integer. Set the seed to ensure reproducibility. Default is |
evolutionary_models |
(Vector of) character string(s). To provide the set of evolutionary models to fit on the data.
|
Q_matrix |
Custom Q-matrix for categorical data representing transition classes between states.
Transitions with similar integers are estimated with a shared rate parameter.
Transitions with |
BioGeoBEARS_directory_path |
Character string. The path to the directory used to store input/output files generated for/by BioGeoBEARS during biogeographic historical inferences. Use '/' to separate directory and sub-directories. It must end with '/'. Only for biogeographic data. |
keep_BioGeoBEARS_files |
Logical. Whether the |
prefix_for_files |
Character string. Prefix to add to all BioGeoBEARS files stored in the |
nb_cores |
Interger. Number of cores to use for parallel computation during BioGeoBEARS runs. Default = |
max_range_size |
Integer. Maximum number of unique areas encompassed by multi-range areas. Default = |
split_multi_area_ranges |
Logical. Whether to split multi-area ranges across unique areas when mapping ranges.
Ex: For range EW, posterior probabilities will be split equally between Eastern Palearctic (E) and Western Palearctic (W).
Default = |
... |
Additional arguments to be passed down to the functions used to fit models (See |
res |
Integer. Define the number of time steps used to interpolate/estimate trait value/state/range in |
nb_simulations |
Integer. Define the number of simulations generated for stochastic mapping. Default = |
color_scale |
Vector of character string. List of colors to use to build the color scale with |
colors_per_levels |
Named character string. To set the colors to use to map each state/range posterior probabilities. Names = states/ranges; values = colors.
If |
plot_map |
Logical. Whether to plot or not the phylogeny with mapped trait evolution. Default = |
plot_overlay |
Logical. If |
add_ACE_pies |
Logical. Whether to add pies of posterior probabilities of states/ranges at internal nodes on the mapped phylogeny. Default = |
PDF_file_path |
Character string. If provided, the plot will be saved in a PDF file following the path provided here. The path must end with ".pdf". |
return_ace |
Logical. Whether the named vector of ancestral characters estimates (ACE) at internal nodes should be returned in the output. Default = |
return_BSM |
Logical. (Only for Biogeographic data) Whether the summary tables of anagenetic and cladogenetic events generated during the Biogeographic Stochastic Mapping (BSM)
process should be returned in the output. Default = |
return_simmaps |
Logical. Whether the evolutionary histories simulated during stochastic mapping (i.e., |
return_best_model_fit |
Logical. Whether to include the output of the best fitting model in the function output. Default = |
return_model_selection_df |
Logical. Whether to include the data.frame summarizing model comparisons used to select the best fitting model should be returned in the output. Default = |
verbose |
Logical. Should progression be displayed? A message will be printed for every steps in the process. Default is |
Map trait evolution on a time-calibrated phylogeny in several steps:
Step 1: Models are fit using Maximum Likelihood approach:
For "continuous" data models are fit with geiger::fitContinuous(): "BM", "OU", "EB", "rate_trend", "lambda", "kappa", "delta". Default is "BM".
For "categorical" data models are fit with geiger::fitDiscrete(): "ER", "SYM", "ARD", "meristic", "matrix". Default is "ARD".
For "biogeographic" data models are fit with R package BioGeoBEARS: "BAYAREALIKE", "DIVALIKE", "DEC", "BAYAREALIKE+J", "DIVALIKE+J", "DEC+J". Default is "DEC".
Step 2: Best model is identified among the list of evolutionary_models by comparing the corrected AIC (AICc)
and selecting the model with lowest AICc.
Step 3: For continuous traits: Ancestral characters estimates (ACE) are inferred with phytools::fastAnc on a tree
with modified branch lengths scaled to reflect the evolutionary rates estimated from the best model using phytools::rescale().
Step 4: Stochastic Mapping.
For categorical and biogeographic data, stochastic mapping simulations are performed to generate evolutionary histories compatible with the best model and inferred ACE. Node states/ranges are drawn from the scaled marginal likelihoods of ACE, and states/ranges shifts along branches are simulated according to the transition matrix Q estimated from the best fitting model.
Step 5: Infer ancestral states along branches.
For continuous traits: ancestral trait values along branches are interpolated with phytools::contMap().
This provides quick estimates of trait value at any point in time, but it does not provide accurate ML estimates in
case of models that are time or trait-value dependent (such as "EB" or "OU") as the interpolation used to built the contMap is assuming
a constant rate along each branch. However, ancestral trait values at nodes remain accurate
For categorical and biogeographic data: compute posterior frequencies of each state/range among the simulated evolutionary histories (simmaps)
to produce a densityMap for each state/range that reflects the changes along branches in probability of harboring a given state/range.
The function returns a list with at least two elements.
$contMap (For "continuous" data) Object of class "contMap", typically generated with phytools::contMap(),
that contains a phylogenetic tree and associated continuous trait mapping.
$densityMaps (For "categorical" and "biogeographic" data) List of objects of class "densityMap,
typically generated with phytools::densityMap(), that contains a phylogenetic tree and associated mapping of probability
to harbor a given state/range along branches. The list contains one "densityMap per state/range found in the tip_data.
$densityMaps_all_ranges (For "biogeographic" data only, if split_multi_area_ranges = TRUE) Same as $densityMaps,
but for all ranges including the multi-areas ranges (e.g., AB) while $densityMaps will display posterior probabilities for
unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.
$trait_data_type Character string. Record the type of trait data. Either: "continuous", "categorical" or "biogeographic".
If return_ace = TRUE,
$ace For continuous traits: Named vector that record the ancestral characters estimates (ACE) at internal nodes.
For categorical and biogeographic data: Matrix that record the posterior probabilities of ancestral states/ranges (characters) estimates (ACE) at internal nodes.
Rows are internal nodes. Columns are states/ranges. Values are posterior probabilities of each state per node.
$ace_all_ranges For biogeographic data, if split_multi_area_ranges = TRUE: Named vector that record the ancestral characters estimates (ACE) at internal nodes,
but including all ranges observed in the simmaps, including the multi-areas ranges (e.g., AB), while $ace will display posterior probabilities
for unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.
If return_BSM = TRUE, (Only for biogeographic data)
$BSM_output List of two lists that contains summary information of cladogenetic ($RES_caldo_events_tables) and anagenetic ($RES_ana_events_tables) events
recording across the N simulations of biogeographic histories performed during Biogeographic Stochastic Mapping (BSM).
Each element of the list is a data.frame recording events occurring during one simulation.
If return_simmaps = TRUE, (Only for categorical and biogeographic data)
$simmaps List that contains as many objects of class "simmap" that nb_simulations were requested.
Each simmap object is a phylogeny with one simulated geographic history (i.e., transitions in geographic ranges) mapped along branches.
If return_best_model_fit = TRUE,
$best_model_fit List that provides the output of the best fitting model.
If model_selection_df = TRUE,
$model_selection_df Data.frame that summarizes model comparisons used to select the best fitting model.
For biogeographic data, the function also produce input and output files associated with BioGeoBEARS and stored in the directory
specified in BioGeoBEARS_directory_path. The directory and its content are kept if keep_BioGeoBEARS_files = TRUE
This function provides an easy solution to map trait evolution on a time-calibrated phylogeny
and obtain the contMap/densityMaps objects needed to run the deepSTRAPP workflow (run_deepSTRAPP_for_focal_time, run_deepSTRAPP_over_time).
However, it does not explore the most complex options for trait evolution. You may need to explore more complex models to capture the dynamics of trait evolution.
such as trait-dependent multi-rate models (phytools::brownie.lite(), OUwie::OUwie), Bayesian MCMC implementations allowing a thorough exploration
of location and number of regime shifts (Ex: BayesTraits, RevBayes), or RRphylo for a penalized phylogenetic ridge regression approach that allows regime shifts across all branches.
This function provides an easy solution to infer ancestral geographic ranges using the BioGeoBEARS framework. It allows to directly compare model fits across 6 models: DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J. It uses a step-wise approach by using MLE estimates of previous runs as staring parameter values when increasing complexity (adding +J parameters). However, it does not explore the most complex options for historical biogeography. You may need to explore more complex models to capture the dynamics of range evolution such as time-stratification with adjacency matrices, dispersal multipliers (+W), distance-based dispersal probabilities (+X), or other features. See for instance, http://phylo.wikidot.com/biogeobears).
The R package BioGeoBEARS is needed for this function to work with biogeographic data.
Please install it manually from: https://github.com/nmatzke/BioGeoBEARS.
Maël Doré
For macroevolutionary models in geiger: Pennell, M. W., Eastman, J. M., Slater, G. J., Brown, J. W., Uyeda, J. C., FitzJohn, R. G., ... & Harmon, L. J. (2014). geiger v2. 0: an expanded suite of methods for fitting macroevolutionary models to phylogenetic trees. Bioinformatics, 30(15), 2216-2218. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1093/bioinformatics/btu181")}.
For BioGeoBEARS: Matzke, Nicholas J. (2018). BioGeoBEARS: BioGeography with Bayesian (and likelihood) Evolutionary Analysis with R Scripts. version 1.1.1, published on GitHub on November 6, 2018. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.5281/zenodo.1478250")}. Website: http://phylo.wikidot.com/biogeobears.
geiger::fitContinuous() geiger::fitDiscrete() phytools::contMap() phytools::densityMap()
For a guided tutorial, see the associated vignettes:
For continuous trait data: vignette("model_continuous_trait_evolution", package = "deepSTRAPP")
For categorical trait data: vignette("model_categorical_trait_evolution", package = "deepSTRAPP")
For biogeographic range data: vignette("model_biogeographic_range_evolution", package = "deepSTRAPP")
# ----- Example 1: Continuous data ----- #
## Load phylogeny and tip data
# Load eel phylogeny and tip data from the R package phytools
# Source: Collar et al., 2014; DOI: 10.1038/ncomms6505
data("eel.tree", package = "phytools")
data("eel.data", package = "phytools")
# Extract body size
eel_data <- stats::setNames(eel.data$Max_TL_cm,
rownames(eel.data))
# (May take several minutes to run)
## Map trait evolution on the phylogeny
mapped_cont_traits <- prepare_trait_data(
tip_data = eel_data,
trait_data_type = "continuous",
phylo = eel.tree,
evolutionary_models = c("BM", "OU", "lambda", "kappa"),
# Example of an additional argument ('control') that can be provided to geiger::fitContinuous()
control = list(niter = 200),
color_scale = c("darkgreen", "limegreen", "orange", "red"),
plot_map = FALSE,
return_best_model_fit = TRUE,
return_model_selection_df = TRUE,
verbose = TRUE)
## Explore output
plot_contMap(mapped_cont_traits$contMap) # contMap with interpolated trait values
mapped_cont_traits$model_selection_df # Summary of model selection
# Parameter estimates and optimization summary of the best model
# (Here, the best model is Pagel's lambda)
mapped_cont_traits$best_model_fit$opt
mapped_cont_traits$ace # Ancestral character estimates at internal nodes
# ----- Example 2: Categorical data ----- #
## Load phylogeny and tip data
# Load eel phylogeny and tip data from the R package phytools
# Source: Collar et al., 2014; DOI: 10.1038/ncomms6505
data("eel.tree", package = "phytools")
data("eel.data", package = "phytools")
# Transform feeding mode data into a 3-level factor
eel_data <- stats::setNames(eel.data$feed_mode, rownames(eel.data))
eel_data <- as.character(eel_data)
eel_data[c(1, 5, 6, 7, 10, 11, 15, 16, 17, 24, 25, 28, 30, 51, 52, 53, 55, 58, 60)] <- "kiss"
eel_data <- stats::setNames(eel_data, rownames(eel.data))
table(eel_data)
# Manually define a Q_matrix for rate classes of state transition to use in the 'matrix' model
# Does not allow transitions from state 1 ("bite") to state 2 ("kiss") or state 3 ("suction")
# Does not allow transitions from state 3 ("suction") to state 1 ("bite")
# Set symmetrical rates between state 2 ("kiss") and state 3 ("suction")
Q_matrix = rbind(c(NA, 0, 0), c(1, NA, 2), c(0, 2, NA))
# Set colors per states
colors_per_states <- c("limegreen", "orange", "dodgerblue")
names(colors_per_states) <- c("bite", "kiss", "suction")
# (May take several minutes to run)
## Run evolutionary models
eel_cat_3lvl_data <- prepare_trait_data(tip_data = eel_data, phylo = eel.tree,
trait_data_type = "categorical",
colors_per_levels = colors_per_states,
evolutionary_models = c("ER", "SYM", "ARD", "meristic", "matrix"),
Q_matrix = Q_matrix,
nb_simulations = 1000,
plot_map = TRUE,
plot_overlay = TRUE,
return_best_model_fit = TRUE,
return_model_selection_df = TRUE)
# Load directly output
data(eel_cat_3lvl_data, package = "deepSTRAPP")
## Explore output
plot(eel_cat_3lvl_data$densityMaps[[1]]) # densityMap for state n°1 ("bite")
eel_cat_3lvl_data$model_selection_df # Summary of model selection
# Parameter estimates and optimization summary of the best model
# (Here, the best model is ER)
print(eel_cat_3lvl_data$best_model_fit)$ # Summary of the best evolutionary model
eel_cat_3lvl_data$ace # Posterior probabilities of each state (= ACE) at internal nodes
# ----- Example 3: Biogeographic data ----- #
if (deepSTRAPP::is_dev_version())
{
## The R package 'BioGeoBEARS' is needed for this function to work with biogeographic data.
# Please install it manually from: https://github.com/nmatzke/BioGeoBEARS.
## Load phylogeny and tip data
# Load eel phylogeny and tip data from the R package phytools
# Source: Collar et al., 2014; DOI: 10.1038/ncomms6505
data("eel.tree", package = "phytools")
data("eel.data", package = "phytools")
# Transform feeding mode data into biogeographic data with ranges A, B, and AB.
eel_data <- stats::setNames(eel.data$feed_mode, rownames(eel.data))
eel_data <- as.character(eel_data)
eel_data[eel_data == "bite"] <- "A"
eel_data[eel_data == "suction"] <- "B"
eel_data[c(5, 6, 7, 15, 25, 32, 33, 34, 50, 52, 57, 58, 59)] <- "AB"
eel_data <- stats::setNames(eel_data, rownames(eel.data))
table(eel_data)
colors_per_ranges <- c("dodgerblue3", "gold")
names(colors_per_ranges) <- c("A", "B")
# (May take several minutes to run)
## Run evolutionary models
eel_biogeo_data <- prepare_trait_data(
tip_data = eel_data,
trait_data_type = "biogeographic",
phylo = eel.tree,
# Default = "DEC" for biogeographic
evolutionary_models = c("BAYAREALIKE", "DIVALIKE", "DEC",
"BAYAREALIKE+J", "DIVALIKE+J", "DEC+J"),
BioGeoBEARS_directory_path = tempdir(), # Ex: "./BioGeoBEARS_directory/"
keep_BioGeoBEARS_files = FALSE,
prefix_for_files = "eel",
max_range_size = 2,
split_multi_area_ranges = TRUE, # Set to TRUE to display the two outputs
# Reduce the number of Stochastic Mapping simulations to save time (Default = '1000')
nb_simulations = 100,
colors_per_levels = colors_per_ranges,
return_simmaps = TRUE,
return_best_model_fit = TRUE,
return_model_selection_df = TRUE,
verbose = TRUE)
# Load directly output
data(eel_biogeo_data, package = "deepSTRAPP")
## Explore output
str(eel_biogeo_data, 1)
eel_biogeo_data$model_selection_df # Summary of model selection
# Parameter estimates and optimization summary of the best model
# (Here, the best model is DEC+J)
eel_biogeo_data$best_model_fit$optim_result
# Posterior probabilities of each state (= ACE) at internal nodes
eel_biogeo_data$ace # Only with unique areas
eel_biogeo_data$ace_all_ranges # Including multi-area ranges (Here, AB)
## Plot densityMaps
# densityMap for range n°1 ("A")
plot(eel_biogeo_data$densityMaps[[1]])
# densityMaps with all unique areas overlaid
plot_densityMaps_overlay(eel_biogeo_data$densityMaps)
# densityMaps with all ranges (including multi-area ranges) overlaid
plot_densityMaps_overlay(eel_biogeo_data$densityMaps_all_ranges)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.