Contributors:
David Duchene david.duchene[at]sydney.edu.au
Sebastian Duchene sebastian.duchene[at]unimelb.edu.au
Luiz Carvalho
If you use this package for your research, please cite:
Ho, S. Y. W., Duchêne, S., Duchêne, D. (2015) Simulating and detecting autocorrelation of molecular evolutionary rates among lineages. 15(4) 689-696.
To Edit: Models for molecular rate variation play a key role in molecular phylogenetics. Due to their importance in evolutionary biology, there is a wide variety of models, which can be classified into five broad categories:
Strict clock, where a single rate is assumed for all the branches of a phylogeny, as described by Zuckerkandl, E. and Pauling, L. (1962).
Local clock, in which there is a fixed number of strict clocks in a phylogenetic tree, so that the rate is constant for some lineages (Hasegawa et al. 1989, Yoder and Yang. 2000).
Autocorrelated models, where rates along a lineage evolve gradually, and therefore display a degree of correlation, such as that described by Thorne et al. (1998).
Uncorrelated models, where the rates for each of the branches are independently and identically distributed variables, drawn from a specified probability function, sucha as the uncorrelated lognormal clock (Drummond et al. 2006).
Unconstrained models, where the rates for all branches are independent and drawn from different distributions (Lepage et al 2007).
With the increasing development of clock models, it is necessary to assess their performance with simulations and analyses of empirical data. We present NELSI, an R package to simulate rates of evolution along phylogenetic trees. The principle is similar to the program RateEvolver (Ho et al. 2005), but it allows more flexibility and it can be easily combined with popular programs to simulate phylogenetic trees. In the current version we have implemented some of the most popular methods, but the pacakage is under constant development and we will include more models, as they become available. For requests and reporting bugs please contact sebastian.duchene[at]sydney.edu.au.
NELSI is implemented in the R programming language, and it is available as a package in github. It is compatible with some popular phylogenetic packages in R, such as Ape (Popescu et al. 2012) and phangorn (Schliep 2011) making it accessible to users familiar with phylogenetic data in R. The main functions use phylogenetic trees of class phylo, with branch lengths representing units of time. Trees estimated in other programs can be imported with ape in NEWICK or NEXUS format. Some R packages that simulate phylogenetic trees, such as geiger and TreeSim, also produce trees of class phylo, which can be used directly with NELSI. An important requirement of simulating rates of evolution along phylogenetic trees is that the trees should correspond to chronograms, with branch lengths in units of time.
NELSI requires a recent version of R (>=2.5). If R is not installed in your machine, download and install the appropriate version here.
R packages can be downloaded and installed directly from github with the package devtools. This is the easiest way to install NELSI (and many other packages).
We will begin by installing devtools from the Comprehensive R Archive Network (CRAN). Please follow the instructions bellow:
Open the R console by clicking on the R icon in your desktop or in Applications (depending on the operating system)
Make sure that you have an internet connection and type in the code bellow:
install.packages("devtools")
Follow the instructions in the prompt.
library(devtools)
install_github("sebastianduchene/NELSI")
library(NELSI)
This is all for the installation of NELSI. Please contact the authors to report any bugs.
In the next sections of this tutorial we show an overview of some of the functions available. For a more comprehensive list, please see the manual by typing the following code in the R console:
help(package = NELSI)
Some examples used to generate the data for the study by Ho. et al (In Prep) are available here.
To simulate rates of evolution we need a phylogenetic tree in which the branch lengths represent units of time, known as a chronogram. We can simulate this kind of tree in R, but for this tutorial we will load the tree in the example_data folder in github.
myTree <- read.tree("tr_example.tree")
plot(myTree)
node.ages <- round(branching.times(myTree), 2)
nodelabels(node.ages, bg = "white")
The simplest rate simulation model in NELSI is a strict clock, where every branch is given the same rate, with a user-specified noise level. To simulate rates under this model for our chronogram we use the function simulate.clock, which receives as arguments the chronogram, and two parameters: the mean rate and the amount of noise.
clock.sim <- simulate.clock(myTree, params = list(rate = 0.03, noise = 0))
The variable clock.sim is an object of class ratesim, which is the output of all the rate simulation functions in NELSI. ratesim objects have two elements. The first is a phylogram (our input topology but with branch lengths in terms of substitutions). The second element in an object of class ratesim is a tree.data.matrix, which is a matrix with all the data about a phylogeny, includng the simulated data. The columns of a tree.data.matrix are the following: (1) is the index of each branch; (2) and (3) are the edge attribute of the class phylo, showing the parent and daughter nodes for each branchrespectively; (4) is the mid age of each branch; (5) is the simulated molecular rate for every branch; (6) is the branch lengths in substitutions per site; and (7) is branch lengths in time units.
clock.sim$tree.data.matrix
# branch.index parent.node daughter.node branch.midage branch.rate
# [1,] 1 11 12 0.325090501 0.03
# [2,] 2 12 13 0.080306199 0.03
# [3,] 3 13 1 0.009929729 0.03
# [4,] 4 13 2 0.009929729 0.03
# [5,] 5 12 3 0.070376470 0.03
# [6,] 6 11 14 0.440591467 0.03
plot(clock.sim, col.lineages = rainbow(20), type = "s")
One way to relax the assumption of having a single rate throughout is to propose small changes in rate from one branch to the next. The functions simulate.autocor.kishino and simulate.autocor.thorne use the method described in Kishino et al.(2001) and Thorne et al.(1998) methods to simulate this kind of rate pattern. In both functions the user only needs to provide the rate at the root of the phylogeny and the amount of autocorrelation, given by the parameter v.
sim.low.autocor <- simulate.autocor.kishino(myTree, params = list(initial.rate = 0.01,
v = 0.1))
sim.high.autocor <- simulate.autocor.kishino(myTree, params = list(initial.rate = 0.01,
v = 0.006))
plot(sim.low.autocor, col.lineages = rainbow(20), type = "s")
plot(sim.high.autocor, col.lineages = rainbow(20), type = "s")
To simulate rates that are uncorrelated among branches, but are independently and identically drawn from a parent distribution, we have implemented three different models for rate simulation. Each function requires different input parameters, as described in Drummond et al. (2006).
sim.uncor <- simulate.uncor.lnorm(myTree, params = list(mean.log = -3.9, sd.log = 0.7))
plot(sim.uncor, col.lineages = rainbow(20), type = "s")
There are other methods for rate simulation in NELSI, but this tutorial covers the most well-known models. Please refer to the package doccumentation and help files for a full list of functions.
We can use the package phangorn to simulate evolvution of nucleotide or amino-acid sequence alignments along a phylogram (the first element of the ratesim object), and save it in an external file in a any format, such as FASTA, for future use.
sim.dna.data <- simSeq(sim.uncor[[1]], l = 2000, type = "DNA")
write.phyDat(sim.dna.data, file = "nelsi_tutorial_dna.fasta", format = "fasta")
Note that the function simSeq can simulate under different models of nucleotide substitution. Use ?simSeq to see details.
write.tree(sim.uncor[[1]], file = "nelsi_tutorial_pylogram.tree")
Phylogenetic trees in NEXUS format can have a large number of annotations, with information about rates, times, or other traits for every branch in the tree. These annotations can be read in R with the function read.annotated.nexus, from pacakge epibase. Then the annotations cat be imported into a tree data matrix to be used with NELSI. Note that epibase is no longer supported, but we have made this function available here.
The examples_data folder contains a tree from ENV sequences from HIV-1 sub-type A collected between 1991 and 2008.
hivTree <- read.annotated.nexus("hiv_A_env.tree")
plot(hivTree)
axisPhylo()
The tree is a chronogram, so that the branch lengths represent units of time. The ages of the nodes can be obtained with the function branching.times, but this only works for ultrametric trees, which is not the case for these data because the samples were obtained at different points in time (heterochronous). The function allnode.times in NELSI can obtain the ages of the nodes and tips for heterochronous trees. The first items of the function are the ages of the tips, and the remaining are the ages of internal nodes.
plot(hivTree, show.tip.label = F)
tip.ages <- round(allnode.times(hivTree), 2) # Round to two decimal places for a clearer plot
# See the tip ages. The first 19 elements are the ages of the tips (the tree
# has 19 tips), while the remaining are the ages of internal nodes
tiplabels(tip.ages[1:19])
nodelabels(tip.ages[20:37])
axisPhylo()
The age of the youngest tip is always assigned an age of 0. The age of the root is calculated with reference to the youngest tip.
We will use the tree in 7. to obtain the tree data matrix and plot the rates through time.
hivDataMatrix <- as.data.frame(trann2trdat(hivTree))
head(hivDataMatrix)
## branch parent daughter midage rate blensubs blentime
## 1 1 20 21 154.55 0.001301 0.02622 20.159
## 2 2 21 22 99.35 0.001314 0.12836 97.721
## 3 3 22 1 32.00 0.001286 0.05929 46.113
## 4 4 22 23 49.50 0.001274 0.01112 8.730
## 5 5 23 2 31.84 0.001332 0.03859 28.980
## 6 6 23 24 41.23 0.001291 0.01154 8.936
For heterochronous data one can test the molecular clock by conducting a regression of the number of substitutions from the root to the tips vs. the time from the root to the tip, like in the program TempEst (Rambaut et al. 2016). With the help of a few functions from NELSI, we can conduct these analyses in R.
tipsTimes <- allnode.times(hivTree, tipsonly = T)
hivPhylogram <- hivTree
hivPhylogram$edge.length <- hivDataMatrix$blensubs
tipsSubstitutions <- allnode.times(hivPhylogram, tipsonly = T)
plot(tipsTimes, tipsSubstitutions, pch = 20, ylab = "Substitutions from the root to tips (substitutions)",
xlab = "Time from the root to the tip (years)")
hivRegression <- lm(tipsSubstitutions ~ tipsTimes)
abline(hivRegression)
summary(hivRegression)
##
## Call:
## lm(formula = tipsSubstitutions ~ tipsTimes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.007700 -0.000927 0.000179 0.002335 0.006609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000912 0.002026 0.45 0.66
## tipsTimes 0.001233 0.000189 6.52 5.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00386 on 17 degrees of freedom
## Multiple R-squared: 0.714, Adjusted R-squared: 0.697
## F-statistic: 42.5 on 1 and 17 DF, p-value: 5.25e-06
In this case it the data appear to have clock-like behaviour.
Drummond, A. J., Ho, S. Y., Phillips, M. J., & Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLOS Biology, 4(5), e88.
Hasegawa, M., Kishino, H., & Yano, T. A. (1989). Estimation of branching dates among primates by molecular clocks of nuclear DNA which slowed down in Hominoidea. Journal of Human Evolution, 18(5), 461-476.
Heath, T. A., Holder, M. T., & Huelsenbeck, J. P. (2012). A dirichlet process prior for estimating lineage-specific substitution rates. Molecular Biology and Evolution, 29(3), 939-955.
Ho, S. Y., Phillips, M. J., Drummond, A. J., & Cooper, A. (2005). Accuracy of rate estimation using relaxed-clock models with a critical focus on the early metazoan radiation. Molecular Biology and Evolution, 22(5), 1355-1363.
Ho, S. Y., Duchêne, S., Duchêne, D. (2015) Simulating and detecting autocorrelation of molecular evolutionary rates among lineages. 15(4) 689-696.
Kishino, H., Thorne, J. L., & Bruno, W. J. (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution, 18(3), 352-361.
Lepage, T., Bryant, D., Philippe, H., & Lartillot, N. (2007). A general comparison of relaxed molecular clock models. Molecular Biology and Evolution, 24(12), 2669-2680.
Popescu, A. A., Huber, K. T., & Paradis, E. (2012). Ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics, 28(11), 1536-1537.
Rambaut, A., Lam, T., Carvalho, L., Pybus, O. (2016) Exploring the temporal structure of heterochronous sequences using TempEst. Virus Evolution 2: vew007
Saulnier, E., Gascuel, O., Alizon, S. (2016) Assessing the accuracy of Approximate Bayesian Computation approaches to infer epidemiological parameters from phylogenies. bioRxiv doi: http://dx.doi.org/10.1101/050211.
Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592-593.
Thorne, J.L., Kishino, H., and Painter, I.S., Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15.12 (1998): 1647-1657.
Yoder, A. D., & Yang, Z. (2000). Estimation of primate speciation dates using local molecular clocks. Molecular Biology and Evolution, 17(7), 1081-1090.
Zuckerkandl,E. and Pauling,L. (1962) Molecular disease, evolution, and genic heterogeneity. In: Kasha,M. and Pullman,B. (eds) Horizons in Biochemistry. Academic Press, New York, pp. 189–225.
New Features 2 June 2022 - Functions to simulate fixed local clocks. - TO DO: Document and add example
Lineage functions to obtain all monophyletic groups that match a regular expression (find.monophyletic).
Simulation tools for fixed local clocks (simulate.FLC).
New function get.ltt.summary.R calculates some statistics for the lineages through time plots of non-ultrametric trees. Some of these statistics were first used by Saulnier et al. 2016.
New function dist.topo.normalised estimates PH85 tree topology distance divided by the maximum distance between two trees of the same size and structure. This is done by randomising the tip labels for one of the trees to generate a null distribution of the PH85 tree topology distance.
New functions get.ancestor.nodes.branches and get.descending.nodes.branches to manipulate rooted phylogenetic trees. Help for these funcitons pending.
New function to simulate rates under the dirichlet process, as described by Heath et al. 2012.
New function to obtain node times for internal nodes only. Can be used for ultrametric and non-ultrametric trees.
Experimental function to get all the tips that descend from a node. Similar to function 'tips' from GEIGER.
6 November 2014
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.