experimental/README.md

NELSI: Nucleotide EvoLutionary Rate Simulator

David Duchene and Sebastian Duchene

david.duchene[at]anu.edu.au

sebastian.duchene[at]sydney.edu.au

X April 2014

Introduction

Models for molecular rate variation play a key role in molecular phylogenetics. Due to their importance in evolutionary biology, several of these models have been proposed, falling into three broad categories: the strict clock, where a single rate is assumed for all the branches of a phylogeny; autocorrelated models, where rates along a lineage have a degree of correlation; and uncorrelated models, where the rate at each branch is independently and identically drawn from a specified distribution. In addition, models have been used for statistical purposes, like the white noise model (Lepage et al. 2007), or for correction of patterns in molecular sequences that might cause bias to molecular rate estimates, like the selection-correction model (ADD CITATION).

With the increasing development of clock models, it is necessary to assess their performance with simulations and analyses of empirical data. In this note, we present a package that builds on existing phylogenetic applications to simulate molecular rates in phylogenies using a broad set of existing models for molecular evolution and the graphical flexibility of R.

Description

NELSI is implemented in the R programming language, and it is available as a package in github. It is compatible with some popular phylogenetic packages in R, such as ape (Paradis et al. 2004) and phangorn (Schliep 2011) making it accessible to users familiar with phylogenetic data in R. The main functions use phylogenetic trees of class phylo, with branch lengths representing units of time. Trees estimated in other programs can be imported with ape in NEWICK or NEXUS format. Some R packages that simulate phylogenetic trees, such as geiger and TreeSim, also produce trees of class phylo, which can be used directly with NELSI. An important aspect of simulating rates of evolution along phylogenetic trees is that the trees should correspond to chronograms, with branch lengths in units of time.

Tutorial

1. Installation and setup

NELSI requires a recent version of R (>=2.5). If R is not installed in your machine, download and install the appropriate version here.

R packages can be downloaded and installed directly from github with the package devtools. This is the easiest way to install NELSI (and many other packages).

We will begin by installing devtools from the Comprehensive R Archive Network (CRAN). Please follow the instructions bellow:

install.packages("devtools")

Follow the instructions in the prompt.

library(devtools)
install_github(rep = "NELSI", username = "sebastianduchene")
library(NELSI)

This is all for the installation of NELSI. Please contact the authors to report any bugs.

In the next sections of this tutorial we show an overview of some of the functions available. For a more comprehensive list, please see the manual by typing the following code in the R console:

help(package = NELSI)

2. Loading phylogenetic trees

To simulate rates of evolution we need a phylogenetic tree in which the branch lengths repesetn units of time, known as a chronogram. We can simulate this kind of tree in R, but for this tutorial we will load the tree in the example_data folder in github.

myTree <- read.tree("tr_example.tree")
plot(myTree)
node.ages <- round(branching.times(myTree), 2)
nodelabels(node.ages, bg = "white")

plot of chunk unnamed-chunk-3

3. Simulate constant rates through time

The simplest rate simulation model in NELSI is a strict clock, where every branch is given the same rate, with a user-specified noise level. To simulate rates under this model for our chronogram we use the function simulate.cock, which receives as arguments the chronogram, and two parameters: the mean rate and the amount of noise.

clock.sim <- simulate.clock(myTree, params = list(rate = 0.03, noise = 1e-04))
## Loading required package: phangorn

The output is an object of class ratesim, which is the output of all the rate simulation functions in NELSI. ratesim objects have two elements. The first is a phylogram (our input topology but with branch lengths in terms of substitutions). The second element in an object of class ratesim is a tree.data.matrix, which is a matrix with all the data about a phylogeny, includng the simulated data. The columns of a tree.data.matrix are the following: (1) is the index of each branch; (2) and (3) are the edge attribute of the class phylo, showing the parent and daughter nodes for each branchrespectively; (4) is the mid age of each branch; (5) is the simulated molecular rate for every branch; (6) is the branch lengths in substitutions per site; and (7) is branch lengths in time units.

plot(clock.sim, col.lineages = rainbow(20), type = "s")

Figure 1

4. Simulate autocorrelated rates

One way to relax the assumption of having a single rate throughout is to propose small changes in rate from one branch to the next. The functions simulate.autocor.kishino and simulate.autocor.thorne use the method described in Kishino et al.(2001) and Thorne et al.(1998) methods to simulate this kind of rate pattern. In both functions the user only needs to provide the rate at the root of the phylogeny and the amount of autocorrelation, given by the parameter v.

sim.low.autocor <- simulate.autocor.kishino(myTree, params = list(initial.rate = 0.01, 
    v = 0.1))
sim.high.autocor <- simulate.autocor.kishino(myTree, params = list(initial.rate = 0.01, 
    v = 0.003))
plot(sim.low.autocor, col.lineages = rainbow(20), type = "s")

plot of chunk unnamed-chunk-6

plot(sim.high.autocor, col.lineages = rainbow(20), type = "s")

plot of chunk unnamed-chunk-7

5. Simulate uncorrelated lognormal rates

To simulate rates that are uncorrelated among branches, but are independently and identically drawn from a parent distribution, we have implemented three different models for rate simulation. Each function requires different input parameters, as described in Drummond et al. (2006).

sim.uncor <- simulate.uncor.lnorm(myTree, params = list(mean.log = -3.9, sd.log = 0.7))
plot(sim.uncor, col.lineages = rainbow(20), type = "s")

plot of chunk unnamed-chunk-8

There are other methods for rate simulation in NELSI, but this tutorial covers the most well-known models. Please refer to the package doccumentation and help files for a full list of functions.

6. Simulate nucleotide sequences using phangorn and exporting the data

We can use the package phangorn to simulate evolvution of nucleotide or amino-acid sequence alignments along a phylogram (the first element of the ratesim object), and save it in an external file in a any format, such as FASTA, for future use.

sim.dna.data <- simSeq(sim.uncor[[1]], l = 2000, type = "DNA")
write.phyDat(sim.dna.data, file = "nelsi_tutorial_dna.fasta", format = "fasta")

Note that the function simSeq can simulate under different models of nucleotide substitution. Use ?simSeq to see details.

write.tree(sim.uncor[[1]], file = "nelsi_tutorial_pylogram.tree")

7. Loading a virus data set estimated in BEAST

Phylogenetic trees in NEXUS format can have a large number of annotations, with information about rates, times, or other traits for every branch in the tree. These annotations can be read in R with the function read.annotated.nexus, from pacakge epibase. Then the annotations cat be imported into a tree data matrix to be used with NELSI. Note that epibase is no longer supported, but we have made this function available here.

The examples_data folder contains a tree from ENV sequences from HIV-1 sub-type A collected between 1991 and 2008.

hivTree <- read.annotated.nexus("hiv_A_env.tree")
plot(hivTree)
axisPhylo()

plot of chunk unnamed-chunk-12

The tree is a chronogram, so that the branch lengths represent units of time. The ages of the nodes can be obtained with the function branching.times, but this only works for ultrametric trees, which is not the case for these data because the samples were obtained at different points in time (heterochronous). The function allnode.times in NELSI can obtain the ages of the nodes and tips for heterochronous trees. The first items of the function are the ages of the tips, and the remaining are the ages of internal nodes.

plot(hivTree, show.tip.label = F)
tip.ages <- round(allnode.times(hivTree), 2)  # Round to two decimal places for a clearer plot
# See the tip ages. The first 19 elements are the ages of the tips (the tree
# has 19 tips), while the remaining are the ages of internal nodes
tiplabels(tip.ages[1:19])
nodelabels(tip.ages[20:37])
axisPhylo()

plot of chunk unnamed-chunk-13

The age of the youngest tip is always assigned an age of 0. The age of the root is calculated with reference to the youngest tip.

8. Obtaining the tree data matrix for a tree estimated in BEAST

We will use the tree in 7. to obtain the tree data matrix and plot the rates through time.

hivDataMatrix <- trann2trdat(hivTree)

head(hivDataMatrix)
##   branch parent daughter midage     rate blensubs blentime
## 1      1     20       21 154.55 0.001301  0.02622   20.159
## 2      2     21       22  99.35 0.001314  0.12836   97.721
## 3      3     22        1  32.00 0.001286  0.05929   46.113
## 4      4     22       23  49.50 0.001274  0.01112    8.730
## 5      5     23        2  31.84 0.001332  0.03859   28.980
## 6      6     23       24  41.23 0.001291  0.01154    8.936

9. Root-to-tip regressions for trees estimated in BEAST

For heterochronous data one can test the molecular clock by conducting a regression of the number of substitutions from the root to the tips vs. the time from the root to the tip, like in the program Path-o-Gen (Rambaut et al. 2009). With the help of a few functions from NELSI, we can conduct these analyses in R.

tipsTimes <- allnode.times(hivTree, tipsonly = T)
hivPhylogram <- hivTree
hivPhylogram$edge.length <- hivDataMatrix$blensubs
tipsSubstitutions <- allnode.times(hivPhylogram, tipsonly = T)
plot(tipsTimes, tipsSubstitutions, pch = 20, ylab = "Substitutions from the root to tips (substitutions)", 
    xlab = "Time from the root to the tip (years)")
hivRegression <- lm(tipsSubstitutions ~ tipsTimes)
abline(hivRegression)

plot of chunk unnamed-chunk-18

summary(hivRegression)
## 
## Call:
## lm(formula = tipsSubstitutions ~ tipsTimes)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.007700 -0.000927  0.000179  0.002335  0.006609 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.000912   0.002026    0.45     0.66    
## tipsTimes   0.001233   0.000189    6.52  5.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00386 on 17 degrees of freedom
## Multiple R-squared:  0.714,  Adjusted R-squared:  0.697 
## F-statistic: 42.5 on 1 and 17 DF,  p-value: 5.25e-06

In this case it the data appear to have clock-like behaviour.

References

Drummond, A. J., Ho, S. Y., Phillips, M. J., & Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLOS biology, 4(5), e88.

Kishino, H., Thorne, J. L., & Bruno, W. J. (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution, 18(3), 352-361.

Lepage, T., Bryant, D., Philippe, H., & Lartillot, N. (2007). A general comparison of relaxed molecular clock models. Molecular Biology and Evolution, 24(12), 2669-2680.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289-290.

Rambaut, A. (2009). Path-O-Gen: temporal signal investigation tool.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592-593.

Thorne, J.L., Kishino, H., and Painter, I.S., Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15.12 (1998): 1647-1657.



sebastianduchene/NELSI documentation built on Aug. 18, 2022, 11:45 p.m.