In richelbilderbeek/wiritttes: WIRITTE simulation

This document shows some work-out examples.

These worked-out examples show:

the error if protractedness is absent
the error if protractedness strong
comparing these errors
the error if protractedness is absent for multiple replicates
the error if protractedness is strong for multiple replicates
comparing these errors

Workflow

Every experiment has the following steps:

Starting up
Creating parameter files
Per parameter file:
Simulate one incipient species tree per parameter file
Once or more often, sample unique species from an incipient species tree
Per sampled species tree, add an outgroup
Simulate one or more alignments per species tree
Once or more often, let BEAST2 infer a posterior per alignment
Compare original, sampled species tree, to the posterior
Per two parameter files:
Compare original, sampled species tree, to the posterior

Starting up

We will need to load some libraries:

library(ape)
library(wiritttes)
library(ggplot2)
library(nLTT)
library(phangorn)
options(warn = 2)

We also set the resolution of our analysis:

dt <- 0.001

I use 0.1 while developing code, 0.001 for a decent analysis.

Creating parameter files

This worked-out example will show:

the error if protractedness is absent
the error if protractedness strong
comparing these errors
the error if protractedness is absent for multiple replicates
the error if protractedness is strong for multiple replicates
comparing these errors

For that, I will create four parameter files:

analyse_toy_examples_1.RDa: protractedness is absent
analyse_toy_examples_2.RDa: protractedness strong
analyse_toy_examples_3.RDa: protractedness is absent for multiple replicates
analyse_toy_examples_4.RDa: protractedness is strong for multiple replicates

Here the parameter files are created with the desired parameters:

filenames <- paste0("analyse_toy_examples_", seq(1, 4), ".RDa")
wiritttes::create_test_parameter_files(filenames = filenames)

analyse_toy_examples_1.RDa: protractedness is absent, thus speciation completion rate is high, in this case 1.0e6
analyse_toy_examples_2.RDa: protractedness strong, thus speciation completion rate is low, inde this case 1.0e-1
analyse_toy_examples_3.RDa: protractedness is absent for multiple replicates, thus number of sampled species trees, alignments per species trees, and BEAST2 runs are all 2
analyse_toy_examples_4.RDa: protractedness is strong for multiple replicates, thus number of sampled species trees, alignments per species trees, and BEAST2 runs are all 2

Here the parameters are shown in a nicer format:

for (filename in filenames) {
  knitr::kable(wiritttes::read_file(filename)$parameters)
}

The parameter settings used in this example are identical, except for their speciation completion rate and number of replicates.

The experimental setup of this research has multiple steps, which we will follow closely here.

These worked-out examples show the data produced in its raw form and does not care too much about aesthetics.

Example 1: Weak protractedness

This example answers the question: what is the base level error of the analyses in this research?

The base level error can be obtained by using parameters for a constant-rate birth-death model. All tools used assume this model, but there will be noise (thus error) added in the process.

The parameter settings of example 1 have a high speciation completion rate \lambda, which makes the constant-rate protracted speciation model fall back to a constant-rate birth-death model, as incipient species become good species (close to) instantaneously.

Simulate one incipient species tree per parameter file

Here we simulate the 'true' incipient species tree:

filename <- filenames[1]
testit::assert(wiritttes::is_valid_file(filename))
wiritttes::add_pbd_output(filename)

This is how the incipient species tree looks like:

colors <- stats::setNames(c("gray", "black"), c("i", "g"))
testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0)
phytools::plotSimmap(
  wiritttes::read_file(filename)$pbd_output$igtree.extant, 
  colors = colors
)
n_taxa_istree <- length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label)

There are r n_taxa_istree taxa in the incipient species tree. All taxa are good species, so the edges of the phylogeny have only one color. The taxon labels are S[genus]-[species]-[sub-species]'.

Now, we plot the nLTT plot of this phylogeny:

nltt_values <- get_nltt_values(
  phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), 
  dt = 0.001
)
qplot(
  t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1),
  main = "Example #1"
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Note that get_nltt_values is supplied a list of the one incipient species tree. This is because get_nltt_values works on a list of phylogenies.

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

From that incipient species tree, we create a species tree by sampling one individual per incipient species. Because speciation is for a constant-rate birth death model, there exist no multiple individuals per species. To being able to root our phylogenies in later steps, an outgroup is added as well.

wiritttes::add_species_trees(filename)

Here we observe the sample species tree:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)
n_taxa_sstree <- length(wiritttes::read_file(filename)$species_trees[[1]][[1]]$tip.label)

In the sampled species tree, there are r n_taxa_sstree taxa, where in the incipient species tree we had r n_taxa_istree taxa.

Note that in example #3, there will be multiple sampled species trees.

Simulate one or more alignments per species tree

Knowing the evolutionary distances between species, DNA sequence alignments can be simulated fitting the species tree with outgroup. To do so, the parameters for sequence length and mutation rate are used.

Here we simulate one DNA sequence on our one sampled species tree with outgroup:

add_alignments(filename)

Note that this research assumes a simple Jukes-Cantor model, and does so as well in later steps.

Let's see the alignment:

alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1)
ape::image.DNAbin(alignment)

n_taxa_alignment <- length(
  labels(
    get_alignment_by_index(
      file = wiritttes::read_file(filename), 
      i = 1
    )
  )
)

In the alignment, there are r n_taxa_alignment taxa, where in the sampled species tree, there are r n_taxa_sstree taxa.

Once or more often, let BEAST2 infer a posterior per alignment

With BEAST2 we can now obtain a posterior. A posterior consists of a representative sample of all possible trees (and parameter estimates), yet with more probable trees being present more often.

Here we let BEAST2 do so:

add_posteriors(filename)

After running BEAST2 on our DNA sequence, the full posterior must be verified to be eligible for further analysis. Using Tracer, we can open the .log file generated by BEAST2, which has to be done manually.

It can be seen that the values for ESS (Effective Sample Size) are above 200 and that the trace log shows a well-mixed chain. An ESS of 200 is used as a minimum in this research.

A posterior contains many phylogenies.

phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees
# To get the densiTree function working, phylogenies must be of class multiphylo
class(phylogenies) <- "multiPhylo"

Then, these posterior trees are plotted:

densiTree(
  phylogenies, 
  type = "cladogram", 
  alpha = 1/length(phylogenies)
)

From this plot, it can be seen which phylogeny configurations pop up most in the posterior.

Compare original, sampled species tree, to the posterior

To refresh our mind, the true sampled species tree is:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Now that the full posterior is assumed to be correct, first, I will now highlight one of its posterior states, before going back to the full picture.

In this case, I choose a random state to zoom in on. From this last state, I show the tree only:

# plot_posterior_samples(filename)

This random tree may be very different by chance, as unlikely trees are present, yet in low abundances.

The original sampled tree is very probably different from this random posterior tree.

How different then?

The package wiritttea analyses exactly this.

Example 2: Strong protractedness

This example answers the question: how do the analyses of this research look like for strong protractedness?

The pipeline is identical, except one parameter is changed: the speciation completion rate is set to a low value.

Simulate one incipient species tree per parameter file

Here we simulate the 'true' incipient species tree:

filename <- filenames[2]
testit::assert(wiritttes::is_valid_file(filename))
add_pbd_output(filename)

This is how the incipient species tree looks like:

colors <- stats::setNames(c("gray", "black"), c("i", "g"))
testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0)
phytools::plotSimmap(
  wiritttes::read_file(filename)$pbd_output$igtree.extant, 
  colors = colors
)

Now, we plot the nLTT plot of this phylogeny:

nltt_values <- get_nltt_values(
  phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), 
  dt = 0.001
)
qplot(
  t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1),
  main = "Example #2 incipient species tree"
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

add_species_trees(filename)

Here we observe the sample species tree:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Note that in example #4, there will be multiple sampled species trees.

Simulate one or more alignments per species tree

Here we simulate one DNA sequence on our one sampled species tree with outgroup:

add_alignments(filename)

Let's see the alignment:

alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1)
ape::image.DNAbin(alignment)

Once or more often, let BEAST2 infer a posterior per alignment

Here we let BEAST2 infer a posterior:

add_posteriors(filename)

A posterior contains many phylogenies. To visualize these, they are extracted from a .trees file here:

phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees
# To get the densiTree function working, phylogenies must be of class multiphylo
class(phylogenies) <- "multiPhylo"

Then, these posterior trees are plotted:

densiTree(
  phylogenies, 
  type = "cladogram", 
  alpha = 1/length(phylogenies)
)

From this plot, it can be seen which phylogeny configurations pop up most in the posterior.

Compare original, sampled species tree, to the posterior

To refresh our mind, the true sampled species tree is:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Comparing example 1 and 2

Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example #1 assumed instant speciation, which suits the algorithm best, where example #2 has a strong protractedness.

It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.

This visualization does not tell use how the error is made: is the error concentrated at the root, middle or tips of the tree? To locate this, the average nLTT plots of the posterior is shown in figure :

In the next examples, we add replicates to the data.

Example 3: Weak protractedness with replicates

Doing the rest:

wiritttes::do_simulation(filenames[3])
wiritttes::do_simulation(filenames[4])

This example is an extension of the examples 1, with the differences that:

From the (same) incipient species tree, multiple species trees are sampled
Per species tree, multiple DNA alignments are simulated
Per DNA alignment, multiple BEAST2 runs are performed

Incipient species tree

The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained.

Species tree

It has little use to sample multiple species trees from an incipient species tree, as these are identical for instant speciation models. In this example, we do sample a species trees twice from the true incipient species tree. This results in two identical species trees.

Alignment

From these (in this case identical) species trees, multiple alignments can be simulated. In this example, for every species tree, there are two alignments simulated. Because in this example, there are two species sampled, and every species tree has two alignments simulated, this results in four alignments.

Each alignment is different.

Posterior

In this example, for each alignment, we do two BEAST2 runs, instead of one. It is expected (or: it should be the case) that both runs create a similar posterior. Because this example has 2 species trees, 2 alignments per species tree and 2 BEAST2 runs per alignment, 8 different-yet-similar posteriors are expected.

Do posteriors contain paraphylies?

Never, which is expected, as there is no protractedness

Seperate nLTT plots

Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-3-both-nLTTs]:

In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.

nLTT plots of BEAST runs

To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:

The BEAST2 runs appear to have converged to a similar posterior.

nLTT plots of alignments

Do the different alignments matter?

The different alignments do matter.

nLTT plots of species trees

Do the different species tree matter?

A bit.

Errors

For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-3-nltt-stats]:

But this does not tell us where the errors have been made: near the crown, near the tips, or in between? To do so, we plot the average nLTT plot of the posterior and compare it to the true species tree its nLTT, as shown in figure [fig:Example-3-posterior-average-nltt]:

Example 4: Strong protractedness with replicates

This example is an extension of the previous examples, with the differences that:

From the (same) incipient species tree, multiple species trees are sampled
Per species tree, multiple DNA alignments are simulated
Per DNA alignment, multiple BEAST2 runs are performed

incipient species tree

The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained, as can be seen in figure [fig:Example-4-gene-tree]:

Species tree

As there are multiple sub-species present for the same species, it makes sense to sample multiple species trees from an incipient species tree. In this example, we do sample a species trees twice from the true incipient species tree. This results in two different species trees, as can be verified by figure [fig:Example-4-species-tree].

It is a bit of bad luck that the random number generator picked two very similar species trees: would, instead of

incipient species tree and all species trees

Doing this multiple times:

Alignment

Figure [fig:Example-4-alignments] shows a visualisation of the simulated alignments of our example.

Each alignment is different.

Posterior

Last tree

The last trees picked from the eight different posterior are shown in figure [fig:Example-4-posterior-sample-tree].

The species tree picked from the posterior all are different, but not too different (figure [fig:Example-4-species-tree]).

Do posteriors contain paraphylies?

Never BEAST2 creates monophyletic trees.

Seperate nLTT plots

Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-4-both-nLTTs].

In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.

nLTT plots of BEAST runs

To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:

The BEAST2 runs appear to have converged to a similar posterior.

nLTT plots of alignments

Do the different alignments matter?

The different alignments do matter.

nLTT plots of species trees

Do the different species tree matter?

A bit.

Error

For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-4-nltt-stats].

Comparing example 3 and 4

Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example 3 assumed instant speciation, which suits the algorithm best, where example 4 has a strong protractedness. Plotting both histograms in the same plot results in figure [fig:Example-34-both-nltt-stats]:

It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.

Because of the number of repeats, there are more data points, as all are lumped together.

References

[1] Janzen, Thijs, Sebastian Höhna, and Rampal S. Etienne. "Approximate Bayesian Computation of diversification rates from molecular phylogenies: introducing a new efficient summary statistic, the nLTT." Methods in Ecology and Evolution 6.5 (2015): 566-575.
[2] Pybus, Oliver G., and Paul H. Harvey. "Testing macro–evolutionary models using incomplete molecular phylogenies." Proceedings of the Royal Society of London B: Biological Sciences 267.1459 (2000): 2267-2272.

Cleaning up

file.remove(filenames)
file.remove(dir(pattern="analyse_toy_examples_"))

richelbilderbeek/wiritttes documentation built on May 27, 2019, 8:14 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

richelbilderbeek/wiritttes WIRITTE simulation

In richelbilderbeek/wiritttes: WIRITTE simulation

Workflow

Starting up

Creating parameter files

Example 1: Weak protractedness

Simulate one incipient species tree per parameter file

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

Simulate one or more alignments per species tree

Once or more often, let BEAST2 infer a posterior per alignment

Compare original, sampled species tree, to the posterior

Example 2: Strong protractedness

Simulate one incipient species tree per parameter file

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

Simulate one or more alignments per species tree

Once or more often, let BEAST2 infer a posterior per alignment

Compare original, sampled species tree, to the posterior

Comparing example 1 and 2

Example 3: Weak protractedness with replicates

Incipient species tree

Species tree

Alignment

Posterior

Do posteriors contain paraphylies?

Seperate nLTT plots

nLTT plots of BEAST runs

The BEAST2 runs appear to have converged to a similar posterior.

Do the different alignments matter?

nLTT plots of species trees

Errors

Example 4: Strong protractedness with replicates

incipient species tree

Species tree

Alignment

Posterior

Do posteriors contain paraphylies?

Seperate nLTT plots

nLTT plots of BEAST runs

nLTT plots of alignments

nLTT plots of species trees

Error

Comparing example 3 and 4

References

Cleaning up

R Package Documentation

Browse R Packages

We want your feedback!

richelbilderbeek/wiritttes
WIRITTE simulation