This document shows some work-out examples.
These worked-out examples show:
Every experiment has the following steps:
We will need to load some libraries:
library(ape) library(wiritttes) library(ggplot2) library(nLTT) library(phangorn) options(warn = 2)
We also set the resolution of our analysis:
dt <- 0.001
I use 0.1
while developing code,
0.001
for a decent analysis.
This worked-out example will show:
For that, I will create four parameter files:
analyse_toy_examples_1.RDa
: protractedness is absentanalyse_toy_examples_2.RDa
: protractedness stronganalyse_toy_examples_3.RDa
: protractedness is absent for multiple replicatesanalyse_toy_examples_4.RDa
: protractedness is strong for multiple replicatesHere the parameter files are created with the desired parameters:
filenames <- paste0("analyse_toy_examples_", seq(1, 4), ".RDa") wiritttes::create_test_parameter_files(filenames = filenames)
analyse_toy_examples_1.RDa
: protractedness is absent, thus speciation completion rate is high, in this case 1.0e6
analyse_toy_examples_2.RDa
: protractedness strong, thus speciation completion rate is low, inde this case 1.0e-1
analyse_toy_examples_3.RDa
: protractedness is absent for multiple replicates, thus number of sampled species trees, alignments per species trees, and BEAST2 runs are all 2analyse_toy_examples_4.RDa
: protractedness is strong for multiple replicates, thus number of sampled species trees, alignments per species trees, and BEAST2 runs are all 2Here the parameters are shown in a nicer format:
for (filename in filenames) { knitr::kable(wiritttes::read_file(filename)$parameters) }
The parameter settings used in this example are identical, except for their speciation completion rate and number of replicates.
The experimental setup of this research has multiple steps, which we will follow closely here.
These worked-out examples show the data produced in its raw form and does not care too much about aesthetics.
This example answers the question: what is the base level error of the analyses in this research?
The base level error can be obtained by using parameters for a constant-rate birth-death model. All tools used assume this model, but there will be noise (thus error) added in the process.
The parameter settings of example 1 have a high speciation completion rate \lambda, which makes the constant-rate protracted speciation model fall back to a constant-rate birth-death model, as incipient species become good species (close to) instantaneously.
Here we simulate the 'true' incipient species tree:
filename <- filenames[1] testit::assert(wiritttes::is_valid_file(filename)) wiritttes::add_pbd_output(filename)
This is how the incipient species tree looks like:
colors <- stats::setNames(c("gray", "black"), c("i", "g")) testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0) phytools::plotSimmap( wiritttes::read_file(filename)$pbd_output$igtree.extant, colors = colors ) n_taxa_istree <- length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label)
There are r n_taxa_istree
taxa in the incipient species tree.
All taxa are good species, so the
edges of the phylogeny have only one color. The taxon labels are S[genus]-[species]-[sub-species]'.
Now, we plot the nLTT plot of this phylogeny:
nltt_values <- get_nltt_values( phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), dt = 0.001 ) qplot( t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1), main = "Example #1" ) + stat_summary( fun.data = "mean_cl_boot", color = "red", geom = "smooth" )
Note that get_nltt_values
is supplied a list of the one
incipient species tree. This is because get_nltt_values
works on a list of phylogenies.
From that incipient species tree, we create a species tree by sampling one individual per incipient species. Because speciation is for a constant-rate birth death model, there exist no multiple individuals per species. To being able to root our phylogenies in later steps, an outgroup is added as well.
wiritttes::add_species_trees(filename)
Here we observe the sample species tree:
stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1) ape::plot.phylo(stree) nLTT::nltt_plot(stree) n_taxa_sstree <- length(wiritttes::read_file(filename)$species_trees[[1]][[1]]$tip.label)
In the sampled species tree, there are r n_taxa_sstree
taxa,
where in the incipient species tree we had r n_taxa_istree
taxa.
Note that in example #3, there will be multiple sampled species trees.
Knowing the evolutionary distances between species, DNA sequence alignments can be simulated fitting the species tree with outgroup. To do so, the parameters for sequence length and mutation rate are used.
Here we simulate one DNA sequence on our one sampled species tree with outgroup:
add_alignments(filename)
Note that this research assumes a simple Jukes-Cantor model, and does so as well in later steps.
Let's see the alignment:
alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1) ape::image.DNAbin(alignment) n_taxa_alignment <- length( labels( get_alignment_by_index( file = wiritttes::read_file(filename), i = 1 ) ) )
In the alignment, there are r n_taxa_alignment
taxa,
where in the sampled species tree, there are r n_taxa_sstree
taxa.
With BEAST2 we can now obtain a posterior. A posterior consists of a representative sample of all possible trees (and parameter estimates), yet with more probable trees being present more often.
Here we let BEAST2 do so:
add_posteriors(filename)
After running BEAST2 on our DNA sequence, the full posterior must be verified to be eligible for further analysis. Using Tracer, we can open the .log file generated by BEAST2, which has to be done manually.
It can be seen that the values for ESS (Effective Sample Size) are above 200 and that the trace log shows a well-mixed chain. An ESS of 200 is used as a minimum in this research.
A posterior contains many phylogenies.
phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees # To get the densiTree function working, phylogenies must be of class multiphylo class(phylogenies) <- "multiPhylo"
Then, these posterior trees are plotted:
densiTree( phylogenies, type = "cladogram", alpha = 1/length(phylogenies) )
From this plot, it can be seen which phylogeny configurations pop up most in the posterior.
To refresh our mind, the true sampled species tree is:
stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1) ape::plot.phylo(stree) nLTT::nltt_plot(stree)
Now that the full posterior is assumed to be correct, first, I will now highlight one of its posterior states, before going back to the full picture.
In this case, I choose a random state to zoom in on. From this last state, I show the tree only:
# plot_posterior_samples(filename)
This random tree may be very different by chance, as unlikely trees are present, yet in low abundances.
The original sampled tree is very probably different from this random posterior tree.
How different then?
The package wiritttea
analyses exactly this.
This example answers the question: how do the analyses of this research look like for strong protractedness?
The pipeline is identical, except one parameter is changed: the speciation completion rate is set to a low value.
Here we simulate the 'true' incipient species tree:
filename <- filenames[2] testit::assert(wiritttes::is_valid_file(filename)) add_pbd_output(filename)
This is how the incipient species tree looks like:
colors <- stats::setNames(c("gray", "black"), c("i", "g")) testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0) phytools::plotSimmap( wiritttes::read_file(filename)$pbd_output$igtree.extant, colors = colors )
Now, we plot the nLTT plot of this phylogeny:
nltt_values <- get_nltt_values( phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), dt = 0.001 ) qplot( t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1), main = "Example #2 incipient species tree" ) + stat_summary( fun.data = "mean_cl_boot", color = "red", geom = "smooth" )
From that incipient species tree, we create a species tree by sampling one individual per incipient species. Because speciation is for a constant-rate birth death model, there exist no multiple individuals per species. To being able to root our phylogenies in later steps, an outgroup is added as well.
add_species_trees(filename)
Here we observe the sample species tree:
stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1) ape::plot.phylo(stree) nLTT::nltt_plot(stree)
Note that in example #4, there will be multiple sampled species trees.
Here we simulate one DNA sequence on our one sampled species tree with outgroup:
add_alignments(filename)
Let's see the alignment:
alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1) ape::image.DNAbin(alignment)
Here we let BEAST2 infer a posterior:
add_posteriors(filename)
A posterior contains many phylogenies. To visualize these,
they are extracted from a .trees
file here:
phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees # To get the densiTree function working, phylogenies must be of class multiphylo class(phylogenies) <- "multiPhylo"
Then, these posterior trees are plotted:
densiTree( phylogenies, type = "cladogram", alpha = 1/length(phylogenies) )
From this plot, it can be seen which phylogeny configurations pop up most in the posterior.
To refresh our mind, the true sampled species tree is:
stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1) ape::plot.phylo(stree) nLTT::nltt_plot(stree)
Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example #1 assumed instant speciation, which suits the algorithm best, where example #2 has a strong protractedness.
It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.
This visualization does not tell use how the error is made: is the error concentrated at the root, middle or tips of the tree? To locate this, the average nLTT plots of the posterior is shown in figure :
In the next examples, we add replicates to the data.
Doing the rest:
wiritttes::do_simulation(filenames[3]) wiritttes::do_simulation(filenames[4])
This example is an extension of the examples 1, with the differences that:
The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained.
It has little use to sample multiple species trees from an incipient species tree, as these are identical for instant speciation models. In this example, we do sample a species trees twice from the true incipient species tree. This results in two identical species trees.
From these (in this case identical) species trees, multiple alignments can be simulated. In this example, for every species tree, there are two alignments simulated. Because in this example, there are two species sampled, and every species tree has two alignments simulated, this results in four alignments.
Each alignment is different.
In this example, for each alignment, we do two BEAST2 runs, instead of one. It is expected (or: it should be the case) that both runs create a similar posterior. Because this example has 2 species trees, 2 alignments per species tree and 2 BEAST2 runs per alignment, 8 different-yet-similar posteriors are expected.
Never, which is expected, as there is no protractedness
Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-3-both-nLTTs]:
In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.
To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:
nLTT plots of alignments
The different alignments do matter.
Do the different species tree matter?
A bit.
For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-3-nltt-stats]:
But this does not tell us where the errors have been made: near the crown, near the tips, or in between? To do so, we plot the average nLTT plot of the posterior and compare it to the true species tree its nLTT, as shown in figure [fig:Example-3-posterior-average-nltt]:
This example is an extension of the previous examples, with the differences that:
The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained, as can be seen in figure [fig:Example-4-gene-tree]:
As there are multiple sub-species present for the same species, it makes sense to sample multiple species trees from an incipient species tree. In this example, we do sample a species trees twice from the true incipient species tree. This results in two different species trees, as can be verified by figure [fig:Example-4-species-tree].
It is a bit of bad luck that the random number generator picked two very similar species trees: would, instead of
incipient species tree and all species trees
Doing this multiple times:
From these (in this case identical) species trees, multiple alignments can be simulated. In this example, for every species tree, there are two alignments simulated. Because in this example, there are two species sampled, and every species tree has two alignments simulated, this results in four alignments.
Figure [fig:Example-4-alignments] shows a visualisation of the simulated alignments of our example.
Each alignment is different.
In this example, for each alignment, we do two BEAST2 runs, instead of one. It is expected (or: it should be the case) that both runs create a similar posterior. Because this example has 2 species trees, 2 alignments per species tree and 2 BEAST2 runs per alignment, 8 different-yet-similar posteriors are expected.
Last tree
The last trees picked from the eight different posterior are shown in figure [fig:Example-4-posterior-sample-tree].
The species tree picked from the posterior all are different, but not too different (figure [fig:Example-4-species-tree]).
Never BEAST2 creates monophyletic trees.
Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-4-both-nLTTs].
In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.
To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:
The BEAST2 runs appear to have converged to a similar posterior.
Do the different alignments matter?
The different alignments do matter.
Do the different species tree matter?
A bit.
For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-4-nltt-stats].
But this does not tell us where the errors have been made: near the crown, near the tips, or in between? To do so, we plot the average nLTT plot of the posterior and compare it to the true species tree its nLTT, as shown in figure [fig:Example-4-posterior-average-nltt].
Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example 3 assumed instant speciation, which suits the algorithm best, where example 4 has a strong protractedness. Plotting both histograms in the same plot results in figure [fig:Example-34-both-nltt-stats]:
It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.
Because of the number of repeats, there are more data points, as all are lumped together.
This visualization does not tell use how the error is made: is the error concentrated at the root, middle or tips of the tree? To locate this, the average nLTT plots of the posterior is shown in figure [fig:Examples-34-posterior-average-nltt]:
file.remove(filenames) file.remove(dir(pattern="analyse_toy_examples_"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.