This document shows some work-out examples.

These worked-out examples show:

Workflow

Every experiment has the following steps:

Starting up

We will need to load some libraries:

library(ape)
library(wiritttes)
library(ggplot2)
library(nLTT)
library(phangorn)
options(warn = 2)

We also set the resolution of our analysis:

dt <- 0.001

I use 0.1 while developing code, 0.001 for a decent analysis.

Creating parameter files

This worked-out example will show:

For that, I will create four parameter files:

Here the parameter files are created with the desired parameters:

filenames <- paste0("analyse_toy_examples_", seq(1, 4), ".RDa")
wiritttes::create_test_parameter_files(filenames = filenames)

Here the parameters are shown in a nicer format:

for (filename in filenames) {
  knitr::kable(wiritttes::read_file(filename)$parameters)
}

The parameter settings used in this example are identical, except for their speciation completion rate and number of replicates.

The experimental setup of this research has multiple steps, which we will follow closely here.

These worked-out examples show the data produced in its raw form and does not care too much about aesthetics.

Example 1: Weak protractedness

This example answers the question: what is the base level error of the analyses in this research?

The base level error can be obtained by using parameters for a constant-rate birth-death model. All tools used assume this model, but there will be noise (thus error) added in the process.

The parameter settings of example 1 have a high speciation completion rate \lambda, which makes the constant-rate protracted speciation model fall back to a constant-rate birth-death model, as incipient species become good species (close to) instantaneously.

Simulate one incipient species tree per parameter file

Here we simulate the 'true' incipient species tree:

filename <- filenames[1]
testit::assert(wiritttes::is_valid_file(filename))
wiritttes::add_pbd_output(filename)

This is how the incipient species tree looks like:

colors <- stats::setNames(c("gray", "black"), c("i", "g"))
testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0)
phytools::plotSimmap(
  wiritttes::read_file(filename)$pbd_output$igtree.extant, 
  colors = colors
)
n_taxa_istree <- length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label)

There are r n_taxa_istree taxa in the incipient species tree. All taxa are good species, so the edges of the phylogeny have only one color. The taxon labels are S[genus]-[species]-[sub-species]'.

Now, we plot the nLTT plot of this phylogeny:

nltt_values <- get_nltt_values(
  phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), 
  dt = 0.001
)
qplot(
  t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1),
  main = "Example #1"
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Note that get_nltt_values is supplied a list of the one incipient species tree. This is because get_nltt_values works on a list of phylogenies.

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

From that incipient species tree, we create a species tree by sampling one individual per incipient species. Because speciation is for a constant-rate birth death model, there exist no multiple individuals per species. To being able to root our phylogenies in later steps, an outgroup is added as well.

wiritttes::add_species_trees(filename)  

Here we observe the sample species tree:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)
n_taxa_sstree <- length(wiritttes::read_file(filename)$species_trees[[1]][[1]]$tip.label)

In the sampled species tree, there are r n_taxa_sstree taxa, where in the incipient species tree we had r n_taxa_istree taxa.

Note that in example #3, there will be multiple sampled species trees.

Simulate one or more alignments per species tree

Knowing the evolutionary distances between species, DNA sequence alignments can be simulated fitting the species tree with outgroup. To do so, the parameters for sequence length and mutation rate are used.

Here we simulate one DNA sequence on our one sampled species tree with outgroup:

add_alignments(filename)  

Note that this research assumes a simple Jukes-Cantor model, and does so as well in later steps.

Let's see the alignment:

alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1)
ape::image.DNAbin(alignment)

n_taxa_alignment <- length(
  labels(
    get_alignment_by_index(
      file = wiritttes::read_file(filename), 
      i = 1
    )
  )
)

In the alignment, there are r n_taxa_alignment taxa, where in the sampled species tree, there are r n_taxa_sstree taxa.

Once or more often, let BEAST2 infer a posterior per alignment

With BEAST2 we can now obtain a posterior. A posterior consists of a representative sample of all possible trees (and parameter estimates), yet with more probable trees being present more often.

Here we let BEAST2 do so:

add_posteriors(filename)

After running BEAST2 on our DNA sequence, the full posterior must be verified to be eligible for further analysis. Using Tracer, we can open the .log file generated by BEAST2, which has to be done manually.

It can be seen that the values for ESS (Effective Sample Size) are above 200 and that the trace log shows a well-mixed chain. An ESS of 200 is used as a minimum in this research.

A posterior contains many phylogenies.

phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees
# To get the densiTree function working, phylogenies must be of class multiphylo
class(phylogenies) <- "multiPhylo"

Then, these posterior trees are plotted:

densiTree(
  phylogenies, 
  type = "cladogram", 
  alpha = 1/length(phylogenies)
)

From this plot, it can be seen which phylogeny configurations pop up most in the posterior.

Compare original, sampled species tree, to the posterior

To refresh our mind, the true sampled species tree is:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Now that the full posterior is assumed to be correct, first, I will now highlight one of its posterior states, before going back to the full picture.

In this case, I choose a random state to zoom in on. From this last state, I show the tree only:

# plot_posterior_samples(filename)

This random tree may be very different by chance, as unlikely trees are present, yet in low abundances.

The original sampled tree is very probably different from this random posterior tree.

How different then?

The package wiritttea analyses exactly this.

Example 2: Strong protractedness

This example answers the question: how do the analyses of this research look like for strong protractedness?

The pipeline is identical, except one parameter is changed: the speciation completion rate is set to a low value.

Simulate one incipient species tree per parameter file

Here we simulate the 'true' incipient species tree:

filename <- filenames[2]
testit::assert(wiritttes::is_valid_file(filename))
add_pbd_output(filename)

This is how the incipient species tree looks like:

colors <- stats::setNames(c("gray", "black"), c("i", "g"))
testit::assert(length(wiritttes::read_file(filename)$pbd_output$igtree.extant$tip.label) > 0)
phytools::plotSimmap(
  wiritttes::read_file(filename)$pbd_output$igtree.extant, 
  colors = colors
)

Now, we plot the nLTT plot of this phylogeny:

nltt_values <- get_nltt_values(
  phylogenies = list(wiritttes::read_file(filename)$pbd_output$tree), 
  dt = 0.001
)
qplot(
  t, nltt, data = nltt_values, geom = "blank", ylim = c(0,1),
  main = "Example #2 incipient species tree"
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Once or more often, sample unique species from an incipient species tree. Also add an outgroup

From that incipient species tree, we create a species tree by sampling one individual per incipient species. Because speciation is for a constant-rate birth death model, there exist no multiple individuals per species. To being able to root our phylogenies in later steps, an outgroup is added as well.

add_species_trees(filename)  

Here we observe the sample species tree:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Note that in example #4, there will be multiple sampled species trees.

Simulate one or more alignments per species tree

Here we simulate one DNA sequence on our one sampled species tree with outgroup:

add_alignments(filename)  

Let's see the alignment:

alignment <- get_alignment(file = read_file(filename), sti = 1, ai = 1)
ape::image.DNAbin(alignment)

Once or more often, let BEAST2 infer a posterior per alignment

Here we let BEAST2 infer a posterior:

add_posteriors(filename)

A posterior contains many phylogenies. To visualize these, they are extracted from a .trees file here:

phylogenies <- wiritttes::get_posteriors(wiritttes::read_file(filename))[[1]][[1]]$trees
# To get the densiTree function working, phylogenies must be of class multiphylo
class(phylogenies) <- "multiPhylo"

Then, these posterior trees are plotted:

densiTree(
  phylogenies, 
  type = "cladogram", 
  alpha = 1/length(phylogenies)
)

From this plot, it can be seen which phylogeny configurations pop up most in the posterior.

Compare original, sampled species tree, to the posterior

To refresh our mind, the true sampled species tree is:

stree <- wiritttes::get_species_tree_by_index(read_file(filename), sti = 1)
ape::plot.phylo(stree)
nLTT::nltt_plot(stree)

Comparing example 1 and 2

Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example #1 assumed instant speciation, which suits the algorithm best, where example #2 has a strong protractedness.

It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.

This visualization does not tell use how the error is made: is the error concentrated at the root, middle or tips of the tree? To locate this, the average nLTT plots of the posterior is shown in figure :

In the next examples, we add replicates to the data.

Example 3: Weak protractedness with replicates

Doing the rest:

wiritttes::do_simulation(filenames[3])
wiritttes::do_simulation(filenames[4])

This example is an extension of the examples 1, with the differences that:

Incipient species tree

The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained.

Species tree

It has little use to sample multiple species trees from an incipient species tree, as these are identical for instant speciation models. In this example, we do sample a species trees twice from the true incipient species tree. This results in two identical species trees.

Alignment

From these (in this case identical) species trees, multiple alignments can be simulated. In this example, for every species tree, there are two alignments simulated. Because in this example, there are two species sampled, and every species tree has two alignments simulated, this results in four alignments.

Each alignment is different.

Posterior

In this example, for each alignment, we do two BEAST2 runs, instead of one. It is expected (or: it should be the case) that both runs create a similar posterior. Because this example has 2 species trees, 2 alignments per species tree and 2 BEAST2 runs per alignment, 8 different-yet-similar posteriors are expected.

Do posteriors contain paraphylies?

Never, which is expected, as there is no protractedness

Seperate nLTT plots

Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-3-both-nLTTs]:

In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.

nLTT plots of BEAST runs

To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:

The BEAST2 runs appear to have converged to a similar posterior.

nLTT plots of alignments

Do the different alignments matter?

The different alignments do matter.

nLTT plots of species trees

Do the different species tree matter?

A bit.

Errors

For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-3-nltt-stats]:

But this does not tell us where the errors have been made: near the crown, near the tips, or in between? To do so, we plot the average nLTT plot of the posterior and compare it to the true species tree its nLTT, as shown in figure [fig:Example-3-posterior-average-nltt]:

Example 4: Strong protractedness with replicates

This example is an extension of the previous examples, with the differences that:

incipient species tree

The first step is to simulate an incipient species tree. Because the RNG seed is at the same value as example #1, exactly the same incipient species tree will be obtained, as can be seen in figure [fig:Example-4-gene-tree]:

Species tree

As there are multiple sub-species present for the same species, it makes sense to sample multiple species trees from an incipient species tree. In this example, we do sample a species trees twice from the true incipient species tree. This results in two different species trees, as can be verified by figure [fig:Example-4-species-tree].

It is a bit of bad luck that the random number generator picked two very similar species trees: would, instead of

incipient species tree and all species trees

Doing this multiple times:

Alignment

From these (in this case identical) species trees, multiple alignments can be simulated. In this example, for every species tree, there are two alignments simulated. Because in this example, there are two species sampled, and every species tree has two alignments simulated, this results in four alignments.

Figure [fig:Example-4-alignments] shows a visualisation of the simulated alignments of our example.

Each alignment is different.

Posterior

In this example, for each alignment, we do two BEAST2 runs, instead of one. It is expected (or: it should be the case) that both runs create a similar posterior. Because this example has 2 species trees, 2 alignments per species tree and 2 BEAST2 runs per alignment, 8 different-yet-similar posteriors are expected.

Last tree

The last trees picked from the eight different posterior are shown in figure [fig:Example-4-posterior-sample-tree].

The species tree picked from the posterior all are different, but not too different (figure [fig:Example-4-species-tree]).

Do posteriors contain paraphylies?

Never BEAST2 creates monophyletic trees.

Seperate nLTT plots

Each its nLTT plot is compared to the original species tree, which is shown in figure [fig:Example-4-both-nLTTs].

In all cases, we can observe that the lines are close, but do not match. Also this is expected, due to stochasticity in the MCMC sampling.

nLTT plots of BEAST runs

To see how well the posteriors match, the nLTT plots of the BEAST runs are plotted in the same graphs:

The BEAST2 runs appear to have converged to a similar posterior.

nLTT plots of alignments

Do the different alignments matter?

The different alignments do matter.

nLTT plots of species trees

Do the different species tree matter?

A bit.

Error

For each tree in each of the posteriors, the error is put in a histogram as shown by figure [fig:Example-4-nltt-stats].

But this does not tell us where the errors have been made: near the crown, near the tips, or in between? To do so, we plot the average nLTT plot of the posterior and compare it to the true species tree its nLTT, as shown in figure [fig:Example-4-posterior-average-nltt].

Comparing example 3 and 4

Both examples showed a histogram of the error between true species tree and posterior trees. We expect these errors to have a different distribution: example 3 assumed instant speciation, which suits the algorithm best, where example 4 has a strong protractedness. Plotting both histograms in the same plot results in figure [fig:Example-34-both-nltt-stats]:

It can be observed that these error distributions have a different median. This means that there is an observable higher error being made when the true species tree is protracted.

Because of the number of repeats, there are more data points, as all are lumped together.

This visualization does not tell use how the error is made: is the error concentrated at the root, middle or tips of the tree? To locate this, the average nLTT plots of the posterior is shown in figure [fig:Examples-34-posterior-average-nltt]:

References

Cleaning up

file.remove(filenames)
file.remove(dir(pattern="analyse_toy_examples_"))


richelbilderbeek/wiritttes documentation built on May 27, 2019, 8:14 a.m.