A big data set of rooted gene tree triplets in Newick format can be loaded in as follows:
library(TASTI) library(readr) file <- system.file("extdata", "trees.txt", package = "TASTI")
You'll need to extract the topologies (and branch lengths, if using them) from this data. Since TASTI is based on rooted triples, the function $\texttt{get_triples}$ can do this for you. It can output data in either of the two formats used by the function $\texttt{buildST}$ to build the species tree. The output format you need depends on how you want to model -- using topologies and branch lengths, or just topologies alone.
library(ape) library(stringr) topos <- get_triples(file, branchLengths = FALSE) bLens <- get_triples(file, branchLengths = TRUE)
To use $\texttt{buildST}$, the main function in TASTI, your data must be in these formats! This is what they look like:
head(topos) head(bLens)
It will take some time to do this sort of analysis on 1,000 gene trees. Some pre-computed results using all 3 methods are saved in this package. Load them in by typing:
data("results")
But how can you get something like this? First, you could only use the topologies. This method has the advantage of being the most robust, with the downside of omitting crucial information (that is, branch lengths) that can improve inference accuracy. Let's say that previous work has told us that the effective population size $N=5000$ and the mutation-scaled coalescent rate is inversely proportional to $\theta = 0.005$. Then, this is done as follows:
N <- 5000 theta = .005 fit <- buildST(topos, theta, N, gridSearch = FALSE, branchLengths = FALSE)
The results should look something like this:
results$topologies
Second, you can include the branch lengths. This increases accuracy, assuming your inferred branch lengths are reasonable. This problem can be handled in two ways. Either using numerical optimization:
N <- 5000 theta = .005 fit <- buildST(bLens, theta, N, gridSearch = FALSE, branchLengths = TRUE)
The results look like this:
results$bLens_with_optim
Or, you can compute with a grid search. A grid search can get very expensive (depending on fine your grid is) but has the advantage or searching through the non-convex likelihood surface with more accuracy. The grid search is done as follows:
N <- 5000 theta = .005 fit <- buildST(bLens, theta, N, gridSearch = TRUE, branchLengths = TRUE, mBins = 20, tauBins = 20)
where mBins and tauBins specify the number of "bins" to consider for the migrations rate and speciation times. The larger these numbers, the more accurate your inference will be, but the greater the computational cost, of course. The results, when using the bins as specified above, are:
results$bLens_with_gridSearch
In all 3 cases, we infer that the true species tree has the topology ((AB)C). The taxa seem to belong to ancestrally structured populations, with an estimated migration rate between them between 3 and 5 population-scaled migration units.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.