In hillarykoch/TASTI: Taxa with Ancestral structure Species Tree Inference

A big data set of rooted gene tree triplets in Newick format can be loaded in as follows:

library(TASTI)
library(readr)
file <- system.file("extdata", "trees.txt", package = "TASTI")

You'll need to extract the topologies (and branch lengths, if using them) from this data. Since TASTI is based on rooted triples, the function $\texttt{get_triples}$ can do this for you. It can output data in either of the two formats used by the function $\texttt{buildST}$ to build the species tree. The output format you need depends on how you want to model -- using topologies and branch lengths, or just topologies alone.

library(ape)
library(stringr)
topos <- get_triples(file, branchLengths = FALSE)
bLens <- get_triples(file, branchLengths = TRUE)

To use $\texttt{buildST}$, the main function in TASTI, your data must be in these formats! This is what they look like:

head(topos)
head(bLens)

You can build a species tree in 3 ways.

It will take some time to do this sort of analysis on 1,000 gene trees. Some pre-computed results using all 3 methods are saved in this package. Load them in by typing:

data("results")

But how can you get something like this? First, you could only use the topologies. This method has the advantage of being the most robust, with the downside of omitting crucial information (that is, branch lengths) that can improve inference accuracy. Let's say that previous work has told us that the effective population size $N=5000$ and the mutation-scaled coalescent rate is inversely proportional to $\theta = 0.005$. Then, this is done as follows:

N <- 5000
theta = .005
fit <- buildST(topos, theta, N, gridSearch = FALSE, branchLengths = FALSE)

The results should look something like this:

results$topologies

Second, you can include the branch lengths. This increases accuracy, assuming your inferred branch lengths are reasonable. This problem can be handled in two ways. Either using numerical optimization:

N <- 5000
theta = .005
fit <- buildST(bLens, theta, N, gridSearch = FALSE, branchLengths = TRUE)

The results look like this:

results$bLens_with_optim

Or, you can compute with a grid search. A grid search can get very expensive (depending on fine your grid is) but has the advantage or searching through the non-convex likelihood surface with more accuracy. The grid search is done as follows:

N <- 5000
theta = .005
fit <- buildST(bLens, theta, N, gridSearch = TRUE, branchLengths = TRUE, mBins = 20, tauBins = 20)

where mBins and tauBins specify the number of "bins" to consider for the migrations rate and speciation times. The larger these numbers, the more accurate your inference will be, but the greater the computational cost, of course. The results, when using the bins as specified above, are:

results$bLens_with_gridSearch

In all 3 cases, we infer that the true species tree has the topology ((AB)C). The taxa seem to belong to ancestrally structured populations, with an estimated migration rate between them between 3 and 5 population-scaled migration units.

hillarykoch/TASTI documentation built on Aug. 1, 2020, 3:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com