View source: R/tree.dist.matrix.R
tree.dist.matrix | R Documentation |
This function takes a list of trees and returns a distance matrix populated with distances between all trees in the list.
tree.dist.matrix(trees, treedist = "rf", ...)
trees |
an object of class 'multiPhylo'. |
treedist |
acronym of distance method to employ: one of |
... |
additional parameters sent to distance functions. |
a matrix of distances between each pair of trees
A suite of distance metrics are implemented, offering a trade-off between running time and suitability of metric. Ranked according to their running time with 251 85-tip trees on a low-spec desktop computer, recommended distance metrics are:
- rf
(0.4 seconds): Robinson-Foulds distance (Robinson & Foulds,
1981): although widely used, the RF metric has a series of theoretical
shortcomings that give rise to bias and artefacts, translating to poor
performance in a suite of practical applications. Its low resolution and
rapid saturation make it particularly unsuitable for operations in tree
space. Nevertheless, its speed is hard to match.
- pd
(1 s): The path (= cladistic / nodal / patristic / tip)
distance (Farris 1973) can also be calculated rapidly, but is heavily
influenced by the shape (e.g. balanced / unbalanced) of trees, meaning that
similar-looking trees that nevertheless denote very different sets of
relationships will have shorter distances than may be anticipated.
Consequently, the path metric does a poor job of identifying clusters of
similar trees.
- icrf
(1.4 s): Robinson-Foulds distance, corrected for split size
using information theory (Smith 2020). This measure adjusts the
Robinson-Foulds distance to account for the different significance of
different partitions: partitions that evenly divide taxa contain more
information, and thus should contribute more to a distance score if they
are not shared between trees. This adjustment improves the resolution and
sensitivity of the metric, but does not correct for a number of arguably
more significant biases.
- pid
(4 s); msid
(8 s); cid
(30 s):
Phylogenetic information distance, matching split information
distance and clustering information distance (Smith 2020).
These information-theoretic methods belong to the class of
Generalized Robinson-Foulds distances:
by recognizing similarities between non-identical splits, they overcome
many of the artefacts that affect the RF distance, providing a more
representative measure of tree distances; whereas their information-theoretic
basis affords them a natural unit (the bit), providing a measurable dimension
to tree space.
Whilst the CID performs the best against a suite of theoretical and
practical criteria, the MSID comes a very close second and is somewhat
quicker to calculate. The PID will provide unexpectedly large distances
in a subset of the cases that distort the RF metric, which may result in
undesirable distortions of an accompanying tree space.
Detailed analysis of the behaviour of these and other tree distance methods against a suite of criteria is available in Smith (2020); implementation details are provided in the R package 'TreeDist'.
A further set of methods that underperform methods with similar running time listed above are also implemented for comparative purposes:
- mast
, masti
(30 minutes): size / information content of
the maximum agreement forest, subtracted from its maximum possible value
to create a distance. Specify 'rooted = FALSE' if trees are unrooted.
- jrf
(1 min, k = 2; 25 min, conflict-ok; 4 h, k = 4, no-conflict):
Jaccard Robinson-Foulds metric (Böcker et al. 2013);
specify a value of k
and allowConflict
using ...
.
- ms
(5 s): Matching split distance (Bogdanowicz and Giaro 2012;
Lin et al. 2012.
- nye
(65 s): The generalized RF distance of Nye et al. (2006).
- nni
(65 s): Approximate Nearest Neighbour Interchance (rotation)
distance.
- spr
(0.4 s): Approximate Subtree Prune and Regraft distance.
Böcker S, Canzar S, Klau GW (2013). “The generalized Robinson-Foulds metric.” In Darling A, Stoye J (eds.), Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126, 156–169. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-40453-5_13.
Bogdanowicz D, Giaro K (2012). “Matching split distance for unrooted binary phylogenetic trees.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 150–160. doi: 10.1109/TCBB.2011.48.
Farris JS (1973). “On comparing the shapes of taxonomic trees.” Systematic Zoology, 22(1), 50–54. doi: 10.2307/2412378.
Lin Y, Rajan V, Moret BME (2012). “A metric for phylogenetic trees based on matching.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(9), 1014–1022. doi: 10.1109/TCBB.2011.157.
Nye TMW, Liò P, Gilks WR (2006). “A novel algorithm and web-based tool for comparing two alternative phylogenetic trees.” Bioinformatics, 22(1), 117–119. doi: 10.1093/bioinformatics/bti720.
Robinson DF, Foulds LR (1981). “Comparison of phylogenetic trees.” Mathematical Biosciences, 53(1-2), 131–147. doi: 10.1016/0025-5564(81)90043-2.
Smith MR (2020). “Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees.” Bioinformatics, in production. doi: 10.1093/bioinformatics/btaa614.
## Not run:
data(fungus)
tree.dist.matrix(fungus$Fungus.Run1$trees)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.