SuperTree: Create a Species Tree from Gene Trees

View source: R/SuperTree.R

SuperTreeR Documentation

Create a Species Tree from Gene Trees

Description

Given a set of unrooted gene trees, creates a species tree. This function works for rooted gene trees, but may not accurately root the resulting tree.

Usage

SuperTree(myDendList, NAMEFUN=NULL, Verbose=TRUE, Processors=1)

Arguments

myDendList

List of dendrogram objects, where each entry is an unrooted gene tree.

NAMEFUN

Optional input specifying a function to apply to each leaf to convert gene tree leaf labels into species names. This function should take as input a character vector and return a character vector of the same size. By default equals NULL, indicating that gene tree leaves are already labeled with species identifiers. See details for more information.

Verbose

Should output be displayed?

Processors

Number of processors to use for calculating the final species tree.

Details

This implementation follows the ASTRID algorithm for estimating a species tree from a set of unrooted gene trees. Input gene trees are not required to have identical species sets, as the algorithm can handle missing entries in gene trees. The algorithm essentially works by averaging the Cophenetic distance matrices of all gene trees, then constructing a neighbor-joining tree from the resulting distance matrix. See the original paper linked in the references section for more information.

If two species never appear together in a gene tree, their distance cannot be estimated in the algorithm and will thus be missing. SuperTree handles this by imputing the value using the distances available with data-interpolating empirical orthogonal functions (DINEOF). This approach has relatively high accuracy even up to high levels of missingness. Eigenvector calculation speed is improved using a Lanczos algorithm for matrix compression.

SuperTree allows an optional argument called NAMEFUN to apply a renaming step to leaf labels. Gene trees as constructed by other functions in SynExtend (ex. DisjointSet) often include other information aside from species name when labeling genes, but SuperTree requires that leaf nodes of the gene tree are labeled with just an identifier corresponding to which species/genome each leaf is from. Duplicate values are allowed. See the examples section for more details on what this looks like and how to handle it.

Value

A dendrogram object corresponding to the species tree constructed from input gene trees.

Author(s)

Aidan Lakshman ahl27@pitt.edu

References

Vachaspati, P., Warnow, T. ASTRID: Accurate Species TRees from Internode Distances. BMC Genomics, 2015. 16 (Suppl 10): S3.

Taylor, M.H., Losch, M., Wenzel, M. and Schröter, J. On the sensitivity of field reconstruction and prediction using empirical orthogonal functions derived from gappy data. Journal of Climate, 2013. 26(22): 9194-9205.

See Also

TreeLine, SuperTreeEx

Examples

# Loads a list of dendrograms
# each is a gene tree from Streptomyces genomes
data("SuperTreeEx", package="SynExtend")

# Notice that the labels of the tree are in #_#_# format
# See the man page for SuperTreeEx for more info
labs <- labels(exData[[1]])
labs

# The first number corresponds to the species,
# so we need to trim the rest in each leaf label
namefun <- function(x) gsub("([0-9A-Za-z]*)_.*", "\\1", x)
namefun(labs) # trims to just first number

# This function replaces gene identifiers with species identifiers
# we pass it to NAMEFUN
# Note NAMEFUN should take in a character vector and return a character vector
tree <- SuperTree(exData, NAMEFUN=namefun)

npcooley/SynExtend documentation built on May 2, 2024, 7:28 p.m.