phylter: Filter phylogenomics datasets
In phylter: Detect and Remove Outliers in Phylogenomics Datasets

phylter

R Documentation

Description

Detects and filters out outliers in a list of trees or a list of distance matrices.

Usage

phylter(
  X,
  bvalue = 0,
  distance = "patristic",
  k = 3,
  k2 = k,
  Norm = "median",
  Norm.cutoff = 0.001,
  gene.names = NULL,
  test.island = TRUE,
  verbose = TRUE,
  stop.criteria = 1e-05,
  InitialOnly = FALSE,
  normalizeby = "row",
  parallel = TRUE
)

Arguments

`X`	A list of phylogenetic trees (objects of class phylo) or a list of distance matrices. Trees can have different numbers of leaves and matrices can have different dimensions. If this is the case, missing values are imputed.
`bvalue`	If X is a list of trees, nodes with a support below 'bvalue' will be collapsed prior to the outlier detection.
`distance`	If X is a list of trees, type of distance used to compute the pairwise matrices for each tree. Can be "patristic" (sum of branch lengths separating tips, the default) or "nodal" (number of nodes separating tips). The "nodal" option should only be used if all species are present in all genes.
`k`	Strength of outlier detection. The higher this value, the fewer outliers are detected (see details).
`k2`	Same as k, but for the detection of complete gene outliers. To keep complete genes from being discarded, k2 can be increased. By default, k2 = k (see above).
`Norm`	Whether and how the matrices should be normalized prior to the complete analysis. If "median" (the default), matrices are divided by their median; if "mean", they are divided by their mean; if "none", no normalization is performed. Normalizing ensures that fast-evolving (and slow-evolving) genes are not treated as outliers. Normalization by median is the better choice as it is less sensitive to outlier values.
`Norm.cutoff`	Value of the median (if `Norm="median"`) or the mean (if `Norm="mean"`) of phylogenetic distance matrices below which genes are simply discarded from the analysis. This prevents dividing by 0, and allows getting rid of genes that contain mostly branches of length 0 and are therefore uninformative anyway. Discarded genes, if any, are listed in the output `out$DiscardedGenes`.
`gene.names`	List of gene names used to rename elements in X. If NULL (the default), elements are named 1, 2, ..., length(X).
`test.island`	If TRUE (the default), only the highest value in an 'island' of outliers is considered an outlier. This prevents non-outliers that hitchhike with outliers from being considered outliers themselves.
`verbose`	If TRUE (the default), messages are written during the filtering process to report what is happening.
`stop.criteria`	The optimisation stops when the gain in concordance between matrices between round `n` and round `n+1` is smaller than this value. Defaults to 1e-5.
`InitialOnly`	Logical. If TRUE, only the Initial state of the data is computed.
`normalizeby`	Whether and how the gene x species matrix should be normalized prior to outlier detection. Defaults to "row".
`parallel`	Should the computations be parallelized when possible? Defaults to TRUE. Note that the number of threads cannot be set by the user when 'parallel=TRUE': all available cores on the machine are used.
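
The effect of `Norm = "median"` and `Norm.cutoff` described above can be illustrated with a small base R sketch. This is not phylter's internal code: the toy matrices, the helper `make_dist`, and the choice to take the median over the upper triangle only are assumptions made for illustration.

```r
# Illustrative sketch only; not phylter's implementation.
# make_dist is a hypothetical helper building a symmetric 3 x 3 distance matrix.
make_dist <- function(values) {
  sp <- c("A", "B", "C")
  m <- matrix(0, nrow = 3, ncol = 3, dimnames = list(sp, sp))
  m[upper.tri(m)] <- values
  m + t(m)
}

# Toy data: a slow gene, a fast gene (10x the distances), and a nearly flat gene.
mats <- list(
  slow = make_dist(c(0.1, 0.2, 0.3)),
  fast = make_dist(c(1, 2, 3)),
  flat = make_dist(c(0, 0, 0.0001))
)

norm_cutoff <- 0.001

# Assumption: the median is taken over the pairwise (off-diagonal) distances.
med <- vapply(mats, function(m) median(m[upper.tri(m)]), numeric(1))

# Genes whose median falls below the cutoff are set aside before dividing.
keep <- med >= norm_cutoff
discarded <- names(mats)[!keep]

# Divide each remaining matrix by its own median.
normalized <- Map(function(m, s) m / s, mats[keep], med[keep])
```

In this toy case the `flat` gene is set aside because its median distance is 0, and the `slow` and `fast` matrices become equal after division, which is the sense in which normalization keeps evolutionary rate alone from making a gene look like an outlier.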

Value

A list of class 'phylter' with the 'Initial' (before filtering) and 'Final' (after filtering) states, or a list of class 'phylterinitial' only, if InitialOnly=TRUE. The function also returns the list of DiscardedGenes, if any.

Examples

data(carnivora)

# using default parameters
res <- phylter(carnivora, parallel = FALSE) # perform the phylter analysis
res # brief summary of the analysis
res$DiscardedGenes # list of genes discarded prior to the analysis
res$Initial # See all elements prior to the analysis
res$Final # See all elements at the end of the analysis
res$Final$Outliers # Print all outliers detected


# Use nodal distances instead of patristic ones:
res <- phylter(carnivora, distance = "nodal")

phylter documentation built on Aug. 8, 2025, 6:16 p.m.