expand: Expand an existing classification tree.
In insect: Informatic Sequence Classification Trees

expand

R Documentation

Description

This function grows an existing classification tree. It is typically used with more relaxed parameter settings than those used when the tree was created, or when fine-scale control over the tree-learning operation is required.

Usage

expand(
  tree,
  clades = "0",
  refine = "Viterbi",
  iterations = 50,
  nstart = 20,
  minK = 2,
  maxK = 2,
  minscore = 0.9,
  probs = 0.5,
  retry = TRUE,
  resize = TRUE,
  maxsize = 1000,
  recursive = TRUE,
  cores = 1,
  quiet = FALSE,
  verbose = FALSE,
  ...
)

Arguments

`tree`	an object of class `"insect"`.
`clades`	a vector of character strings giving the binary indices matching the labels of the nodes that are to be expanded. Defaults to "0", meaning all subclades are expanded. See below for further details on clade indexing.
`refine`	character string giving the iterative model refinement method to be used in the partitioning process. Valid options are `"Viterbi"` (Viterbi training; the default option) and `"BaumWelch"` (a modified version of the Expectation-Maximization algorithm).
`iterations`	integer giving the maximum number of training-classification iterations to be used in the splitting process. Note that this is not necessarily the same as the number of Viterbi training or Baum-Welch iterations to be used in model training, which can be set using the argument `"maxiter"` (eventually passed to `train` via the dots argument "...").
`nstart`	integer. The number of random starting sets to be chosen for initial k-means assignment of sequences to groups. Defaults to 20.
`minK`	integer. The minimum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).
`maxK`	integer. The maximum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).
`minscore`	numeric between 0 and 1. The minimum acceptable value for the nth percentile of Akaike weights (where n is the value given in `"probs"`) for the node to be split and the recursion process to continue. At any given node, if the nth percentile of Akaike weights falls below this threshold, the recursion process for the node will terminate. As an example, if `minscore = 0.95` and `probs = 0.5` (the default for `probs`; note that `minscore` defaults to 0.9), and after generating two candidate PHMMs to occupy the candidate subnodes the median Akaike weight is less than 0.95, the splitting process will terminate and the function will simply return the unsplit root node.
`probs`	numeric between 0 and 1. The percentile of Akaike weights to test against the minimum score threshold given in `"minscore"`.
`retry`	logical indicating whether failure to split a node based on the criteria outlined in 'minscore' and 'probs' should prompt a second attempt with different initial groupings. These groupings are based on maximum kmer frequencies rather than k-means division, which can give suboptimal groupings when the cluster sizes are different (due to the up-weighting of larger clusters in the k-means algorithm).
`resize`	logical indicating whether the models should be free to change size during the training process or if the number of modules should be fixed. Defaults to TRUE. Only applicable if `refine = "Viterbi"`.
`maxsize`	integer giving the upper bound on the number of modules in the PHMMs. If NULL, no maximum size is enforced.
`recursive`	logical indicating whether the splitting process should continue recursively until the discrimination criteria are not met (TRUE; default), or whether a single split should take place at each of the nodes specified in `clades`.
`cores`	integer giving the number of processors for multithreading. Defaults to 1. This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, e.g. by running `parallel::stopCluster(cores)`. The string 'autodetect' is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available.
`quiet`	logical indicating whether feedback should be printed to the console.
`verbose`	logical indicating whether extra feedback should be printed to the console, including progress at each split.
`...`	further arguments to be passed on to `train`.
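
The interaction between `minscore` and `probs` described above can be sketched in base R. This is only an illustration of the threshold test as worded in the argument descriptions, not the package's internal code: the helper `passes_threshold` is hypothetical, and the Akaike weights are made-up values for five sequences.

```r
# Hypothetical helper illustrating the threshold test; not part of insect.
passes_threshold <- function(weights, minscore = 0.9, probs = 0.5) {
  stopifnot(is.numeric(weights), all(weights >= 0 & weights <= 1))
  # 'probs' is a proportion in [0, 1], so probs = 0.5 selects the median.
  unname(stats::quantile(weights, probs = probs)) >= minscore
}

# Made-up Akaike weights for five sequences at a candidate split.
w <- c(0.99, 0.97, 0.96, 0.80, 0.60)

# The median weight is 0.96, which meets a threshold of 0.95.
split_at_median <- passes_threshold(w, minscore = 0.95, probs = 0.5)

# A lower quantile is a stricter test, since it looks at the
# poorly supported sequences rather than the typical one.
split_at_p20 <- passes_threshold(w, minscore = 0.95, probs = 0.2)
```

Under these assumptions, lowering `probs` or raising `minscore` makes a split less likely to be accepted, which is why relaxed values are a natural choice when expanding a tree that stopped growing early.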

Details

The clade indexing system used here is based on character strings, where "0" refers to the root node, "01" is the first child node, "02" is the second child node, "011" is the first child node of the first child node, etc. Note that this means each node cannot have more than 9 child nodes.
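
To make the indexing scheme concrete, the following base R sketch derives the depth, parent and child indices of a node from its index string, following the convention described above ("0" is the root and each further digit selects a child). The helper names are hypothetical and are not exported by the package.

```r
# Hypothetical helpers for the string-based clade indices; not part of insect.

# Depth of a node: the root "0" has depth 0, "01" has depth 1, and so on.
clade_depth <- function(clade) nchar(clade) - 1L

# Index of the parent node, or NA for the root.
clade_parent <- function(clade) {
  if (nchar(clade) > 1L) substr(clade, 1L, nchar(clade) - 1L) else NA_character_
}

# Indices of the k child nodes. Each child is a single digit,
# which is why a node cannot have more than 9 children.
clade_children <- function(clade, k = 2L) {
  stopifnot(k >= 1L, k <= 9L)
  paste0(clade, seq_len(k))
}

root_children <- clade_children("0")   # "01" "02"
grandchild_parent <- clade_parent("011")  # "01"
grandchild_depth <- clade_depth("011")    # 2
```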

Value

an object of class "insect".

Author(s)

Shaun Wilkinson

Examples


  library(insect)
  data(whales)
  data(whale_taxonomy)
  ## split the first node
  set.seed(123)
  tree <- learn(whales, db = whale_taxonomy, recursive = FALSE)
  ## expand only the first clade
  tree <- expand(tree, clades = "1")

insect documentation built on June 8, 2025, 10:37 a.m.