classify: Tree-based sequence classification.
In shaunpwilkinson/insect: Informatic Sequence Classification Trees


"classify" assigns taxon IDs to DNA sequences using an informatic sequence classification tree.

classify(
  x,
  tree,
  threshold = 0.8,
  decay = FALSE,
  ping = 0.98,
  mincount = 5,
  offset = 0,
  ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species"),
  species = "ping100",
  tabulize = FALSE,
  metadata = FALSE,
  cores = 1
)

`x`	a sequence or set of sequences. Can be a "DNAbin" or "AAbin" object or a named vector of upper-case DNA character strings.
`tree`	an object of class `"insect"` (see `learn` for details).
`threshold`	numeric between 0 and 1 giving the minimum Akaike weight for the recursive classification procedure to continue toward the leaves of the tree. Defaults to 0.8.
`decay`	logical indicating whether the decision to terminate the classification process should be made based on decaying Akaike weights (at each node, the Akaike weight of the selected model is multiplied by the Akaike weight of the selected model at the parent node) or whether each Akaike weight should be calculated independently of that of the parent node. Defaults to FALSE (the latter).
`ping`	logical or numeric (between 0 and 1) indicating whether a nearest neighbor search should be carried out, and if so, the minimum similarity to the nearest neighbor required for the recursive classification algorithm to be skipped. Defaults to 0.98. If TRUE and the query sequence is identical to at least one of the training sequences used to learn the tree, the common ancestor of the matching training sequences is returned with a score of NA. If a value between 0 and 1 is provided, the common ancestor of the training sequences with similarity greater than or equal to `ping` is returned, again with a score of NA. If `ping` is set to 0 or FALSE, the recursive classification algorithm is applied to all sequences, regardless of proximity to those in the training set. For high values (e.g. `ping >= 0.98`) the output will generally specify the taxonomic ID to species or genus level; however, a higher rank may be returned for low-resolution genetic markers.
`mincount`	integer, the minimum number of training sequences belonging to a selected child node for the classification to progress. Defaults to 5.
`offset`	log-odds score offset parameter governing whether the minimum score is met at each node. Defaults to 0. Values above 0 increase precision (fewer type I errors), values below 0 increase recall (fewer type II errors).
`ranks`	character vector giving the taxonomic ranks to be included in the output table. Each element must be a valid rank from the taxonomy database attributed to the classification tree (`attr(tree, "taxonomy")`). Set to NULL to exclude taxonomic ranks from the output table.
`species`	character string indicating which species-level classifications to include in the output: all of them (`species = "all"`), only those generated by exact matching (`species = "ping100"`, the default setting), or only those generated by exact matching or near-neighbor searching (`species = "ping"`). If `species = "ping"` or `species = "ping100"`, non-matched species are returned at genus level. Alternatively, if `species = "none"`, all species-level classifications are returned at genus level.
`tabulize`	logical indicating whether sequence counts should be attached to the output table. If TRUE, the output table will have one row for each unique sequence, and columns will include counts for each sample (where sample names precede sequence identifiers in the input object; see details below).
`metadata`	logical indicating whether to include additional columns containing the paths, individual node scores and reasons for termination. Defaults to FALSE. Included for advanced use and debugging.
`cores`	integer giving the number of processors for multithreading (defaults to 1). This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, for example by running `parallel::stopCluster(cores)`. The string 'autodetect' is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available.
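
To illustrate how `threshold` and `decay` interact, here is a minimal base-R sketch. It is not the package's internal code: the helper `stop_depth()` and the example weights are hypothetical, and the sketch only assumes that each node along a path from the root yields one Akaike weight for the selected child.

```r
## Hypothetical helper (not part of the insect package): counts how many
## nodes along a path are accepted before the score drops below the threshold.
stop_depth <- function(weights, threshold = 0.8, decay = FALSE) {
  ## with decay, each score is the cumulative product of the weights so far;
  ## without decay, each weight is judged on its own
  scores <- if (decay) cumprod(weights) else weights
  failed <- which(scores < threshold)
  ## number of nodes accepted before the first failure
  if (length(failed) == 0) length(weights) else failed[1] - 1
}

## made-up Akaike weights for the selected child at four successive nodes
w <- c(0.99, 0.95, 0.90, 0.85)
stop_depth(w, decay = FALSE)  # 4: every weight is >= 0.8 on its own
stop_depth(w, decay = TRUE)   # 3: the cumulative product falls to about 0.72 at node 4
```

Under these assumptions, enabling `decay` stops the descent one node earlier, which is the more conservative behavior.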

This function requires a pre-computed classification tree of class "insect", which is a dendrogram object with additional attributes (see `learn` for details). Query sequences obtained from the same primer set used to construct the tree are classified to produce taxonomic IDs with an associated degree of confidence.

The classification algorithm works as follows: starting from the root node of the tree, the log-likelihood of the query sequence (the log-probability of the sequence given a particular model) is computed for each of the models occupying the two child nodes using the forward algorithm (see Durbin et al., 1998). The competing likelihood values are then compared by computing their Akaike weights (Johnson and Omland, 2004). If one model is overwhelmingly more likely to have produced the sequence than the other, that child node is chosen and the classification is updated to reflect the taxonomic ID stored at the node. This procedure is repeated, continuing down the tree until either a model comparison test returns an inconclusive result (i.e. the Akaike weight is lower than the pre-defined `threshold`, e.g. 0.9), or a terminal leaf node is reached, at which point a species-level classification is generally returned.

The function outputs a table with one row for each input sequence. Output table fields include "name" (the unique sequence identifier), "taxID" (the taxonomic identification number from the taxonomy database), "taxon" (the name of the taxon), "rank" (the rank of the taxon, e.g. species, genus, family, etc.), and "score" (the Akaike weight from the model selection procedure).

Note that if `decay = TRUE`, the Akaike weight ‘decays’ as it moves down the tree: the score at each node is the cumulative product of all preceding Akaike weight values. This minimizes the chance of type I taxon ID errors (overclassifications and misclassifications). With the default `decay = FALSE`, each Akaike weight is evaluated independently of that of the parent node.
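
The model comparison step can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: the function `akaike_weights()`, its `k` argument (number of parameters per model) and the log-likelihood values are all hypothetical. It uses the standard definition of Akaike weights, with AIC = 2k - 2 log L; when both child models have the same number of parameters, the penalty term cancels.

```r
## Hypothetical helper: Akaike weights for a set of competing models.
## loglik: log-likelihoods of the query under each model
## k:      parameter counts (assumed equal by default, so the penalty cancels)
akaike_weights <- function(loglik, k = rep(0, length(loglik))) {
  aic <- 2 * k - 2 * loglik
  delta <- aic - min(aic)        # AIC difference from the best model
  w <- exp(-0.5 * delta)         # relative likelihood of each model
  w / sum(w)                     # normalize so the weights sum to 1
}

## made-up log-likelihoods of one query under the two child-node models
ll <- c(left = -250, right = -253)
w <- akaike_weights(ll)
round(w, 3)                      # left 0.953, right 0.047

## the better child is followed only if its weight meets the threshold
names(which.max(w))              # "left"
max(w) >= 0.8                    # TRUE
```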
The output table also includes the higher taxonomic ranks specified in the `ranks` argument. If `metadata = TRUE`, additional columns are included: "path" (the path of the sequence through the classification tree), "scores" (the scores at each node through the tree, UTF-8-encoded), and "reason", a code indicating why the recursive classification procedure was terminated:

0 reached leaf node
1 failed to meet minimum score threshold at inner node
2 failed to meet minimum score of training sequences at inner node
3 sequence length shorter than minimum length of training sequences at inner node
4 sequence length exceeded maximum length of training sequences at inner node
5 nearest neighbor in training set does not belong to selected node (obsolete)
6 node is supported by too few sequences
7 reserved
8 sequence could not be translated (amino acids only)
9 translated sequence contains stop codon(s) (amino acids only)

Additional columns detailing the nearest neighbor search include "NNtaxID", "NNtaxon", "NNrank", and "NNdistance".
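
When inspecting output produced with `metadata = TRUE`, it can help to translate the reason codes listed above into readable labels. The following base-R sketch shows one way to do so; the lookup vector `reason_labels`, the helper `describe_reason()` and the example codes are hypothetical and not part of the package.

```r
## Lookup table built from the reason codes documented above
reason_labels <- c(
  "0" = "reached leaf node",
  "1" = "failed to meet minimum score threshold at inner node",
  "2" = "failed to meet minimum score of training sequences at inner node",
  "3" = "sequence length shorter than minimum length of training sequences at inner node",
  "4" = "sequence length exceeded maximum length of training sequences at inner node",
  "5" = "nearest neighbor in training set does not belong to selected node (obsolete)",
  "6" = "node is supported by too few sequences",
  "7" = "reserved",
  "8" = "sequence could not be translated (amino acids only)",
  "9" = "translated sequence contains stop codon(s) (amino acids only)"
)

## Hypothetical helper: map codes to labels (unknown codes give NA)
describe_reason <- function(reason) {
  unname(reason_labels[as.character(reason)])
}

## made-up codes standing in for the "reason" column of a classify() result
reasons <- c(0, 1, 6, 1)
describe_reason(reasons)
table(describe_reason(reasons))  # how often each termination reason occurs
```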

a data.frame.

Shaun Wilkinson

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Johnson JB, Omland KS (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution, 19, 101-108.

learn

  library(insect)
  data(whales)
  data(whale_taxonomy)
  ## use all sequences except first one to train the classifier
  set.seed(999)
  tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2)
  ## find predicted lineage for first sequence
  classify(whales[1], tree)
  ## compare with actual lineage
  taxID <- as.integer(gsub(".+\\|", "", names(whales)[1]))
  get_lineage(taxID, whale_taxonomy)