learn: Informatic sequence classification tree learning.


View source: R/learn.R


This function learns a classification tree from a reference sequence database using a recursive partitioning procedure.


learn(x, db = NULL, model = NULL, refine = "Viterbi",
  iterations = 50, nstart = 20, minK = 2, maxK = 2,
  minscore = 0.9, probs = 0.5, retry = TRUE, resize = TRUE,
  maxsize = 1000, recursive = TRUE, cores = 1, quiet = FALSE,
  verbose = FALSE, numcode = NULL, frame = NULL, ...)



x

a reference database of class "DNAbin" representing a list of DNA sequences to be used as the training data. All sequences should be from the same genetic region of interest and be globally alignable (i.e. without unjustified end-gaps). The sequences must have "names" attributes, either in RDP format (containing semicolon-delimited lineage strings), or including taxonomic ID numbers that correspond to those in the taxonomy database db (separated from the sequence ID by a "|" character). For example: "AF296347|30962", "AF296346|8022", "AF296345|8017", etc. See searchGB for more details on creating the reference sequence database and taxonomy for the associated hierarchical taxonomic database.
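The "sequenceID|taxID" naming convention can be checked before training. The following is a minimal base-R sketch; the helper name parse_taxid is hypothetical and is not part of the insect package.

```r
# Hypothetical helper (not part of insect): split "seqID|taxID" names
parse_taxid <- function(nms) {
  # every name must end in "|" followed by digits
  ok <- grepl("\\|[0-9]+$", nms)
  if (!all(ok)) {
    stop("names lacking a '|taxID' suffix: ", paste(nms[!ok], collapse = ", "))
  }
  data.frame(
    seqID = sub("\\|[0-9]+$", "", nms),          # part before the "|"
    taxID = as.integer(sub("^.*\\|", "", nms)),  # part after the "|"
    stringsAsFactors = FALSE
  )
}

nms <- c("AF296347|30962", "AF296346|8022", "AF296345|8017")
ids <- parse_taxid(nms)
ids
```

Applied to a real training set, the same helper could be run on names(x) to confirm that every taxID also appears in the taxonomy database.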


db

a hierarchical taxonomy database in the form of a data.frame. Cannot be NULL unless the training data is in RDP format (containing semicolon-delimited lineage strings). The object should have four columns, labeled "taxID", "parent_taxID", "rank" and "name". The first two should be numeric, and all ID numbers in the "parent_taxID" column should link to those in the "taxID" column. The exception is the first row, which should have parent_taxID = 0 and name = "root". See taxonomy for more details.
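As an illustration of the layout described here, the sketch below builds a toy taxonomy and checks the stated rules. The taxon IDs and names are made up, and check_taxonomy is a hypothetical helper, not a function of the insect package.

```r
# Toy taxonomy (illustrative values only, not real taxon IDs)
db <- data.frame(
  taxID        = c(1, 2, 3, 4),
  parent_taxID = c(0, 1, 2, 2),
  rank         = c("no rank", "family", "genus", "genus"),
  name         = c("root", "Family A", "Genus A1", "Genus A2"),
  stringsAsFactors = FALSE
)

# Hypothetical check (not part of insect) of the rules described above
check_taxonomy <- function(db) {
  needed <- c("taxID", "parent_taxID", "rank", "name")
  all(needed %in% names(db)) &&                          # four required columns
    is.numeric(db$taxID) && is.numeric(db$parent_taxID) &&
    db$parent_taxID[1] == 0 && db$name[1] == "root" &&   # first row is the root
    all(db$parent_taxID[-1] %in% db$taxID)               # every parent exists
}

check_taxonomy(db)
```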


model

an optional object of class "PHMM" providing the starting parameters. Used to train (optimize parameters for) subsequent nested models to be positioned at successive sub-nodes. If NULL, the root model is derived from the sequence list prior to the recursive partitioning process.


refine

character string giving the iterative model refinement method to be used in the partitioning process. Valid options are "Viterbi" (Viterbi training; the default option) and "BaumWelch" (a modified version of the Expectation-Maximization algorithm).


iterations

integer giving the maximum number of training-classification iterations to be used in the splitting process. Note that this is not necessarily the same as the number of Viterbi training or Baum-Welch iterations to be used in model training, which can be set using the argument "maxiter" (eventually passed to train via the dots argument "...").


nstart

integer. The number of random starting sets to be chosen for initial k-means assignment of sequences to groups. Defaults to 20.


minK

integer. The minimum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).


maxK

integer. The maximum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).


minscore

numeric between 0 and 1. The minimum acceptable value for the nth percentile of Akaike weights (where n is the value given in "probs") for the node to be split and the recursion process to continue. At any given node, if the nth percentile of Akaike weights falls below this threshold, the recursion process for the node will terminate. As an example, if minscore = 0.9 and probs = 0.5 (the default settings), and after generating two candidate PHMMs to occupy the candidate subnodes the median of Akaike weights is 0.89, the splitting process will terminate and the function will simply return the unsplit root node.


probs

numeric between 0 and 1. The percentile of Akaike weights to test against the minimum score threshold given in "minscore".
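The stopping rule defined by "minscore" and "probs" can be sketched in a few lines of base R. This is only an illustration of the rule as described: split_ok is a hypothetical helper, the weights are made up, and the use of quantile() with its default interpolation is an assumption rather than a statement about how the package computes the percentile internally.

```r
# Hypothetical helper (not part of insect): does a candidate split pass?
split_ok <- function(weights, minscore = 0.9, probs = 0.5) {
  # with probs = 0.5 this compares the median Akaike weight to minscore
  unname(quantile(weights, probs = probs)) >= minscore
}

# Made-up Akaike weights for illustration
w_fail <- c(0.99, 0.95, 0.89, 0.80, 0.70)  # median is 0.89
w_pass <- c(0.99, 0.97, 0.95, 0.92, 0.60)  # median is 0.95

split_ok(w_fail)               # median below 0.9, so the node is not split
split_ok(w_pass)               # median above 0.9, so the split is accepted
split_ok(w_pass, probs = 0.1)  # a lower percentile is a stricter test
```

Lowering "probs" makes the test depend on the worst-scoring sequences, so a single poorly discriminated sequence is more likely to stop the recursion.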


retry

logical indicating whether failure to split a node based on the criteria outlined in "minscore" and "probs" should prompt a second attempt with different initial groupings. These groupings are based on maximum k-mer frequencies rather than k-means division, which can give suboptimal groupings when the cluster sizes are different (due to the up-weighting of larger clusters in the k-means algorithm).


resize

logical indicating whether the models should be free to change size during the training process or if the number of modules should be fixed. Defaults to TRUE. Only applicable if refine = "Viterbi".


maxsize

integer giving the upper bound on the number of modules in the PHMMs. If NULL, no maximum size is enforced.


recursive

logical indicating whether the splitting process should continue recursively until the discrimination criteria are not met (TRUE; default), or whether a single split should take place at the root node.


cores

integer giving the number of processors for multithreading. Defaults to 1. This argument may alternatively be a "cluster" object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, e.g. by running parallel::stopCluster(cores). The string "autodetect" is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available.
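The two alternatives to a plain integer can be sketched with the parallel package. The learn call is left as a comment because x and db stand in for the user's own training data; the parSapply line is only a stand-in workload so that the sketch runs on its own.

```r
# The "autodetect" rule described above: all available cores but one
n <- parallel::detectCores()
ncores <- if (is.na(n)) 1L else max(1L, n - 1L)

# A user-created cluster must also be closed by the user
cl <- parallel::makeCluster(2)
# tree <- learn(x, db = db, cores = cl)  # x and db are placeholders
res <- parallel::parSapply(cl, 1:4, function(i) i^2)  # stand-in workload
parallel::stopCluster(cl)  # close the socket connections
```

Creating the cluster once and passing it in can be convenient when several trees are learned in one session, since the workers are then reused rather than restarted.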


quiet

logical indicating whether feedback should be printed to the console.


verbose

logical indicating whether extra feedback should be printed to the console, including progress at each split.

numcode, frame

passed to translate. Set to NULL (default) unless learning a hybrid DNA/amino acid sequence classifier.


...

further arguments to be passed on to train.


The "insect" object type is a dendrogram with several additional attributes stored at each node. These include: "clade" the index of the node (see further details below); "sequences" the indices of the sequences in the reference database used to create the object; "taxID" the taxonomic identifier of the lowest common taxon of the sequences belonging to the node (linking to "db"); "minscore" the lowest likelihood among the training sequences given the profile HMM stored at the node; "minlength" the minimum length of the sequences belonging to the node; "maxlength" the maximum length of the sequences belonging to the node; "model" the profile HMM derived from the sequence subset belonging to the node; "nunique" the number of unique sequences belonging to the node; "ntotal" the total number of sequences belonging to the node (including duplicates); "key" the hash key used for exact sequence matching (bypasses the classification procedure if an exact match is found; root node only); "taxonomy" the taxonomy database containing the taxon ID numbers (root node only).

The clade indexing system used here is based on character strings, where "0" refers to the root node, "01" is the first child node, "02" is the second child node, "011" is the first child node of the first child node, etc. The leading zero may be omitted for brevity. Because each level is encoded by a single digit, an inner node cannot have more than 9 child nodes.
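The indexing scheme can be illustrated with simple string operations. The helpers below are hypothetical (not part of the insect package) and assume that the leading zero is retained.

```r
# Hypothetical helpers (not part of insect) for clade index strings
clade_child <- function(clade, i) {
  stopifnot(i >= 1, i <= 9)  # one digit per level, hence at most 9 children
  paste0(clade, i)
}
clade_parent <- function(clade) {
  if (nchar(clade) <= 1) return(NA_character_)  # the root has no parent
  substr(clade, 1, nchar(clade) - 1)            # drop the last digit
}
clade_depth <- function(clade) nchar(clade) - 1  # the root "0" has depth 0

clade_child("0", 2)   # second child of the root
clade_parent("011")   # parent of the first grandchild
clade_depth("011")    # number of splits between the root and this node
```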


an object of class "insect".


Shaun Wilkinson


Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology, 5, 21.

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Gerstein M, Sonnhammer ELL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067-1078.

Juang B-H, Rabiner LR (1990) The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.


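  ## load the package, which provides the example datasets used below
  library(insect)
  ## use all sequences except first one to train the classifier
  ## (maxiter is passed on to train via the dots argument)
  tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2)
  ## find predicted lineage for first sequence
  classify(whales[1], tree)
  ## compare with actual lineage
  ## (the taxon ID is the part of the sequence name after the "|")
  taxID <- as.integer(gsub(".+\\|", "", names(whales)[1]))
  get_lineage(taxID, whale_taxonomy)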

shaunpwilkinson/insect documentation built on Dec. 2, 2018, 7:37 p.m.