This function learns a classification tree from a reference sequence database using a recursive partitioning procedure.
x 
an object of class 
db 
a hierarchical taxonomy database in the form of a data.frame.
The object should have
four columns, labeled "taxID", "parent_taxID", "rank" and "name".
The first two should be numeric, and all ID numbers in the
"parent_taxID" column should link to those in the "taxID" column.
This excludes the first row,
which should have 
model 
an optional object of class 
refine 
character string giving the iterative model refinement
method to be used in the partitioning process. Valid options are

iterations 
integer giving the maximum number of training-classification
iterations to be used in the splitting process.
Note that this is not necessarily the same as the number of Viterbi training
or Baum-Welch iterations to be used in model training, which can be set
using the argument 
nstart 
integer. The number of random starting sets to be chosen for initial k-means assignment of sequences to groups. Defaults to 20.
minK 
integer. The minimum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).
maxK 
integer. The maximum number of furcations allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifurcating).
minscore 
numeric between 0 and 1. The minimum acceptable value
for the nth percentile of Akaike weights (where n is
the value given in 
probs 
numeric between 0 and 1. The percentile of Akaike weights
to test against the minimum score threshold given in 
retry 
logical indicating whether failure to split a node based on the criteria outlined in 'minscore' and 'probs' should prompt a second attempt with different initial groupings. The second attempt bases its groupings on maximum k-mer frequencies rather than k-means division, since k-means can give suboptimal groupings when the cluster sizes differ (due to the upweighting of larger clusters in the k-means algorithm).
resize 
logical indicating whether the models should be free to
change size during the training process or if the number of modules
should be fixed. Defaults to TRUE. Only applicable if

maxsize 
integer giving the upper bound on the number of modules in the PHMMs. If NULL (default) no maximum size is enforced. 
recursive 
logical indicating whether the splitting process should continue recursively until the discrimination criteria are not met (TRUE; default), or whether a single split should take place at the root node. 
cores 
integer giving the number of CPUs to use
when training the models (only applicable if

quiet 
logical indicating whether feedback should be printed to the console. 
verbose 
logical indicating whether extra feedback should be printed to the console, including progress at each split. 
... 
further arguments to be passed on to 
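The layout required for the "db" argument can be illustrated with a toy table. This is a minimal sketch only: the taxon IDs and names are invented, check_taxonomy is a hypothetical helper that is not part of the package, and the parent ID given to the first (root) row is an arbitrary placeholder, since the check skips that row.

```r
# Toy taxonomy in the four-column layout described for "db".
# All IDs and names below are invented for illustration.
db <- data.frame(
  taxID = c(1, 2, 3, 4),
  parent_taxID = c(NA, 1, 2, 2),  # root parent is a placeholder (assumption)
  rank = c("root", "family", "genus", "genus"),
  name = c("root", "Family A", "Genus B", "Genus C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `db` has the expected columns, numeric IDs,
# and every non-root parent ID appears in the taxID column.
check_taxonomy <- function(db) {
  required <- c("taxID", "parent_taxID", "rank", "name")
  if (!is.data.frame(db) || !all(required %in% names(db))) return(FALSE)
  if (!is.numeric(db$taxID) || !is.numeric(db$parent_taxID)) return(FALSE)
  parents <- db$parent_taxID[-1]  # the first (root) row is excluded
  all(parents %in% db$taxID)
}

check_taxonomy(db)
```

A check of this kind can catch orphaned parent IDs before a long training run is started.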
The "insect" object type is a dendrogram
with several additional attributes stored at each node.
These include:
"clade" the index of the node (see further details below);
"sequences" the indices of the sequences in the reference
database used to create the object;
"taxID" the taxonomic identifier of the lowest common taxon
of the sequences belonging to the node (linking to "db");
"minscore" the lowest likelihood among the training sequences given
the profile HMM stored at the node;
"minlength" the minimum length of the sequences belonging to the node;
"maxlength" the maximum length of the sequences belonging to the node;
"model" the profile HMM derived from the sequence subset belonging to the node;
"nunique" the number of unique sequences belonging to the node;
"ntotal" the total number of sequences belonging to the node (including duplicates);
"key" the hash key used for exact sequence matching
(bypasses the classification procedure if an exact match is found; root node only);
"taxonomy" the taxonomy database containing the taxon ID numbers (root node only).
The clade indexing system used here is based on character strings, where "0" refers to the root node, "01" is the first child node, "02" is the second child node, "011" is the first child node of the first child node, etc. The leading zero may be omitted for brevity. Note that each inner node cannot have more than 9 child nodes, since each level of the tree is encoded by a single digit.
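The clade strings described above can be decoded with a few lines of base R. The following is a sketch under the stated indexing rules; clade_path and clade_parent are hypothetical helper names and are not functions provided by the package.

```r
# Hypothetical helper: convert a clade string such as "012" into the
# sequence of child positions leading from the root to that node.
clade_path <- function(clade) {
  # restore the leading zero if it was omitted for brevity
  if (!startsWith(clade, "0")) clade <- paste0("0", clade)
  digits <- strsplit(clade, "")[[1]][-1]  # drop the root's "0"
  as.integer(digits)
}

# Hypothetical helper: clade string of the parent node (NA for the root).
clade_parent <- function(clade) {
  if (!startsWith(clade, "0")) clade <- paste0("0", clade)
  if (nchar(clade) == 1) return(NA_character_)
  substr(clade, 1, nchar(clade) - 1)
}

clade_path("011")    # first child of the first child
clade_parent("011")  # its parent, the first child of the root
clade_path("12")     # same node as "012", leading zero omitted
```

Because child positions run from 1 to 9, a string that starts with "0" always includes the root, so restoring an omitted zero is unambiguous.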
an object of class "insect".
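Since the returned object is a dendrogram, the node attributes listed in the details can presumably be read with attr(), and child nodes reached with double-bracket indexing. The sketch below uses a stand-in dendrogram built with base R rather than a real "insect" object, and the attribute values attached to it are made up.

```r
# Build a small stand-in dendrogram with base R (not a real "insect" object).
toy <- as.dendrogram(hclust(dist(c(a = 1, b = 2, c = 10))))

# Attach made-up values under two of the attribute names listed in the details.
attr(toy, "clade") <- "0"
attr(toy, "ntotal") <- 3L
attr(toy[[1]], "clade") <- "01"

# Node attributes are read with attr(); child nodes are reached with [[ ]].
root_clade <- attr(toy, "clade")
child_clade <- attr(toy[[1]], "clade")
```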
Shaun Wilkinson
Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology, 5, 21.
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Gerstein M, Sonnhammer ELL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067-1078.
Juang BH, Rabiner LR (1990) The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.
library(insect)
data(whales)
data(whale_taxonomy)
## use all sequences except first one to train the classifier
set.seed(999)
tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2)
## find predicted lineage for first sequence
classify(whales[1], tree)
## compare with actual lineage
## (the taxon ID is taken to be the part of the sequence name after the last "|")
taxID <- as.integer(gsub(".+\\|", "", names(whales)[1]))
get_lineage(taxID, whale_taxonomy)
