Trains a classifier based on a reference taxonomy containing sequence representatives assigned to taxonomic groups.
train |
An |
taxonomy |
Character string providing the reference taxonomic assignment for each sequence in |
rank |
Optionally, a |
K |
Integer specifying the k-mer size or |
N |
Numeric indicating the approximate number of k-mers that can be randomly selected before one is found by chance on average. For example, the default value of |
minFraction |
Numeric giving the minimum fraction of k-mers to sample during the initial tree descent phase of the classification algorithm. (See details section below.) |
maxFraction |
Numeric giving the maximum fraction of k-mers to sample during the initial tree descent phase of the classification algorithm. (See details section below.) |
maxIterations |
Integer specifying the maximum number of iterations to attempt re-classification of a training sequence before declaring it a “problem sequence”. (See details section below.) |
multiplier |
Numeric indicating the degree to which individual sequences have control over the fraction of k-mers sampled at any edge during the initial tree descent phase of the classification algorithm. (See details section below.) |
maxChildren |
Integer giving the maximum number of child taxa of any taxon at which to consider further descending the taxonomic tree. A value of |
alphabet |
Character vector of amino acid groupings used to reduce the 20 standard amino acids into smaller groups. Alphabet reduction helps to find more distant homologies between sequences. A non-reduced amino acid alphabet can be used by setting |
verbose |
Logical indicating whether to display progress. |
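To illustrate the alphabet argument described above, here is a minimal base R sketch of how a reduced alphabet collapses residues into groups. The groupings are hypothetical and chosen only for illustration (they are not necessarily the default used by LearnTaxa), and reduce_alphabet is an illustrative helper, not a DECIPHER function.

```r
# Hypothetical amino acid groupings covering the 20 standard residues
# (illustration only; not necessarily LearnTaxa's default alphabet)
groups <- c("AST", "C", "DE", "FWY", "G", "H", "ILMV", "KR", "NQ", "P")

# Build a lookup from each residue to the index of its group
residues <- unlist(strsplit(groups, ""))
lookup <- setNames(rep(seq_along(groups), nchar(groups)), residues)

# Illustrative helper: map a protein string to its group indices
reduce_alphabet <- function(seq, lookup) {
  chars <- strsplit(seq, "")[[1]]
  unname(lookup[chars])
}

# Two different peptides that become identical in the reduced alphabet
reduced1 <- reduce_alphabet("MKVLDE", lookup)
reduced2 <- reduce_alphabet("IRILED", lookup)
identical(reduced1, reduced2)
```

Because residues within a group are treated as interchangeable, k-mers built on the reduced alphabet can match between sequences that differ by conservative substitutions.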
Learning about the training data is a two-part process consisting of (i) forming a taxonomic tree and then (ii) ensuring that the training sequences can be correctly reclassified. The latter step relies on reclassifying the sequences in train by descending the taxonomic tree, a process termed “tree descent”. Ultimately, the goal of tree descent is to quickly and accurately narrow the selection of groups where a sequence may belong. During the learning process, tree descent is tuned so that it performs well when classifying new sequences.
The process of training the classifier first involves learning the taxonomic tree spanning all of the reference sequences in train. Typically, reference taxonomic classifications are provided by an authoritative source, oftentimes along with a “taxid” file containing taxonomic rank information. The taxonomic tree may contain any number of levels (e.g., Root, Phylum, Class, Order, Family, Genus) as long as they are hierarchically nested and always begin with “Root”.
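As a rough illustration of what "hierarchically nested and beginning with Root" implies, the following base R sketch derives a tree from a few taxonomy strings. The labels are hypothetical, and the parent and depth encodings here are assumptions made for the example; they are not claimed to match the internal layout of the object returned by LearnTaxa.

```r
# Hypothetical taxonomy strings; each must begin with "Root"
taxonomy <- c(
  "Root; Escherichia; coli",
  "Root; Escherichia; albertii",
  "Root; Salmonella; enterica"
)

# Split each label into its nested levels and check the root
levels_list <- strsplit(taxonomy, "; ", fixed = TRUE)
stopifnot(all(vapply(levels_list, `[`, character(1), 1) == "Root"))

# Enumerate every node as the cumulative path from the root
paths <- unique(unlist(lapply(levels_list, function(x) {
  vapply(seq_along(x),
         function(i) paste(x[seq_len(i)], collapse = "; "),
         character(1))
})))

# A node's parent is its path with the last level removed
parent_path <- sub("; [^;]+$", "", paths)
parents <- match(parent_path, paths)
parents[paths == "Root"] <- 0L  # the root has no parent

# Depth of each node in the tree (Root is depth 1)
depth <- lengths(strsplit(paths, "; ", fixed = TRUE))

data.frame(path = paths, parent = parents, depth = depth)
```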
The second phase of training the classifier, tree descent, involves learning the optimal set of k-mers for discerning between the different sub-groups under each edge. Here a fraction of the k-mers with the greatest discerning power are matched to a training sequence, and this process is repeated with 100 random subsamples to decide on the set of possible taxonomic groups to which a training sequence may belong.
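The repeated subsampling idea can be sketched with a toy example in base R. Everything below is an assumption for illustration: the k-mer profiles, the query, the sampled fraction, and the simple sum-of-frequencies score are invented and do not reproduce the scoring actually used by DECIPHER. Only the use of 100 random subsamples comes from the description above.

```r
set.seed(1)

# Hypothetical relative frequencies of 6 decision k-mers in 2 sub-groups
profiles <- rbind(
  groupA = c(0.9, 0.8, 0.7, 0.1, 0.1, 0.2),
  groupB = c(0.1, 0.2, 0.1, 0.9, 0.8, 0.7)
)

# Which decision k-mers occur in the query sequence (mostly groupA-like)
query_has <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

fraction <- 0.5    # assumed fraction of k-mers sampled per replicate
replicates <- 100  # number of random subsamples, as in the text

votes <- replicate(replicates, {
  # Sample a subset of the decision k-mers
  picked <- sample(ncol(profiles), size = ceiling(fraction * ncol(profiles)))
  # Keep only the sampled k-mers that are present in the query
  present <- picked[query_has[picked]]
  # Toy score: sum of each group's frequencies over the matched k-mers
  scores <- rowSums(profiles[, present, drop = FALSE])
  names(which.max(scores))
})

# Groups receiving votes form the candidate set for further descent
table(votes)
```

Groups that win at least some replicates would remain candidates, which is how subsampling can yield a set of possible groups rather than a single answer.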
The learning process works by attempting to correctly re-classify each training sequence in the taxonomy. Initially, maxFraction of informative k-mers are repeatedly sampled at each edge during tree descent. Training sequences that are incorrectly classified at an edge will lower the fraction of k-mers that are sampled by an amount that is proportional to multiplier. As the fraction of sampled k-mers decreases, the tree descent process terminates at higher rank levels.
A major advantage of tree descent is that it both speeds up the classification process and indicates where the training set likely contains mislabeled sequences or incorrectly placed taxonomic groups. Training sequences that are not correctly classified within maxIterations are marked as “problem sequences”, because it is likely that they are mislabeled. If enough sequences have difficulty being correctly classified at an edge that the fraction drops below minFraction, then the edge is recorded as a “problem group”.
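The interplay between maxFraction, multiplier, and minFraction can be pictured with a small base R sketch. The numeric values, the base_step unit, and the linear update rule are all assumptions made for illustration; the source states only that the decrease is proportional to multiplier, not the exact formula.

```r
# Hypothetical settings; the names mirror the arguments described above
min_fraction <- 0.01
max_fraction <- 0.06
multiplier   <- 2
base_step    <- 0.005  # assumed unit decrement, not a DECIPHER parameter

# Assumed number of misclassifications observed at three edges
misclassified <- c(edge1 = 0, edge2 = 3, edge3 = 6)

# Assumed rule: each misclassification lowers the edge's fraction
# by multiplier * base_step, starting from max_fraction
fraction <- max_fraction - misclassified * multiplier * base_step

# Edges whose fraction falls below min_fraction are flagged
problem_groups <- names(fraction)[fraction < min_fraction]

fraction
problem_groups
```

In this toy setting a larger multiplier makes each misclassified sequence pull the fraction down faster, so edges are flagged as problem groups after fewer errors.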
The final result is an object that can be used for classification with IdTaxa, as well as information about train that could be used to help correct any errors in the taxonomy.
An object of class Taxa and subclass Train, which is stored as a list with components:
taxonomy |
A character vector containing all possible groups in the taxonomy. |
taxa |
A character vector containing the basal taxon in each taxonomy. |
ranks |
A character vector of rank names for each taxon, or |
levels |
Integer giving the rank level of each taxon. |
children |
A list containing the index of all children in the taxonomy for each taxon. |
parents |
An integer providing the index of the parent for each taxon. |
fraction |
A numeric between |
sequences |
List containing the integer indices of sequences in |
kmers |
List containing the unique sorted k-mers (converted to integers) belonging to each sequence in |
crossIndex |
Integer indicating the index in taxonomy of each sequence's taxonomic label. |
K |
The value of |
IDFweights |
Numeric vector of length |
decisionKmers |
List of informative k-mers and their associated relative frequencies for each internal edge in the taxonomy. |
problemSequences |
A |
problemGroups |
Character vector containing any taxonomic groups that repeatedly had problems with correctly re-classifying sequences in |
alphabet |
The |
If K is NULL, the automatically determined value of K might be too large on some machines, resulting in an error. In such cases it is recommended that K be set manually to a smaller value.
Erik Wright eswright@pitt.edu
Murali, A., et al. (2018). IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome, 6, 140. https://doi.org/10.1186/s40168-018-0521-5
# load the package that provides LearnTaxa and readDNAStringSet
library(DECIPHER)
# import training sequences
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
# parse the headers to obtain a taxonomy
s <- strsplit(names(dna), " ")
genus <- sapply(s, `[`, 1)
species <- sapply(s, `[`, 2)
taxonomy <- paste("Root", genus, species, sep="; ")
head(taxonomy)
# train the classifier
## Not run:
trainingSet <- LearnTaxa(dna, taxonomy)
trainingSet
# view information about the classifier
plot(trainingSet)
# train the classifier with amino acid sequences
aa <- translate(dna)
trainingSetAA <- LearnTaxa(aa, taxonomy)
trainingSetAA
## End(Not run)