multinomTrain | R Documentation |
Training the multinomial K-mer method on sequence data.
multinomTrain(sequence, taxon, K = 5, col.names = FALSE, n.pseudo = 1)
sequence |
Character vector of sequences. |
taxon |
Character vector of taxon labels for each sequence. |
K |
Word length (integer). |
col.names |
Logical indicating if column names (K-mers) should be added to the trained model matrix. |
n.pseudo |
Number of pseudo-counts to use (a positive number, need not be an integer). The special case -1 returns only the word counts, not log-probabilities. |
The training step of the multinomial method (Vinje et al., 2015) counts the K-mers
in all sequences and computes their multinomial probabilities for each taxon.
Before the probabilities are estimated, n.pseudo pseudo-counts are added equally
to all K-mers. The optimal choice of n.pseudo depends on K and the training data set.
Adding the actual K-mers as column names (col.names = TRUE) will slow down the
computations.
The relative taxon frequencies in the taxon input are also computed and
returned as an attribute to the probability matrix.
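The steps described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the function name train_sketch and its internals are hypothetical, it assumes the alphabet A, C, G, T, it silently skips K-mers containing other characters, and it returns plain probabilities without any log transform.

```r
# Hypothetical illustration of the training step; NOT the package's code.
train_sketch <- function(sequence, taxon, K = 2, n.pseudo = 1) {
  # All 4^K possible words over A, C, G, T, in lexicographic order
  grid <- expand.grid(rep(list(c("A", "C", "G", "T")), K),
                      stringsAsFactors = FALSE)
  words <- apply(grid, 1, function(w) paste(rev(w), collapse = ""))

  # Count overlapping K-mers in each sequence (one row per sequence)
  counts_per_seq <- t(sapply(sequence, function(s) {
    n <- nchar(s)
    if (n < K) return(rep(0, length(words)))
    kmers <- substring(s, 1:(n - K + 1), K:n)
    # K-mers outside the alphabet become NA and are dropped by table()
    as.vector(table(factor(kmers, levels = words)))
  }))
  colnames(counts_per_seq) <- words

  # Sum the counts over all sequences belonging to the same taxon
  counts <- rowsum(counts_per_seq, group = taxon)

  # Add pseudo-counts equally to all K-mers, then normalise each row
  probs <- (counts + n.pseudo) / rowSums(counts + n.pseudo)

  # Relative taxon frequencies, stored as an attribute
  attr(probs, "prior") <- as.numeric(table(taxon) / length(taxon))
  probs
}

m <- train_sketch(c("ACGT", "AAAA", "CCGG"), c("x", "x", "y"), K = 2)
rowSums(m)        # each row sums to 1
attr(m, "prior")  # relative frequencies of taxa "x" and "y"
```

With a positive n.pseudo, every K-mer receives a non-zero probability even if it never occurs in the training sequences of a taxon, which is why the choice of n.pseudo interacts with K: longer words leave more K-mers unobserved.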
A matrix with the multinomial probabilities, one row for each taxon and one
column for each K-mer. The sum of each row is 1.0. No probabilities are 0 if
n.pseudo > 0.0.
The matrix has an attribute named "prior" that contains the relative
taxon frequencies.
Kristian Hovde Liland and Lars Snipen.
Vinje, H, Liland, KH, Almøy, T, Snipen, L. (2015). Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics, 16:205.
KmerCount, multinomClassify.
# See examples for multinomClassify