rdpTrain: Training the RDP classifier
In microclass: Methods for Taxonomic Classification of Prokaryotes

Training the RDP presence/absence K-mer method on sequence data.

rdpTrain(sequence, taxon, K = 8, cnames = FALSE)

`sequence`	Character vector of 16S sequences.
`taxon`	Character vector of taxon labels for each sequence.
`K`	Word length (integer).
`cnames`	Logical indicating if column names should be added to the trained model matrix.

The training step of the RDP method consists of finding the K-mers in every sequence and computing, for each unique taxon, the probability that each K-mer is present. This is an attempt to re-implement the method described by Wang et al. (2007), but without the bootstrapping. See that publication for all details.
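The idea of presence/absence training can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers `kmers_in` and `train_presence` are hypothetical names, the toy sequences are made up, and the probabilities are plain unsmoothed fractions, whereas Wang et al. (2007) use a smoothed estimate.

```r
# Hypothetical helper: the distinct overlapping words of length K in one sequence
kmers_in <- function(s, K) {
  n <- nchar(s)
  if (n < K) return(character(0))
  unique(substring(s, 1:(n - K + 1), K:n))
}

# Hypothetical helper: for each taxon, the fraction of its sequences
# that contain each possible word (no smoothing, unlike Wang et al.)
train_presence <- function(sequence, taxon, K = 2) {
  bases <- c("A", "C", "G", "T")
  # Build all 4^K words in lexicographic order
  words <- ""
  for (i in seq_len(K)) {
    words <- as.vector(outer(bases, words, function(b, w) paste0(w, b)))
  }
  taxa <- sort(unique(taxon))
  counts <- matrix(0, nrow = length(taxa), ncol = length(words),
                   dimnames = list(taxa, words))
  for (i in seq_along(sequence)) {
    present <- intersect(kmers_in(sequence[i], K), words)
    if (length(present) > 0) {
      counts[taxon[i], present] <- counts[taxon[i], present] + 1
    }
  }
  # Divide each row by the number of sequences in that taxon
  counts / as.vector(table(taxon)[taxa])
}

# Toy data (made up for illustration)
sequence <- c("ACGT", "ACGA", "TTTT")
taxon <- c("a", "a", "b")
model <- train_presence(sequence, taxon, K = 2)
dim(model)          # one row per taxon, 4^K columns
model["a", "GT"]    # "GT" occurs in one of the two sequences of taxon "a"
```

A short word length is used here only to keep the matrix small enough to inspect by eye.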

The word length K is 8 by default, since this is the value used by Wang et al. Larger values may lead to memory problems, since the trained model is a matrix with 4^K columns. Adding the K-mers as column names will slow down all computations.
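To get a feel for how quickly the model matrix grows with K, the following sketch estimates its size. It assumes the matrix is stored as double-precision values (8 bytes each) and uses a hypothetical count of 1000 taxa; neither figure comes from the package itself.

```r
K <- 6:12
n_taxa <- 1000                        # hypothetical number of unique taxa
n_columns <- 4^K                      # one column per possible word of length K
gib <- n_taxa * n_columns * 8 / 2^30  # assumes 8 bytes per matrix element
data.frame(K = K, columns = n_columns, GiB = round(gib, 2))
```

Each increment of K multiplies the number of columns, and therefore the memory estimate, by four.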

The relative taxon sizes are also computed, and returned as an attribute to the model matrix. They may be used as empirical priors in the classification step.
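As an illustration of what relative taxon sizes mean, they can be computed directly from the label vector in base R. This is only a sketch with made-up labels; how the package stores them on the model matrix is not shown here.

```r
# Made-up taxon labels, one per training sequence
taxon <- c("Bacillus", "Bacillus", "Bacillus", "Vibrio")

# Fraction of training sequences belonging to each taxon
prior <- prop.table(table(taxon))
prior
```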

A list with two elements. The first element is Method, which is the text "RDPclassifier" in this case. The second element is Fitted, which is a matrix with one row for each unique taxon and one column for each possible word of length K. The value in row i and column j is the probability that word j is present in taxon i.

Kristian Hovde Liland and Lars Snipen.

Wang, Q, Garrity, GM, Tiedje, JM, Cole, JR (2007). Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73: 5261-5267.

rdpClassify.