allocateCVI: Allocate sequences for cross validation by identity.


View source: R/cvi.R

Description

This function takes a reference sequence database and allocates each sequence to either a query set (a.k.a. test set) or a training set, so that a supervised taxon classifier can be cross-validated. The method is based on that of Edgar (2018), but it uses recursive divisive clustering and retains all sequences rather than discarding those that violate the top-hit identity constraint.

Usage

allocateCVI(x, threshold = 0.9, allocate = "max", ...)

Arguments

x

a set of reference sequences. Can be a "DNAbin" object or a named vector of upper-case DNA character strings.

threshold

numeric between 0 and 1 giving the identity threshold for sequence allocation.

allocate

character giving the method used to allocate eligible sequences to the query set. Options are "max" (default), which chooses the larger node from each eligible pair in order to maximize the size of the query set, or "sample", which randomly chooses one node from each eligible pair.

...

further arguments to pass to "kmeans"
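
The difference between the two allocate options can be illustrated with a toy sketch in base R. This is not the package's internal code: the pair sizes, the helper choose_node, and the tie-breaking rule are all assumptions made up for illustration.

```r
# Hypothetical sizes of the two child nodes in each eligible pair
# (toy numbers, not taken from the package).
pair_sizes <- matrix(c(12, 3,
                        5, 9,
                        7, 7),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c("node1", "node2")))

# Hypothetical helper: return the index (1 or 2) of the node chosen
# from each pair under the given allocation method.
choose_node <- function(sizes, allocate = c("max", "sample")) {
  allocate <- match.arg(allocate)
  if (allocate == "max") {
    # Take the larger node of each pair; ties go to the first node
    # (the tie rule is an assumption).
    max.col(sizes, ties.method = "first")
  } else {
    # Take one node of each pair uniformly at random.
    sample(1:2, size = nrow(sizes), replace = TRUE)
  }
}

# Total number of sequences that would land in the query set.
query_size <- function(sizes, chosen) {
  sum(sizes[cbind(seq_len(nrow(sizes)), chosen)])
}

chosen_max <- choose_node(pair_sizes, "max")
query_size_max <- query_size(pair_sizes, chosen_max)

set.seed(42)  # random choice, so fix the seed for reproducibility
chosen_sample <- choose_node(pair_sizes, "sample")
query_size_sample <- query_size(pair_sizes, chosen_sample)
```

In this sketch the "max" choice can never give a smaller query set than the "sample" choice, which is the trade-off the argument description points to: "max" favors query-set size, while "sample" adds randomness to which sequences are held out.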

Value

a logical vector the same length as the input object, indicating which sequences should be allocated to the query set
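
A minimal sketch of how the returned logical vector might be used to split a reference set, written in base R. The sequences are made up, and is_query is a hand-written stand-in for the value that allocateCVI would return, on the assumption that TRUE marks a query sequence.

```r
# Toy reference set: a named vector of upper-case DNA strings (made up).
refs <- c(seqA = "ACGTACGTAC",
          seqB = "ACGTACGTTC",
          seqC = "TTGACCGTAA",
          seqD = "TTGACCGTAT")

# Hypothetical stand-in for the result of allocateCVI(refs);
# TRUE is assumed to mean "allocate to the query set".
is_query <- c(TRUE, FALSE, FALSE, TRUE)

# The vector must be logical and match the input length.
stopifnot(is.logical(is_query), length(is_query) == length(refs))

# Split the references into the two cross-validation sets.
query_set <- refs[is_query]
training_set <- refs[!is_query]
```

Because the vector has the same length as the input, logical subsetting keeps the sequence names intact in both sets.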

Author(s)

Shaun Wilkinson

References

Edgar RC (2018) Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ 6:e4652. DOI 10.7717/peerj.4652

Examples


shaunpwilkinson/insect documentation built on Aug. 9, 2021, 5 a.m.