IdTaxa: Assign Sequences a Taxonomic Classification

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/IdTaxa.R

Description

Classifies sequences according to a training set by assigning a confidence to taxonomic labels for each taxonomic level.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
IdTaxa(test,
       trainingSet,
       type = "extended",
       strand = "both",
       threshold = 60,
       bootstraps = 100,
       samples = L^0.47,
       minDescend = 0.98,
       fullLength = 0,
       processors = 1,
       verbose = TRUE)

Arguments

test

An AAStringSet, DNAStringSet, or RNAStringSet of unaligned sequences.

trainingSet

An object of class Taxa and subclass Train.

type

Character string indicating the type of output desired. This should be (an abbreviation of) one of "extended" or "collapsed". (See value section below.)

strand

Character string indicating the orientation of the test sequences relative to the trainingSet. This should be (an abbreviation of) one of "both", "top", or "bottom". The top strand is defined as the input test sequences being in the same orientation as the trainingSet, and the bottom strand is its reverse complement orientation. The default of "both" will classify using both orientations and choose the result with highest confidence. Choosing the correct strand will make classification over 2-fold faster, assuming that all of the reads are in the same orientation. Note that strand is ignored when test is an AAStringSet.

threshold

Numeric specifying the confidence at which to truncate the output taxonomic classifications. Lower values of threshold will classify deeper into the taxonomic tree at the expense of accuracy, and vise-versa for higher values of threshold.

bootstraps

Integer giving the maximum number of bootstrap replicates to perform for each sequence. The number of bootstrap replicates is set automatically such that (on average) 99% of k-mers are sampled in each test sequence.

samples

A function or call written as a function of ‘L’, which will evaluate to a numeric vector the same length as ‘L’. Typically of the form “A + B*L^C”, where ‘A’, ‘B’, and ‘C’ are constants.

minDescend

Numeric giving the minimum fraction of bootstraps required to descend the tree during the initial tree descend phase of the algorithm. Higher values are less likely to descend the tree, causing direct comparison against more sequences in the trainingSet. Lower values may increase classification speed at the expense of accuracy. Suggested values are between 1.0 and 0.9.

fullLength

Numeric specifying the fold-difference in sequence lengths between sequences in test and trainingSet that is allowable, or 0 (the default) to consider all sequences in trainingSet regardless of length. Can be specified as either a single numeric (> 1), or two numerics specifying the upper and lower fold-difference. If fullLength is between 0 and 1 (exclusive), the fold-difference is inferred from the length variability among sequences belonging to each class based on the foldDifference quantiles. For example, setting fullLength to 0.99 would use the 1st and 99th percentile of intra-group length variability from the trainingSet. In the case of full-length sequences, specifying fullLength can improve both speed and accuracy by using sequence length as a pre-filter to classification. Note that fullLength should only be greater than 0 when both the test and trainingSet consist of full-length sequences.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

Details

Sequences in test are each assigned a taxonomic classification based on the trainingSet created with LearnTaxa. Each taxonomic level is given a confidence between 0% and 100%, and the taxonomy is truncated where confidence drops below threshold. If the taxonomic classification was truncated, the last group is labeled with “unclassified_” followed by the final taxon's name. Note that the reported confidence is not a p-value but does directly relate to a given classification's probability of being wrong. The default threshold of 60% is intended to minimize the rate of incorrect classifications. Lower values of threshold (e.g., 50%) may be preferred to increase the taxonomic depth of classifications.

Value

If type is "extended" (the default) then an object of class Taxa and subclass Train is returned. This is stored as a list with elements corresponding to their respective sequence in test. Each list element contains components:

taxon

A character vector containing the taxa to which the sequence was assigned.

confidence

A numeric vector giving the corresponding percent confidence for each taxon.

rank

If the classifier was trained with a set of ranks, a character vector containing the rank name of each taxon.

If type is "collapsed" then a character vector is returned with the taxonomic assignment for each sequence. This takes the repeating form “Taxon name [rank, confidence%]; ...” if ranks were supplied during training, or “Taxon name [confidence%]; ...” otherwise.

Author(s)

Erik Wright eswright@pitt.edu

References

Murali, A., et al. (2018). IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome, 6, 140. https://doi.org/10.1186/s40168-018-0521-5

See Also

LearnTaxa, Taxa-class

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
data("TrainingSet_16S")

# import test sequences
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# remove any gaps in the sequences
dna <- RemoveGaps(dna)

# classify the test sequences
ids <- IdTaxa(dna, TrainingSet_16S, strand="top")
ids

# view the results
plot(ids, TrainingSet_16S)

Example output

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package:BiocGenericsThe following objects are masked frompackage:parallel:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked frompackage:stats:

    IQR, mad, sd, var, xtabs

The following objects are masked frompackage:base:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package:S4VectorsThe following object is masked frompackage:base:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package:BiostringsThe following object is masked frompackage:base:

    strsplit

Loading required package: RSQLite
================================================================================

Time difference of 11.31 secs

  A test set of class 'Taxa' with length 175
      confidence name                 taxon
  [1]        63% uncultured bacter... Root; Bacteria; Firmicutes; Bacilli; Ba...
  [2]        68% uncultured bacter... Root; Bacteria; Firmicutes; Bacilli; Ba...
  [3]        62% uncultured bacter... Root; Bacteria; Firmicutes; Bacilli; Ba...
  [4]        92% uncultured bacter... Root; Bacteria; Firmicutes; Bacilli; La...
  [5]        62% uncultured bacter... Root; Bacteria; Firmicutes; Clostridia;...
  ...        ... ...                  ...
[171]        38% uncultured bacter... Root; unclassified_Root                   
[172]        49% uncultured bacter... Root; unclassified_Root                   
[173]        31% uncultured bacter... Root; unclassified_Root                   
[174]        48% uncultured bacter... Root; unclassified_Root                   
[175]        51% uncultured bacter... Root; unclassified_Root                   

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.