otu: Cluster sequences into operational taxonomic units.
In kmer: Fast K-Mer Counting and Clustering for Biological Sequence Analysis

otu	R Documentation

Description

This function performs divisive hierarchical clustering on a set of DNA sequences using sequential k-means partitioning, returning an integer vector of OTU membership.

Usage

otu(
  x,
  k = 5,
  threshold = 0.97,
  method = "central",
  residues = NULL,
  gap = "-",
  ...
)

Arguments

`x`	a "DNAbin" object.
`k`	integer giving the k-mer size used to generate the input matrix for k-means clustering.
`threshold`	numeric between 0 and 1 giving the OTU identity cutoff. Defaults to 0.97.
`method`	the maximum distance criterion to use for terminating the recursive partitioning procedure. Accepted options are "central" (splitting stops if the similarity between the central sequence and its farthest neighbor within the cluster is greater than the threshold), "centroid" (splitting stops if the similarity between the centroid and its farthest neighbor within the cluster is greater than the threshold), and "farthest" (splitting stops if the similarity between the two farthest sequences within the cluster is greater than the threshold). Defaults to "central".
`residues`	either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless the sequence list is a "DNAbin" or "AAbin" object.
`gap`	the character used to represent gaps in the alignment matrix (if applicable). Ignored for `"DNAbin"` or `"AAbin"` objects. Defaults to "-" otherwise.
`...`	further arguments to be passed to `kmeans` (not including `centers`).
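
The three `method` options can be compared on a toy matrix of k-mer counts. The sketch below is illustrative only and is not taken from the package source: the `similarity` helper (shared k-mer fraction), the toy counts, and the choice of the "central" sequence as the member most similar to the centroid are all assumptions.

```r
## Toy k-mer count matrix: rows are sequences, columns are k-mers (made-up values)
counts <- rbind(s1 = c(4, 0, 1), s2 = c(3, 1, 1), s3 = c(0, 4, 1))

## Assumed similarity: fraction of k-mers shared by two count vectors
similarity <- function(a, b) sum(pmin(a, b)) / min(sum(a), sum(b))

centroid <- colMeans(counts)
sim_to_centroid <- apply(counts, 1, similarity, b = centroid)

## Assumption: the "central" sequence is the member most similar to the centroid
central <- counts[which.max(sim_to_centroid), ]

## Similarity to the farthest neighbor under each criterion
criteria <- c(
  central  = min(apply(counts, 1, similarity, b = central)),
  centroid = min(sim_to_centroid),
  farthest = min(combn(nrow(counts), 2, function(ij)
    similarity(counts[ij[1], ], counts[ij[2], ])))
)

## Splitting would stop only where the value reaches the threshold
criteria >= 0.97
```

The same cluster can pass one criterion and fail another, so the choice of `method` can change the number of OTUs returned.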

Details

This function clusters sequences into OTUs by first generating a matrix of k-mer counts and then splitting the matrix row-wise into two subsets using the k-means algorithm (with two centers). Each subset is split again recursively until every cluster satisfies the stopping criterion selected by `method`, that is, until the similarity to the farthest sequence in the cluster exceeds `threshold`.
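
The recursive procedure can be sketched in base R as follows. This is a simplified illustration rather than the package's implementation: the helper names (`count_kmers`, `shared_kmer_similarity`, `min_similarity`, `split_recursively`) are hypothetical, the similarity measure is an assumed stand-in, only the "farthest" criterion is shown, and the input is restricted to unambiguous character strings.

```r
## Count overlapping k-mers in character strings containing only A, C, G, T
count_kmers <- function(seqs, k = 3) {
  alphabet <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  kmers <- unname(apply(grid, 1, paste, collapse = ""))
  counts <- matrix(0, nrow = length(seqs), ncol = length(kmers),
                   dimnames = list(names(seqs), kmers))
  for (i in seq_along(seqs)) {
    starts <- seq_len(nchar(seqs[i]) - k + 1)
    tab <- table(substring(seqs[i], starts, starts + k - 1))
    counts[i, names(tab)] <- as.numeric(tab)
  }
  counts
}

## Assumed similarity: fraction of k-mers shared by two count vectors
shared_kmer_similarity <- function(a, b) sum(pmin(a, b)) / min(sum(a), sum(b))

## Lowest pairwise similarity within a cluster ("farthest" criterion)
min_similarity <- function(counts) {
  if (nrow(counts) < 2) return(1)
  min(combn(nrow(counts), 2, function(ij)
    shared_kmer_similarity(counts[ij[1], ], counts[ij[2], ])))
}

## Split with 2-means until every cluster meets the threshold
split_recursively <- function(counts, threshold, ...) {
  n <- nrow(counts)
  if (min_similarity(counts) >= threshold) return(rep(1L, n))
  if (n == 2) return(1:2)  # two dissimilar rows: one OTU each
  part <- kmeans(counts, centers = 2, ...)$cluster
  left <- split_recursively(counts[part == 1, , drop = FALSE], threshold, ...)
  right <- split_recursively(counts[part == 2, , drop = FALSE], threshold, ...)
  out <- integer(n)
  out[part == 1] <- left
  out[part == 2] <- right + max(left)  # keep OTU labels unique
  out
}

## Toy data: three near-identical "a" sequences and two identical "b" sequences
base1 <- strrep("ACGT", 10)
base2 <- strrep("TTGGCCAA", 5)
a3 <- base1
substr(a3, 20, 20) <- "A"  # single substitution
seqs <- c(a1 = base1, a2 = base1, a3 = a3, b1 = base2, b2 = base2)

set.seed(999)
counts <- count_kmers(seqs, k = 3)
otus <- setNames(split_recursively(counts, threshold = 0.9, nstart = 10),
                 names(seqs))
otus
```

With these toy sequences the "a" and "b" groups share no 3-mers, so the first split separates them, and each group then meets the 0.9 threshold without further splitting. No pairwise distance matrix over the full data set is ever built; similarities are only computed within the current cluster.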

This is a divisive, or "top-down", approach to OTU clustering, as opposed to agglomerative "bottom-up" methods. It is particularly useful for large datasets with many sequences (n > 10,000) since the need to compute a large n * n distance matrix is circumvented. This effectively reduces the time and memory complexity from quadratic to linear, while generally maintaining comparable accuracy.

It is recommended to increase the value of nstart passed to kmeans via the ... argument to at least 20. While this can increase computation time, it can improve clustering accuracy considerably.
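
The effect of `nstart` can be seen directly with `stats::kmeans`. The following is a small sketch on simulated data standing in for a k-mer count matrix (the matrix `x` and its group structure are made up for illustration); it compares the total within-cluster sum of squares from a single random start with the best of 20 starts.

```r
set.seed(42)
## Simulated stand-in for a k-mer count matrix: three loose groups of rows
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
           matrix(rnorm(60, mean = 3), ncol = 3),
           matrix(rnorm(60, mean = 6), ncol = 3))

## Same seed for both runs, so the only difference is the number of starts
set.seed(1)
single <- kmeans(x, centers = 2, nstart = 1)
set.seed(1)
multi <- kmeans(x, centers = 2, nstart = 20)

## Lower is better; the multi-start run keeps the best of its 20 attempts
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

Because each two-way split in the recursion depends on a k-means solution, a poor local optimum early on can propagate to all downstream clusters, which is why the extra starts tend to be worth the additional computation.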

DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or a matrix of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al. 2004; Paradis 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however, ambiguity codes may not be recognized or treated appropriately. In the raw-byte formats, ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT").
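
The fractional counting of ambiguity codes can be illustrated with a short base R sketch. The lookup table `iupac` (two-base codes only) and the helper `fractional_kmer` are hypothetical names introduced here, and equal weighting of the underlying residues is an assumption consistent with the "ACRGT" example above.

```r
## Partial IUPAC lookup (two-base ambiguity codes only), for illustration
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"),
              W = c("A", "T"), K = c("G", "T"), M = c("A", "C"))

## Expand one k-mer into the unambiguous k-mers it could represent,
## giving each an equal share of a single count
fractional_kmer <- function(kmer) {
  options <- unname(iupac[strsplit(kmer, "")[[1]]])
  grid <- expand.grid(options, stringsAsFactors = FALSE)
  expanded <- unname(apply(grid, 1, paste, collapse = ""))
  weights <- rep(1 / length(expanded), length(expanded))
  names(weights) <- expanded
  weights
}

fractional_kmer("ACRGT")
```

Here "R" stands for A or G, so the single observed 5-mer is divided between "ACAGT" and "ACGGT".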

To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid excessive memory use and computation time.
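
To show why compression helps, the sketch below recodes a character amino acid sequence with the Dayhoff-6 groups (AGPST, C, DENQ, FWY, HKR, ILMV) and compares the number of possible 5-mers before and after. The helper `dayhoff6` and the group labels A to F are arbitrary choices made for this illustration and do not reflect how the package encodes the compressed alphabet internally.

```r
## Map each of the 20 amino acids to one of six Dayhoff groups
## AGPST -> A, C -> B, DENQ -> C, FWY -> D, HKR -> E, ILMV -> F
dayhoff6 <- function(aa) {
  chartr("AGPSTCDENQFWYHKRILMV",
         "AAAAABCCCCDDDEEEFFFF", aa)
}

compressed <- dayhoff6("MKVLAAGIW")
compressed

## Number of columns in a 5-mer count matrix for each alphabet
n_columns <- c(full = 20^5, dayhoff6 = 6^5)
n_columns
```

The count matrix for 5-mers shrinks from 3,200,000 possible columns to 7,776, which is the reason longer k-mers remain practical for "AAbin" input.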

Value

A named integer vector of cluster membership with values ranging from 1 to the total number of OTUs. Asterisks in the names indicate the representative sequence within each cluster.
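
A membership vector of this form can be summarised with base R. The vector below is a made-up stand-in, not real `otu()` output, and it assumes the asterisk is appended to the end of the representative sequence's name.

```r
## Hypothetical membership vector in the format described above
otus <- c("seqA*" = 1L, "seqB" = 1L, "seqC*" = 2L, "seqD" = 2L, "seqE" = 2L)

## Number of sequences per OTU
otu_sizes <- table(otus)

## Names of the representative sequences, with the asterisk removed
is_rep <- grepl("\\*$", names(otus))
representatives <- sub("\\*$", "", names(otus)[is_rep])

## Sequence names grouped by OTU
members <- split(sub("\\*$", "", names(otus)), otus)
members
```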

Author(s)

Shaun Wilkinson

References

Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.

Examples

## Not run: 
## Cluster the woodmouse dataset (from the ape package) into OTUs
library(ape)
data(woodmouse)
## trim gappy ends: drop alignment columns containing N (raw byte 0xf0 in "DNAbin")
woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
## cluster sequences into OTUs at 0.97 threshold with kmer size = 5
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
woodmouse.OTUs <- otu(woodmouse, k = 5, threshold = 0.97, nstart = 20)
woodmouse.OTUs

## End(Not run)

kmer documentation built on Jan. 23, 2026, 9:07 a.m.