seqComplexity: Determine if input sequence(s) are low complexity.
In dada2: Accurate, high-resolution sample inference from amplicon sequencing data


This function calculates the kmer complexity of input sequences. Complexity is quantified as the Shannon richness of kmers, which can be thought of as the effective number of kmers if they were all at equal frequencies. If a window size is provided, the minimum Shannon richness observed over a sliding window along the sequence is returned.
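The "effective number of kmers" can be illustrated with a few lines of base R. This is a minimal sketch of the definition only, assuming Shannon richness means the exponential of the Shannon entropy of the kmer frequencies; `kmer_richness` is a hypothetical helper and is not part of dada2, whose own implementation may differ in detail.

```r
# Hypothetical helper (not part of dada2): Shannon richness of the kmers
# in a single sequence, i.e. exp(Shannon entropy of kmer frequencies).
kmer_richness <- function(sq, kmerSize = 2) {
  n <- nchar(sq)
  if (n < kmerSize) return(NA_real_)
  starts <- seq_len(n - kmerSize + 1)
  # All overlapping kmers of the requested size
  kmers <- substring(sq, starts, starts + kmerSize - 1)
  # Drop kmers that contain non-ACGT characters
  kmers <- kmers[grepl("^[ACGT]+$", kmers)]
  if (length(kmers) == 0) return(NA_real_)
  p <- as.numeric(table(kmers)) / length(kmers)
  exp(-sum(p * log(p)))
}

kmer_richness("ACGTACGT", kmerSize = 1)  # 4: four nucleotides at equal frequency
kmer_richness("TTTTTTTT", kmerSize = 1)  # 1: a single nucleotide
```

With `kmerSize = 2` the value ranges from 1 (a single repeated dinucleotide) up to 16 (all dinucleotides equally frequent).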

seqComplexity(seqs, kmerSize = 2, window = NULL, by = 5, ...)

`seqs`	(Required). A `character` vector of A/C/G/T sequences, or any object coercible by `getSequences`.
`kmerSize`	(Optional). Default 2. The size of the kmers (or "oligonucleotides" or "words") to use.
`window`	(Optional). Default NULL. The width in nucleotides of the moving window. If NULL the whole sequence is used.
`by`	(Optional). Default 5. The step size in nucleotides between each moving window tested.
`...`	(Optional). Ignored.
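How `window` and `by` interact can be sketched as follows. This is an illustration under stated assumptions, not dada2's code: windows are assumed to start at position 1 and advance by `by` nucleotides, with only full-width windows scored, and `min_over_windows` and `n_distinct` are hypothetical names. A simple stand-in score (the number of distinct nucleotides) is used in place of the kmer richness to keep the example short.

```r
# Hypothetical helper (not part of dada2): apply a scoring function to each
# full-width window and return the minimum score.
min_over_windows <- function(sq, window, by = 5, score) {
  n <- nchar(sq)
  if (n <= window) return(score(sq))       # sequence fits in a single window
  starts <- seq(1, n - window + 1, by = by)
  wins <- substring(sq, starts, starts + window - 1)
  min(vapply(wins, score, numeric(1)))
}

# Stand-in score: number of distinct nucleotides in the window
n_distinct <- function(w) length(unique(strsplit(w, "")[[1]]))

sq <- paste0(strrep("A", 10), "ACGTACGTAC")
n_distinct(sq)                                            # 4 over the whole sequence
min_over_windows(sq, window = 10, by = 5, score = n_distinct)  # 1, from the poly-A window
```

Taking the minimum means a single low-complexity region is enough to give the whole sequence a low value, even when the rest of the sequence looks normal.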

This function can be used to identify potentially artefactual or undesirable low-complexity sequences, or sequences with low-complexity regions, as are sometimes observed in Illumina sequencing runs. When such artefactual sequences are present, the Shannon kmer richness values returned by this function will typically show a clear bimodal signal.

Kmers with non-ACGT characters are ignored. Also note that no correction is performed for sequence lengths. This is important when using longer kmer lengths, where 4^kmerSize approaches the length of the sequence, as shorter sequences will then have a lower effective richness simply due to there being too little sequence to sample all the possible kmers.
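The length effect follows from a simple bound: the richness cannot exceed the number of distinct kmers present, which is at most the smaller of 4^kmerSize and the number of kmer positions in the sequence. A small sketch of that bound (`max_kmer_richness` is a hypothetical helper, not a dada2 function):

```r
# Hypothetical helper (not part of dada2): upper bound on the Shannon kmer
# richness of a sequence of length `len` for a given kmer size.
max_kmer_richness <- function(len, kmerSize) {
  n_positions <- len - kmerSize + 1   # number of overlapping kmers
  min(4^kmerSize, n_positions)
}

max_kmer_richness(74, kmerSize = 2)  # 16: limited by the 16 possible 2-mers
max_kmer_richness(74, kmerSize = 4)  # 71: limited by sequence length, not by 4^4 = 256
```

When the bound is set by sequence length rather than by 4^kmerSize, values from sequences of different lengths, or from short windows, are not directly comparable.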

numeric. A vector of minimum kmer complexities for each sequence.

plotComplexity, oligonucleotideFrequency

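library(dada2)

# A normal sequence, a low-complexity sequence, and a sequence with a low-complexity region
sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA"
sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC"
sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG"
sqs <- c(sq.norm, sq.lowc, sq.part)
# Whole-sequence complexity with the default 2-mers
seqComplexity(sqs)
# Minimum complexity over 25-nt windows, using 3-mers
seqComplexity(sqs, kmerSize=3, window=25)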