cluster_consensus: Calculate consensus of a cluster of sequences.
In brendanf/tzara: Cluster long amplicons using dada2 denoising on variable regions


This algorithm assumes that the sequences "should be" identical except for amplification and sequencing errors. Its main purpose is to calculate a consensus sequence for an amplicon that is too long to use in DADA2 directly, but which has been clustered based on sequence variant identity in one subregion.

cluster_consensus(seq, nread = 1, ..., ncpus = 1, simplify = TRUE)

## S3 method for class 'character'
cluster_consensus(
  seq,
  nread = 1,
  names = base::names(seq),
  dna2rna = TRUE,
  ...,
  ncpus = 1,
  simplify = TRUE
)

## S3 method for class 'XStringSet'
cluster_consensus(seq, nread = 1, ..., ncpus = 1, simplify = TRUE)

`seq`	(`character` vector or `XStringSet-class`) The sequences to calculate a consensus for.
`nread`	(`integer` vector) For the purposes of calculating the consensus, consider each read to occur `nread` times. Supplying unique values for `seq` along with the corresponding `nread` is much faster than supplying duplicate reads to `cluster_consensus`.
`...`	passed to methods
`ncpus`	(`integer`) Number of CPUs to use.
`simplify`	(`logical`) If `TRUE`, return an object of the same type as `seq` containing a single sequence representing the consensus. If `FALSE`, return an object of the same type as `seq` with one element per input read: the consensus sequence for reads which were included in the consensus, or `NA_character_` for reads which were initially `NA` or which were removed from the consensus alignment as outliers. For the `XStringSet-class` method, which does not allow `NA` entries, these elements are missing from the set (which reads were dropped can be deduced from the names).
`names`	(`character`) If `seq` is a `character` vector, names for the sequences.
`dna2rna`	(`logical`) Whether to convert `seq` from DNA to RNA, and use (calculated) RNA secondary structure in alignments.
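The `nread` argument is easiest to use when duplicate reads are collapsed first. The following is a minimal sketch in base R, assuming the reads start out as a plain `character` vector with duplicates; the variable names `reads`, `uniq`, and `counts` are illustrative and are not part of the package. The final call is left as a comment because it needs tzara and its dependencies installed and has not been run on this toy input.

```r
# Toy cluster of reads with duplicates and one missing read (illustrative data).
reads <- c("ACGT", "ACGT", "ACGA", "ACGT", NA, "ACGA")

# table() drops NA by default and counts each distinct sequence once.
tab <- table(reads)
uniq <- names(tab)          # unique sequences, in sorted order
counts <- as.integer(tab)   # how many times each unique sequence occurred

# Hypothetical call, assuming tzara is installed:
# consensus <- tzara::cluster_consensus(uniq, nread = counts)
```

Here `uniq` and `counts` are parallel vectors, which is the shape the `seq` and `nread` arguments describe.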

The sequences are first aligned using `AlignSeqs` (from the DECIPHER package). Sequences which are "outliers" in the alignment are then removed by `odseq` (from the odseq package). If the input sequences were clustered based on DADA2 sequence variants of a variable region, and the sequences were appropriately quality filtered prior to running `dada`, then outliers should mostly be chimeras.

After outlier removal, sites with greater than 50% gaps are removed, and the most frequent letter (ignoring gaps) is chosen at all other sites. If no letter has greater than 50% representation at a position, then an IUPAC ambiguous base representing at least 50% of the reads at that position is chosen for nucleotide sequences, or "X" for amino acids.
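The per-site rule can be illustrated with a small base-R sketch. This is not the package's implementation: `consensus_column` and `consensus_sketch` are hypothetical helpers, and several details the text leaves open are assumptions here. Gap share is measured against all reads, letter shares are measured among non-gap reads only, and the ambiguity code is built from the most frequent bases (keeping ties together) until they cover at least half of the non-gap reads. Only DNA letters are handled.

```r
# IUPAC nucleotide codes, keyed by the sorted set of bases each one stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K",
           ACG = "V", ACT = "H", AGT = "D", CGT = "B", ACGT = "N")

# Consensus for one alignment column; returns "" when the column is dropped.
consensus_column <- function(col, nread = rep(1, length(col))) {
  # Weighted count of each symbol (gaps included).
  w <- vapply(split(as.numeric(nread), col), sum, numeric(1))
  gap <- if ("-" %in% names(w)) w[["-"]] else 0
  # Drop sites where gaps make up more than 50% of the weighted reads.
  if (gap * 2 > sum(w)) return("")
  bases <- sort(w[names(w) != "-"], decreasing = TRUE)
  # A single letter wins if it has more than 50% of the non-gap reads.
  if (bases[[1]] * 2 > sum(bases)) return(names(bases)[1])
  # Otherwise take the top bases (with ties) that cover at least 50%.
  k <- which(cumsum(bases) * 2 >= sum(bases))[1]
  chosen <- names(bases)[bases >= bases[[k]]]
  iupac[[paste(sort(chosen), collapse = "")]]
}

# Apply the column rule across an alignment given as equal-length strings.
consensus_sketch <- function(aln, nread = rep(1, length(aln))) {
  m <- do.call(rbind, strsplit(aln, ""))
  paste(apply(m, 2, consensus_column, nread = nread), collapse = "")
}

aln <- c("ACGT-A",
         "ACGT-A",
         "ACAT-G",
         "ACATTC")
unweighted <- consensus_sketch(aln)                      # "ACRTA"
weighted <- consensus_sketch(aln, nread = c(1, 1, 5, 1)) # "ACATG"
```

In the unweighted case the third column is split evenly between G and A, so it becomes the ambiguity code R, and the fifth column is dropped because three of four reads have a gap there. Weighting the third read five times resolves the third column to A and the last column to G.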

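An `XStringSet-class` object representing the consensus sequence (for `character` input, the `simplify` argument describes a return value of the same type as `seq`).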

brendanf/tzara documentation built on March 11, 2021, 5:40 a.m.