cluster_consensus: Calculate consensus of a cluster of sequences.


View source: R/tzara.R

Description

This algorithm assumes that the sequences "should be" identical except for amplification and sequencing errors. Its main purpose is to calculate a consensus sequence for an amplicon that is too long to use in DADA2 directly, but which has been clustered based on sequence variant identity in one subregion.

Usage

cluster_consensus(seq, nread = 1, ..., ncpus = 1, simplify = TRUE)

## S3 method for class 'character'
cluster_consensus(
  seq,
  nread = 1,
  names = base::names(seq),
  dna2rna = TRUE,
  ...,
  ncpus = 1,
  simplify = TRUE
)

## S3 method for class 'XStringSet'
cluster_consensus(seq, nread = 1, ..., ncpus = 1, simplify = TRUE)

Arguments

seq

(character vector or XStringSet-class) The sequences to calculate a consensus for.

nread

(integer vector) For the purposes of calculating the consensus, consider each read to occur nread times. Supplying unique values for seq along with the corresponding nread is much faster than supplying duplicate reads to cluster_consensus.

...

Passed to methods.

ncpus

(integer) Number of CPUs to use.

simplify

(logical) If TRUE, return an object of the same type as seq containing a single sequence representing the consensus. If FALSE, return an object of the same type as seq which holds the consensus sequence for each read that was included in the consensus, and NA_character_ for each read which was initially NA or which was removed from the consensus alignment as an outlier. The XStringSet-class method does not allow NA entries, so these elements are instead missing from the returned set; which reads were dropped can be deduced from the names.

names

(character) If seq is a character vector, names for the sequences.

dna2rna

(logical) Whether to convert seq from DNA to RNA and use the (calculated) RNA secondary structure in alignments.
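
As the nread argument notes, supplying unique sequences with their counts is faster than supplying duplicate reads. A minimal sketch of that dereplication step in base R follows; the helper name dereplicate and the toy reads are hypothetical and not part of tzara.

```r
# Hypothetical helper (not part of tzara): collapse duplicate reads into
# unique sequences plus the number of times each one occurred.
dereplicate <- function(reads) {
  counts <- table(reads)            # one entry per distinct sequence
  list(
    seq = names(counts),            # the unique sequences, sorted
    nread = as.integer(counts)      # how often each sequence was seen
  )
}

# Toy reads, invented for illustration only.
reads <- c("ACGTACGT", "ACGTACGT", "ACGAACGT",
           "ACGTACGT", "ACGAACGT", "ACGTACGT")
d <- dereplicate(reads)

# With tzara installed, the dereplicated input could then be passed on as:
#   tzara::cluster_consensus(d$seq, nread = d$nread)
```

The counts in d$nread sum to the original number of reads, so the weighting of each sequence in the consensus is unchanged.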

Details

The sequences are first aligned using AlignSeqs. Sequences which are "outliers" in the alignment are then removed by odseq. If the input sequences were clustered based on DADA2 sequence variants of a variable region, and the sequences were appropriately quality filtered prior to running dada, then outliers should mostly be chimeras.

After outlier removal, sites with greater than 50% gaps are removed, and the most frequent letter (ignoring gaps) is chosen at all other sites. If no letter has greater than 50% representation at a position, then an IUPAC ambiguous base representing at least 50% of the reads at that position is chosen for nucleotide sequences, or "X" for amino acids.
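
The consensus rule described above can be illustrated with a small base R sketch. This is not the tzara implementation: the function name consensus_sketch is hypothetical, the input is assumed to be an already aligned DNA alignment containing only A, C, G, T and "-", and two details the text leaves open are filled with assumptions (letter frequencies are computed among non-gap letters only, and an ambiguity code is built from the most frequent letters until they cover at least 50% of the reads).

```r
# IUPAC nucleotide codes, keyed by the sorted set of bases each one stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K",
           ACG = "V", ACT = "H", AGT = "D", CGT = "B", ACGT = "N")

# aln:   character vector of equal-width aligned sequences ("-" marks a gap)
# nread: weight (read count) of each aligned sequence
consensus_sketch <- function(aln, nread = rep(1, length(aln))) {
  cols <- do.call(rbind, strsplit(aln, ""))  # rows = sequences, columns = sites
  out <- character(0)
  for (j in seq_len(ncol(cols))) {
    # Weighted count of each symbol at this site.
    w <- vapply(split(as.numeric(nread), cols[, j]), sum, numeric(1))
    gaps <- if ("-" %in% names(w)) w[["-"]] else 0
    if (2 * gaps > sum(w)) next              # more than 50% gaps: drop the site
    # Assumption: from here on, frequencies ignore gaps entirely.
    w <- sort(w[names(w) != "-"], decreasing = TRUE)
    # Fewest top letters that together cover at least 50% of the reads.
    k <- which(2 * cumsum(w) >= sum(w))[1]
    # Without a strict majority, use at least two letters (an ambiguity code).
    if (2 * w[[1]] <= sum(w)) k <- max(k, 2L)
    key <- paste(sort(names(w)[seq_len(k)]), collapse = "")
    out <- c(out, iupac[[key]])
  }
  paste(out, collapse = "")
}

# Toy alignment, invented for illustration; site 4 is 75% gaps and is removed.
toy <- c("ACG-T", "ACG-T", "ATGAT", "A-C-A")
consensus_sketch(toy)  # "ACGT"
```

A site split evenly between A and G has no majority letter, so the sketch returns the ambiguity code R for it, for example consensus_sketch(c("A", "G", "A", "G")).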

Value

An XStringSet-class object representing the consensus sequence.


brendanf/tzara documentation built on March 11, 2021, 5:40 a.m.