ConsensusSequence: Create a Consensus Sequence
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences


Forms a consensus sequence representing a set of sequences.

ConsensusSequence(myXStringSet,
                  threshold = 0.05,
                  ambiguity = TRUE,
                  noConsensusChar = "+",
                  minInformation = 1 - threshold,
                  ignoreNonBases = FALSE,
                  includeTerminalGaps = FALSE)

`myXStringSet`	An `AAStringSet`, `DNAStringSet`, or `RNAStringSet` object of aligned sequences.
`threshold`	Numeric specifying that less than `threshold` fraction of sequence information can be lost at any position of the consensus sequence.
`ambiguity`	Logical specifying whether to consider ambiguity codes as split between their respective nucleotides. Degeneracy codes are specified in the `IUPAC_CODE_MAP`.
`noConsensusChar`	Single character from the sequence's alphabet giving the base to use when there is no consensus in a position.
`minInformation`	Minimum fraction of information required to form consensus in each position.
`ignoreNonBases`	Logical specifying whether to ignore gap ("-"), mask ("+"), and unknown (".") characters when forming the consensus. With the default (`FALSE`) these characters are counted towards the consensus.
`includeTerminalGaps`	Logical specifying whether or not to include terminal gaps ("-" or "." characters on each end of the sequence) in the formation of the consensus.

ConsensusSequence removes the least frequent characters at each position, so long as they represent less than threshold fraction of the sequences in total. If necessary, ConsensusSequence represents the remaining characters using a degeneracy code from the IUPAC_CODE_MAP. Degeneracy codes are always used in cases where multiple characters are equally abundant.

Two key parameters control the degree of consensus: threshold and minInformation. The default threshold (0.05) means that less than 5% of sequences will not be represented by the consensus sequence at any given position. The default minInformation (1 - 0.05) specifies that at least 95% of sequences must contain the information in the consensus, otherwise the noConsensusChar is used. This enables an alternative character (e.g., "+") to be substituted at positions that would otherwise yield an ambiguity code.
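
As a rough illustration of how threshold and minInformation interact, the base R sketch below applies the rule described above to a single alignment column. The function name consensus_column is hypothetical, the handling of tied characters is an assumption, and this is not DECIPHER's implementation. It returns the set of retained characters rather than an IUPAC code.

```r
# Illustrative sketch only, not DECIPHER's actual code.
# 'chars' is one alignment column, given as a character vector.
consensus_column <- function(chars, threshold = 0.05,
                             minInformation = 1 - threshold,
                             noConsensusChar = "+") {
  tab <- table(chars)
  kept <- setNames(as.integer(tab), names(tab))  # count per character
  n <- length(chars)
  lost <- 0  # number of sequences no longer represented

  # Drop the rarest characters (tied characters together, an assumption)
  # while the fraction lost stays strictly below the threshold.
  while (length(kept) > 1) {
    rarest <- kept[kept == min(kept)]
    if (length(rarest) == length(kept)) break        # all remaining are tied
    if ((lost + sum(rarest)) / n >= threshold) break  # would lose too much
    lost <- lost + sum(rarest)
    kept <- kept[kept > min(kept)]
  }

  # Too little information retained: fall back to the placeholder.
  if (sum(kept) / n < minInformation) return(noConsensusChar)
  sort(names(kept))
}

column <- c("A", "A", "A", "T")
consensus_column(column)                                        # "A" "T"
consensus_column(column, threshold = 0.3)                       # "A"
consensus_column(column, threshold = 0.3, minInformation = 0.8) # "+"
```

In this sketch the default call keeps both A and T, which ConsensusSequence would encode as "W". Raising threshold to 0.3 lets the single T (25% of the column) be dropped, and requiring minInformation = 0.8 then rejects the remaining A because it covers only 75% of the sequences.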

If ambiguity = TRUE (the default) then degeneracy codes in myXStringSet are split between their respective bases according to the IUPAC_CODE_MAP for DNA/RNA and AMINO_ACID_CODE for AA. For example, an “R” in a DNAStringSet would count as half an “A” and half a “G”. If ambiguity = FALSE then degeneracy codes are not considered in forming the consensus. For an AAStringSet input, the lack of degeneracy codes generally results in “X” at positions with mismatches, unless the threshold is set to a higher value than the default.
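
The splitting of degeneracy codes can be pictured with a small base R sketch. Here code_map is a hand-written subset assumed to mirror the nucleotide entries of IUPAC_CODE_MAP, and split_counts is a hypothetical helper, not a DECIPHER function.

```r
# Local subset of IUPAC nucleotide codes (assumed to match the
# corresponding entries of Biostrings' IUPAC_CODE_MAP).
code_map <- c(A = "A", C = "C", G = "G", T = "T",
              R = "AG", Y = "CT", W = "AT", S = "CG", N = "ACGT")

# Hypothetical helper: fractional base counts for one alignment column.
split_counts <- function(column) {
  weights <- c(A = 0, C = 0, G = 0, T = 0)
  for (ch in column) {
    bases <- strsplit(code_map[[ch]], "")[[1]]
    # Each character contributes one count, shared equally among its bases.
    weights[bases] <- weights[bases] + 1 / length(bases)
  }
  weights
}

split_counts(c("R", "A", "A"))  # A = 2.5, C = 0, G = 0.5, T = 0
```

The "R" adds half a count to A and half to G, so the column is weighted as 2.5 A and 0.5 G before any threshold is applied.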

If ignoreNonBases = FALSE (the default) then gap ("-"), mask ("+"), and unknown (".") characters are counted towards the consensus; otherwise they are omitted from the calculation of the consensus. Note that gap ("-") and unknown (".") characters are treated interchangeably as gaps when forming the consensus sequence. For this reason, the consensus of a position with all unknown (".") characters will be a gap ("-"). Also, note that if a consensus is formed between sequences of different lengths then it will represent only the longest sequences at the end. For this reason the consensus sequence is generally based on a sequence alignment such that all of the sequences have equal lengths.

An XStringSet with a single consensus sequence matching the input type.

Erik Wright eswright@pitt.edu

Disambiguate, IdConsensus

library(DECIPHER)
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, limit=10)
BrowseSeqs(dna) # consensus at bottom
BrowseSeqs(dna, threshold=0.5) # consensus at bottom

# controlling the degree of consensus
AAAT <- DNAStringSet(c("A", "A", "A", "T"))
ConsensusSequence(AAAT) # "W"
ConsensusSequence(AAAT, threshold=0.3) # "A"
ConsensusSequence(AAAT, threshold=0.3, minInformation=0.8) # "+"
ConsensusSequence(AAAT, threshold=0.3, minInformation=0.8, noConsensusChar="N") # "N"

# switch between degenerate-based and majority-based consensus
majority <- DNAStringSet(c("GTT", "GAA", "CTG"))
ConsensusSequence(majority) # degenerate-based
ConsensusSequence(majority, threshold=0.5) # majority-based
ConsensusSequence(majority, threshold=0.5, minInformation=0.75)

# behavior in the case of a tie
ConsensusSequence(DNAStringSet(c("A", "T"))) # "W"
ConsensusSequence(DNAStringSet(c("A", "T")), threshold=0.5) # "W"
ConsensusSequence(AAStringSet(c("A", "T"))) # "X"
ConsensusSequence(AAStringSet(c("A", "T")), threshold=0.5) # "X"
ConsensusSequence(AAStringSet(c("I", "L"))) # "J"
ConsensusSequence(AAStringSet(c("I", "L")), threshold=0.5) # "J"

# handling terminal gaps
dna <- DNAStringSet(c("ANGCT-","-ACCT-"))
ConsensusSequence(dna) # "ANSCT-"
ConsensusSequence(dna, includeTerminalGaps=TRUE) # "+NSCT-"

# the "." character is treated as a "-"
aa <- AAStringSet(c("ANQIH-", "ADELW."))
ConsensusSequence(aa) # "ABZJX-"

# internal non-bases are included by default
ConsensusSequence(DNAStringSet(c("A-+.A", "AAAAA")), noConsensusChar="N") # "ANNNA"
ConsensusSequence(DNAStringSet(c("A-+.A", "AAAAA")), ignoreNonBases=TRUE) # "AAAAA"

# degeneracy codes in the input are considered by default
ConsensusSequence(DNAStringSet(c("AWNDA", "AAAAA"))) # "AWNDA"
ConsensusSequence(DNAStringSet(c("AWNDA", "AAAAA")), ambiguity=FALSE) # "AAAAA"

Search Expression:
select row_names, sequence from _Seqs where row_names in (select row_names
from Seqs) limit 10

DNAStringSet of length: 10
Time difference of 0.04 secs

  A DNAStringSet instance of length 1
    width seq
[1]     1 W
  A DNAStringSet instance of length 1
    width seq
[1]     1 A
  A DNAStringSet instance of length 1
    width seq
[1]     1 +
  A DNAStringSet instance of length 1
    width seq
[1]     1 N
  A DNAStringSet instance of length 1
    width seq
[1]     3 SWD
  A DNAStringSet instance of length 1
    width seq
[1]     3 GTD
  A DNAStringSet instance of length 1
    width seq
[1]     3 ++D
  A DNAStringSet instance of length 1
    width seq
[1]     1 W
  A DNAStringSet instance of length 1
    width seq
[1]     1 W
  A AAStringSet instance of length 1
    width seq
[1]     1 X
  A AAStringSet instance of length 1
    width seq
[1]     1 X
  A AAStringSet instance of length 1
    width seq
[1]     1 J
  A AAStringSet instance of length 1
    width seq
[1]     1 J
  A DNAStringSet instance of length 1
    width seq
[1]     6 ANSCT-
  A DNAStringSet instance of length 1
    width seq
[1]     6 +NSCT-
  A AAStringSet instance of length 1
    width seq
[1]     6 ABZJX-
  A DNAStringSet instance of length 1
    width seq
[1]     5 ANNNA
  A DNAStringSet instance of length 1
    width seq
[1]     5 AAAAA
  A DNAStringSet instance of length 1
    width seq
[1]     5 AWNDA
  A DNAStringSet instance of length 1
    width seq
[1]     5 AAAAA

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.
