letterFrequency | R Documentation |
Given a biological sequence (or a set of biological sequences),
the alphabetFrequency
function computes the frequency of
each letter of the relevant alphabet.
letterFrequency
is similar, but more compact if one is only
interested in certain letters.
It can also tabulate letters "in common".
letterFrequencyInSlidingView
is a more specialized version
of letterFrequency
for (non-masked) XString objects.
It tallys the requested letter frequencies for a fixed-width view,
or window, that is conceptually slid along the entire input sequence.
The consensusMatrix
function computes the consensus matrix
of a set of sequences, and the consensusString
function creates
the consensus sequence from the consensus matrix based upon specified
criteria.
In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).
alphabetFrequency(x, as.prob=FALSE, ...)
hasOnlyBaseLetters(x)
uniqueLetters(x)
letterFrequency(x, letters, OR="|", as.prob=FALSE, ...)
letterFrequencyInSlidingView(x, view.width, letters, OR="|", as.prob=FALSE)
consensusMatrix(x, as.prob=FALSE, shift=0L, width=NULL, ...)
## S4 method for signature 'matrix'
consensusString(x, ambiguityMap="?", threshold=0.5)
## S4 method for signature 'DNAStringSet'
consensusString(x, ambiguityMap=IUPAC_CODE_MAP,
threshold=0.25, shift=0L, width=NULL)
## S4 method for signature 'RNAStringSet'
consensusString(x,
ambiguityMap=
structure(as.character(RNAStringSet(DNAStringSet(IUPAC_CODE_MAP))),
names=
as.character(RNAStringSet(DNAStringSet(names(IUPAC_CODE_MAP))))),
threshold=0.25, shift=0L, width=NULL)
x |
An XString, XStringSet, XStringViews
or MaskedXString object for DNA or RNA input for An XString object for A character vector, or an XStringSet or XStringViews
object for A consensus matrix (as returned by |
as.prob |
If |
view.width |
For |
letters |
For |
OR |
For |
ambiguityMap |
Either a single character to use when agreement is not reached or
a named character vector where the names are the ambiguity characters
and the values are the combinations of letters that comprise the
ambiguity (e.g. |
threshold |
The minimum probability threshold for an agreement to be declared.
When |
... |
Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods,
the Except for |
shift |
An integer vector (recycled to the length of |
width |
The number of columns of the returned matrix for the The length of the returned sequence for the |
alphabetFrequency
, letterFrequency
, and
letterFrequencyInSlidingView
are
generic functions defined in the Biostrings package.
letterFrequency
is similar to alphabetFrequency
but
specific to the letters of interest, hence more compact, especially
with OR
non-zero.
letterFrequencyInSlidingView
yields the same result, on the
sequence x
, that letterFrequency
would, if applied to the
hypothetical (and possibly huge) XStringViews
object
consisting of all the intervals of length view.width
on x
.
Taking advantage of the knowledge that successive "views" are nearly
identical, for letter counting purposes, it is both lighter and faster.
For letterFrequencyInSlidingView
, a masked (MaskedXString)
object x
is only supported through a cast to an (ordinary)
XString such as unmasked
(which includes its masked
regions).
When consensusString
is executed with a named character
ambiguityMap
argument, it weights each input string equally and
assigns an equal probability to each of the base letters represented by
an ambiguity letter. So for DNA and a threshold
of 0.25,
a "G" and an "R" would result in an "R" since
1/2 "G" + 1/2 "R" = 3/4 "G" + 1/4 "A" => "R";
two "G"'s and one "R" would result in a "G" since
2/3 "G" + 1/3 "R" = 5/6 "G" + 1/6 "A" => "G"; and
one "A" and one "N" would result in an "N" since
1/2 "A" + 1/2 "N" = 5/8 "A" + 1/8 "C" + 1/8 "G" + 1/8 "T" => "N".
alphabetFrequency
returns an integer vector when x
is an
XString or MaskedXString object. When x
is an
XStringSet or XStringViews object, then it returns
an integer matrix with length(x)
rows where the
i
-th row contains the frequencies for x[[i]]
.
If x
is a DNA, RNA, or AA input, then the returned vector is named
with the letters in the alphabet. If the baseOnly
argument is
TRUE
, then the returned vector has only 5 elements for DNA/RNA input
(4 elements corresponding to the 4 nucleotides + the 'other' element) and
21 elements for AA input (20 elements corresponding to the 20 base amino acids
+ the 'other' element).
letterFrequency
returns, similarly, an integer vector or matrix,
but restricted and/or collated according to letters
and OR
.
letterFrequencyInSlidingView
returns, for an XString
object x
of length (nchar
) L, an integer matrix
with L-view.width+1
rows, the i
-th of which holding the
letter frequencies of substring(x, i, i+view.width-1)
.
hasOnlyBaseLetters
returns TRUE
or FALSE
indicating
whether or not x
contains only base letters (i.e. As, Cs, Gs and Ts
for DNA input, As, Cs, Gs and Us for RNA input, or any of the 20 standard
amino acids for AA input).
uniqueLetters
returns a vector of 1-letter or empty strings. The empty
string is used to represent the nul character if x
happens to contain
any. Note that this can only happen if the base class of x
is BString.
An integer matrix with letters as row names for consensusMatrix
.
A standard character string for consensusString
.
H. Pagès and P. Aboyoun; H. Jaffee for letterFrequency and letterFrequencyInSlidingView
alphabet
,
coverage
,
oligonucleotideFrequency
,
countPDict
,
XString-class,
XStringSet-class,
XStringViews-class,
MaskedXString-class,
strsplit
## ---------------------------------------------------------------------
## alphabetFrequency()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
alphabetFrequency(yeast1)
alphabetFrequency(yeast1, baseOnly=TRUE)
hasOnlyBaseLetters(yeast1)
uniqueLetters(yeast1)
## With input made of multiple sequences:
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
alphabetFrequency(probes[1:50], baseOnly=TRUE)
alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)
## ---------------------------------------------------------------------
## letterFrequency()
## ---------------------------------------------------------------------
letterFrequency(probes[[1]], letters="ACGT", OR=0)
base_letters <- alphabet(probes, baseOnly=TRUE)
base_letters
letterFrequency(probes[[1]], letters=base_letters, OR=0)
base_letter_freqs <- letterFrequency(probes, letters=base_letters, OR=0)
head(base_letter_freqs)
GC_content <- letterFrequency(probes, letters="CG")
head(GC_content)
letterFrequency(probes, letters="CG", collapse=TRUE)
## ---------------------------------------------------------------------
## letterFrequencyInSlidingView()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
x <- DNAString(yeastSEQCHR1)
view.width <- 48
letters <- c("A", "CG")
two_columns <- letterFrequencyInSlidingView(x, view.width, letters)
head(two_columns)
tail(two_columns)
three_columns <- letterFrequencyInSlidingView(x, view.width, letters, OR=0)
head(three_columns)
tail(three_columns)
stopifnot(identical(two_columns[ , "C|G"],
three_columns[ , "C"] + three_columns[ , "G"]))
## Note that, alternatively, 'three_columns' can also be obtained by
## creating the views on 'x' (as a Views object) and by calling
## alphabetFrequency() on it. But, of course, that is be *much* less
## efficient (both, in terms of memory and speed) than using
## letterFrequencyInSlidingView():
v <- Views(x, start=seq_len(length(x) - view.width + 1), width=view.width)
v
three_columns2 <- alphabetFrequency(v, baseOnly=TRUE)[ , c("A", "C", "G")]
stopifnot(identical(three_columns2, three_columns))
## Set the width of the view to length(x) to get the global frequencies:
letterFrequencyInSlidingView(x, letters="ACGTN", view.width=length(x), OR=0)
## ---------------------------------------------------------------------
## consensus*()
## ---------------------------------------------------------------------
## Read in ORF data:
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- readDNAStringSet(file)
## To illustrate, the following example assumes the ORF data
## to be aligned for the first 10 positions (patently false):
orf10 <- DNAStringSet(orf, end=10)
consensusMatrix(orf10, baseOnly=TRUE)
## The following example assumes the first 10 positions to be aligned
## after some incremental shifting to the right (patently false):
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)
## For the character matrix containing the "exploded" representation
## of the strings, do:
as.matrix(orf10, use.names=FALSE)
## consensusMatrix() can be used to just compute the alphabet frequency
## for each position in the input sequences:
consensusMatrix(probes, baseOnly=TRUE)
## After sorting, the first 5 probes might look similar (at least on
## their first bases):
consensusString(sort(probes)[1:5])
consensusString(sort(probes)[1:5], ambiguityMap = "N", threshold = 0.5)
## Consensus involving ambiguity letters in the input strings
consensusString(DNAStringSet(c("NNNN","ACTG")))
consensusString(DNAStringSet(c("AANN","ACTG")))
consensusString(DNAStringSet(c("ACAG","ACAR")))
consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
## ---------------------------------------------------------------------
## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage()
## ---------------------------------------------------------------------
## Applying colSums() on a consensus matrix gives the coverage that
## would be obtained by piling up (after shifting) the input sequences
## on top of an (imaginary) reference sequence:
cm <- consensusMatrix(orf10, shift=0:6, width=10)
colSums(cm)
## Note that this coverage can also be obtained with:
as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.