Given a biological sequence (or a set of biological sequences),
the alphabetFrequency
function computes the frequency of
each letter of the relevant alphabet.
letterFrequency
is similar, but more compact if one is only
interested in certain letters.
It can also tabulate letters "in common".
letterFrequencyInSlidingView
is a more specialized version
of letterFrequency
for (non-masked) XString objects.
It tallies the requested letter frequencies for a fixed-width view,
or window, that is conceptually slid along the entire input sequence.
The consensusMatrix
function computes the consensus matrix
of a set of sequences, and the consensusString
function creates
the consensus sequence from the consensus matrix based upon specified
criteria.
In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).
alphabetFrequency(x, as.prob=FALSE, ...)
hasOnlyBaseLetters(x)
uniqueLetters(x)
letterFrequency(x, letters, OR="|", as.prob=FALSE, ...)
letterFrequencyInSlidingView(x, view.width, letters, OR="|", as.prob=FALSE)
consensusMatrix(x, as.prob=FALSE, shift=0L, width=NULL, ...)
## S4 method for signature 'matrix'
consensusString(x, ambiguityMap="?", threshold=0.5)
## S4 method for signature 'DNAStringSet'
consensusString(x, ambiguityMap=IUPAC_CODE_MAP,
threshold=0.25, shift=0L, width=NULL)
## S4 method for signature 'RNAStringSet'
consensusString(x,
ambiguityMap=
structure(as.character(RNAStringSet(DNAStringSet(IUPAC_CODE_MAP))),
names=
as.character(RNAStringSet(DNAStringSet(names(IUPAC_CODE_MAP))))),
threshold=0.25, shift=0L, width=NULL)
x |
An XString, XStringSet, XStringViews
or MaskedXString object for DNA or RNA input for An XString object for A character vector, or an XStringSet or XStringViews
object for A consensus matrix (as returned by |
as.prob |
If |
view.width |
For |
letters |
For |
OR |
For |
ambiguityMap |
Either a single character to use when agreement is not reached or
a named character vector where the names are the ambiguity characters
and the values are the combinations of letters that comprise the
ambiguity (e.g. |
threshold |
The minimum probability threshold for an agreement to be declared.
When |
... |
Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods,
the Except for |
shift |
An integer vector (recycled to the length of |
width |
The number of columns of the returned matrix for the The length of the returned sequence for the |
alphabetFrequency
, letterFrequency
, and
letterFrequencyInSlidingView
are
generic functions defined in the Biostrings package.
letterFrequency
is similar to alphabetFrequency
but
specific to the letters of interest, hence more compact, especially
with OR
non-zero.
letterFrequencyInSlidingView
yields the same result, on the
sequence x
, that letterFrequency
would, if applied to the
hypothetical (and possibly huge) XStringViews
object
consisting of all the intervals of length view.width
on x
.
Taking advantage of the knowledge that successive "views" are nearly
identical, for letter counting purposes, it is both lighter and faster.
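The incremental idea described above can be sketched in base R. This is only an illustration of the technique, not the Biostrings implementation: the function name `sliding_letter_counts` is hypothetical, the input is a plain character string rather than an XString, and only single letters are tallied (no `OR` collation).

```r
# Base-R sketch (hypothetical helper, not part of Biostrings): count letters
# in every window of width `view_width` by updating the previous window.
sliding_letter_counts <- function(s, view_width, letters) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  n_views <- length(chars) - view_width + 1L
  stopifnot(view_width >= 1L, n_views >= 1L)
  counts <- matrix(0L, nrow = n_views, ncol = length(letters),
                   dimnames = list(NULL, letters))
  # Count the first window from scratch.
  first <- chars[seq_len(view_width)]
  current <- vapply(letters, function(l) sum(first == l), integer(1))
  counts[1L, ] <- current
  # Each later window drops one letter on the left and gains one on the right.
  for (i in seq_len(n_views - 1L)) {
    leaving <- chars[i]
    entering <- chars[i + view_width]
    current <- current - (letters == leaving) + (letters == entering)
    counts[i + 1L, ] <- current
  }
  counts
}

res <- sliding_letter_counts("ACGTAC", 3, c("A", "C", "G"))
res
```

Each step touches only two letters regardless of the window width, which is why this approach avoids materializing all the overlapping views.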
For letterFrequencyInSlidingView
, a masked (MaskedXString)
object x
is only supported through a cast to an (ordinary)
XString such as unmasked
(which includes its masked
regions).
When consensusString
is executed with a named character
ambiguityMap
argument, it weights each input string equally and
assigns an equal probability to each of the base letters represented by
an ambiguity letter. So for DNA and a threshold
of 0.25,
a "G" and an "R" would result in an "R" since
1/2 "G" + 1/2 "R" = 3/4 "G" + 1/4 "A" => "R";
two "G"'s and one "R" would result in a "G" since
2/3 "G" + 1/3 "R" = 5/6 "G" + 1/6 "A" => "G"; and
one "A" and one "N" would result in an "A" since
1/2 "A" + 1/2 "N" = 5/8 "A" + 1/8 "C" + 1/8 "G" + 1/8 "T" => "A"
(only "A" reaches the 0.25 threshold).
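The weighting arithmetic can be reproduced with a small base-R sketch. It is an illustration under stated assumptions rather than the package's code: `consensus_letter` is a hypothetical helper that handles a single alignment column, the `iupac` vector is typed out by hand to mirror the IUPAC nucleotide codes, and a base is assumed to be kept when its probability is at least the threshold.

```r
# Hand-written copy of the IUPAC nucleotide codes (assumption: same content
# as IUPAC_CODE_MAP; names are the codes, values the bases they stand for).
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")
bases <- c("A", "C", "G", "T")

# Hypothetical helper: consensus letter for one column of input letters.
consensus_letter <- function(column, threshold = 0.25) {
  # Weight each input letter equally and spread that weight evenly over
  # the bases the letter represents.
  prob <- setNames(numeric(length(bases)), bases)
  for (letter in column) {
    members <- strsplit(iupac[[letter]], "", fixed = TRUE)[[1]]
    prob[members] <- prob[members] + 1 / (length(column) * length(members))
  }
  # Keep the bases that reach the threshold and look up their combined code.
  kept <- paste(bases[prob >= threshold], collapse = "")
  names(iupac)[match(kept, iupac)]
}

consensus_letter(c("G", "R"))        # 3/4 G + 1/4 A
consensus_letter(c("G", "G", "R"))   # 5/6 G + 1/6 A
consensus_letter(c("A", "N"))        # 5/8 A + 1/8 each of C, G, T
```

Under this rule the first call gives "R" and the second "G". The third keeps only "A", which is consistent with the example output on this page where `consensusString(DNAStringSet(c("NNNN","ACTG")))` prints "ACTG".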
alphabetFrequency
returns an integer vector when x
is an
XString or MaskedXString object. When x
is an
XStringSet or XStringViews object, then it returns
an integer matrix with length(x)
rows where the
i
-th row contains the frequencies for x[[i]]
.
If x
is a DNA or RNA input, then the returned vector is named
with the letters in the alphabet. If the baseOnly
argument is
TRUE
, then the returned vector has only 5 elements: 4 elements
corresponding to the 4 nucleotides + the 'other' element.
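To make the shape of the `baseOnly=TRUE` result concrete, here is a base-R sketch that builds the same kind of 5-element vector from a plain character string. The helper name `base_only_counts` is hypothetical and the code does not use Biostrings; it only mimics the layout described above.

```r
# Hypothetical helper: tally the 4 bases plus an 'other' bucket.
base_only_counts <- function(s, bases = c("A", "C", "G", "T")) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # factor() with fixed levels keeps zero counts and the alphabet order.
  counts <- table(factor(chars, levels = bases))
  c(setNames(as.integer(counts), bases), other = sum(!chars %in% bases))
}

base_only_counts("ACGTNNA")
```

Every letter outside the four bases (here the two "N"s) is folded into the `other` element, so the vector always sums to the sequence length.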
letterFrequency
returns, similarly, an integer vector or matrix,
but restricted and/or collated according to letters
and OR
.
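The effect of `letters` and `OR` on the result can be sketched in base R. This is an approximation for illustration, assuming (as the example output on this page suggests) that a multi-letter element such as "CG" is collated into one "C|G" column by default and split into separate columns with `OR=0`. The helper `letter_counts` is hypothetical.

```r
# Hypothetical helper mimicking the collation behaviour on a plain string.
letter_counts <- function(s, letters, OR = "|") {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  groups <- strsplit(letters, "", fixed = TRUE)
  if (isTRUE(OR == 0)) {
    # OR=0: report every single letter in its own column.
    groups <- as.list(unlist(groups))
    names(groups) <- unlist(groups)
  } else {
    # Otherwise tally each multi-letter element as one collated column.
    names(groups) <- vapply(groups, paste, character(1), collapse = OR)
  }
  vapply(groups, function(g) sum(chars %in% g), integer(1))
}

letter_counts("ACGGT", c("A", "CG"))          # columns "A" and "C|G"
letter_counts("ACGGT", c("A", "CG"), OR = 0)  # columns "A", "C", "G"
```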
letterFrequencyInSlidingView
returns, for an XString
object x
of length (nchar
) L, an integer matrix
with L-view.width+1
rows, the i
-th of which holds the
letter frequencies of substring(x, i, i+view.width-1)
.
hasOnlyBaseLetters
returns TRUE
or FALSE
indicating
whether or not x
contains only base letters (i.e. As, Cs, Gs and Ts
for DNA input and As, Cs, Gs and Us for RNA input).
uniqueLetters
returns a vector of 1-letter or empty strings. The empty
string is used to represent the nul character if x
happens to contain
any. Note that this can only happen if the base class of x
is BString.
An integer matrix with letters as row names for consensusMatrix
.
A standard character string for consensusString
.
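As a rough picture of what a consensus matrix holds, the following base-R sketch counts letters per position for a few plain strings. It is a simplification: `consensus_counts` is a hypothetical helper, it ignores `shift` and `width`, and it assumes every letter belongs to the given alphabet.

```r
# Hypothetical helper: rows are letters, columns are positions.
consensus_counts <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  width <- max(nchar(seqs))
  cm <- matrix(0L, nrow = length(alphabet), ncol = width,
               dimnames = list(alphabet, NULL))
  for (s in seqs) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    for (pos in seq_along(chars)) {
      # Assumes chars[pos] is one of the row names in `alphabet`.
      cm[chars[pos], pos] <- cm[chars[pos], pos] + 1L
    }
  }
  cm
}

cm <- consensus_counts(c("ACTT", "ACTG", "TCT"))
cm
colSums(cm)  # number of sequences covering each position
```

The column sums give the per-position coverage, which is the relationship with `coverage()` illustrated at the end of the examples.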
H. Pagès and P. Aboyoun; H. Jaffee for letterFrequency and letterFrequencyInSlidingView
alphabet
,
coverage
,
oligonucleotideFrequency
,
countPDict
,
XString-class,
XStringSet-class,
XStringViews-class,
MaskedXString-class,
strsplit
## ---------------------------------------------------------------------
## alphabetFrequency()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
alphabetFrequency(yeast1)
alphabetFrequency(yeast1, baseOnly=TRUE)
hasOnlyBaseLetters(yeast1)
uniqueLetters(yeast1)
## With input made of multiple sequences:
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
alphabetFrequency(probes[1:50], baseOnly=TRUE)
alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)
## ---------------------------------------------------------------------
## letterFrequency()
## ---------------------------------------------------------------------
letterFrequency(probes[[1]], letters="ACGT", OR=0)
base_letters <- alphabet(probes, baseOnly=TRUE)
base_letters
letterFrequency(probes[[1]], letters=base_letters, OR=0)
base_letter_freqs <- letterFrequency(probes, letters=base_letters, OR=0)
head(base_letter_freqs)
GC_content <- letterFrequency(probes, letters="CG")
head(GC_content)
letterFrequency(probes, letters="CG", collapse=TRUE)
## ---------------------------------------------------------------------
## letterFrequencyInSlidingView()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
x <- DNAString(yeastSEQCHR1)
view.width <- 48
letters <- c("A", "CG")
two_columns <- letterFrequencyInSlidingView(x, view.width, letters)
head(two_columns)
tail(two_columns)
three_columns <- letterFrequencyInSlidingView(x, view.width, letters, OR=0)
head(three_columns)
tail(three_columns)
stopifnot(identical(two_columns[ , "C|G"],
three_columns[ , "C"] + three_columns[ , "G"]))
## Note that, alternatively, 'three_columns' can also be obtained by
## creating the views on 'x' (as a Views object) and by calling
## alphabetFrequency() on it. But, of course, that would be *much* less
## efficient (in terms of both memory and speed) than using
## letterFrequencyInSlidingView():
v <- Views(x, start=seq_len(length(x) - view.width + 1), width=view.width)
v
three_columns2 <- alphabetFrequency(v, baseOnly=TRUE)[ , c("A", "C", "G")]
stopifnot(identical(three_columns2, three_columns))
## Set the width of the view to length(x) to get the global frequencies:
letterFrequencyInSlidingView(x, letters="ACGTN", view.width=length(x), OR=0)
## ---------------------------------------------------------------------
## consensus*()
## ---------------------------------------------------------------------
## Read in ORF data:
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- readDNAStringSet(file)
## To illustrate, the following example assumes the ORF data
## to be aligned for the first 10 positions (patently false):
orf10 <- DNAStringSet(orf, end=10)
consensusMatrix(orf10, baseOnly=TRUE)
## The following example assumes the first 10 positions to be aligned
## after some incremental shifting to the right (patently false):
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)
## For the character matrix containing the "exploded" representation
## of the strings, do:
as.matrix(orf10, use.names=FALSE)
## consensusMatrix() can be used to just compute the alphabet frequency
## for each position in the input sequences:
consensusMatrix(probes, baseOnly=TRUE)
## After sorting, the first 5 probes might look similar (at least on
## their first bases):
consensusString(sort(probes)[1:5])
consensusString(sort(probes)[1:5], ambiguityMap = "N", threshold = 0.5)
## Consensus involving ambiguity letters in the input strings
consensusString(DNAStringSet(c("NNNN","ACTG")))
consensusString(DNAStringSet(c("AANN","ACTG")))
consensusString(DNAStringSet(c("ACAG","ACAR")))
consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
## ---------------------------------------------------------------------
## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage()
## ---------------------------------------------------------------------
## Applying colSums() on a consensus matrix gives the coverage that
## would be obtained by piling up (after shifting) the input sequences
## on top of an (imaginary) reference sequence:
cm <- consensusMatrix(orf10, shift=0:6, width=10)
colSums(cm)
## Note that this coverage can also be obtained with:
as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))
A C G T M R W S Y K V H D
69830 44643 45765 69970 0 0 0 0 0 0 0 0 0
B N - + .
0 0 0 0 0
A C G T other
69830 44643 45765 69970 0
[1] TRUE
[1] "A" "C" "G" "T"
A C G T other
[1,] 6 8 4 7 0
[2,] 7 6 5 7 0
[3,] 7 6 6 6 0
[4,] 5 7 7 6 0
[5,] 5 7 8 5 0
[6,] 8 5 7 5 0
[7,] 9 7 3 6 0
[8,] 8 6 7 4 0
[9,] 10 5 3 7 0
[10,] 10 6 3 6 0
[11,] 7 6 7 5 0
[12,] 11 7 3 4 0
[13,] 7 8 2 8 0
[14,] 2 7 5 11 0
[15,] 6 6 7 6 0
[16,] 5 7 6 7 0
[17,] 6 6 8 5 0
[18,] 6 6 8 5 0
[19,] 7 7 7 4 0
[20,] 6 8 6 5 0
[21,] 6 7 7 5 0
[22,] 10 5 7 3 0
[23,] 11 5 6 3 0
[24,] 10 5 4 6 0
[25,] 8 6 1 10 0
[26,] 8 9 1 7 0
[27,] 6 9 4 6 0
[28,] 5 4 11 5 0
[29,] 7 7 5 6 0
[30,] 3 7 4 11 0
[31,] 9 6 2 8 0
[32,] 6 6 5 8 0
[33,] 4 6 8 7 0
[34,] 4 7 6 8 0
[35,] 8 8 4 5 0
[36,] 8 4 7 6 0
[37,] 8 6 5 6 0
[38,] 6 6 5 8 0
[39,] 5 7 8 5 0
[40,] 8 6 6 5 0
[41,] 7 7 7 4 0
[42,] 6 4 5 10 0
[43,] 8 8 5 4 0
[44,] 4 5 7 9 0
[45,] 5 6 8 6 0
[46,] 7 8 5 5 0
[47,] 7 8 4 6 0
[48,] 6 6 7 6 0
[49,] 4 6 8 7 0
[50,] 4 5 8 8 0
A C G T other
1676179 1671151 1594446 1693224 0
A C G T
6 8 4 7
[1] "A" "C" "G" "T"
A C G T
6 8 4 7
A C G T
[1,] 6 8 4 7
[2,] 7 6 5 7
[3,] 7 6 6 6
[4,] 5 7 7 6
[5,] 5 7 8 5
[6,] 8 5 7 5
C|G
[1,] 12
[2,] 11
[3,] 12
[4,] 14
[5,] 15
[6,] 12
C|G
3265597
A C|G
[1,] 19 29
[2,] 19 29
[3,] 19 29
[4,] 18 30
[5,] 19 29
[6,] 18 30
A C|G
[230156,] 0 30
[230157,] 0 30
[230158,] 0 29
[230159,] 0 29
[230160,] 0 30
[230161,] 0 30
A C G
[1,] 19 29 0
[2,] 19 29 0
[3,] 19 29 0
[4,] 18 30 0
[5,] 19 29 0
[6,] 18 30 0
A C G
[230156,] 0 0 30
[230157,] 0 0 30
[230158,] 0 0 29
[230159,] 0 0 29
[230160,] 0 0 30
[230161,] 0 0 30
Views on a 230208-letter DNAString subject
subject: CCACACCACACCCACACACCCACACACCACACCA...TGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
views:
start end width
[1] 1 48 48 [CCACACCACACCCACACACCCA...CCACACCACACACCACACCACA]
[2] 2 49 48 [CACACCACACCCACACACCCAC...CACACCACACACCACACCACAC]
[3] 3 50 48 [ACACCACACCCACACACCCACA...ACACCACACACCACACCACACC]
[4] 4 51 48 [CACCACACCCACACACCCACAC...CACCACACACCACACCACACCC]
[5] 5 52 48 [ACCACACCCACACACCCACACA...ACCACACACCACACCACACCCA]
... ... ... ... ...
[230157] 230157 230204 48 [GGTGTGGGTGTGGTGTGGTGTG...TGTGGTGTGGGTGTGGTGTGTG]
[230158] 230158 230205 48 [GTGTGGGTGTGGTGTGGTGTGT...GTGGTGTGGGTGTGGTGTGTGT]
[230159] 230159 230206 48 [TGTGGGTGTGGTGTGGTGTGTG...TGGTGTGGGTGTGGTGTGTGTG]
[230160] 230160 230207 48 [GTGGGTGTGGTGTGGTGTGTGG...GGTGTGGGTGTGGTGTGTGTGG]
[230161] 230161 230208 48 [TGGGTGTGGTGTGGTGTGTGGG...GTGTGGGTGTGGTGTGTGTGGG]
A C G T N
[1,] 69830 44643 45765 69970 0
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A 2 2 2 0 4 3 3 3 2 1
C 3 1 2 2 2 1 0 0 2 3
G 1 1 1 2 1 0 3 3 1 2
T 1 3 2 3 0 3 1 1 2 1
other 0 0 0 0 0 0 0 0 0 0
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
A 1 0 0 0 2 1 4 2 4 1 3 1 2
C 0 1 1 2 1 2 1 1 0 2 3 0 1
G 0 0 0 0 1 2 0 3 2 1 0 2 1
T 0 1 2 2 1 1 2 1 1 3 0 2 0
other 0 0 0 0 0 0 0 0 0 0 0 0 0
[,14] [,15] [,16]
A 1 0 0
C 0 1 0
G 2 0 1
T 0 1 0
other 0 0 0
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A 1 0 0 0 2 1 4 2 4 1
C 0 1 1 2 1 2 1 1 0 2
G 0 0 0 0 1 2 0 3 2 1
T 0 1 2 2 1 1 2 1 1 3
other 0 0 0 0 0 0 0 0 0 0
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A" "C" "T" "T" "G" "T" "A" "A" "A" "T"
[2,] "T" "T" "C" "C" "A" "A" "G" "G" "C" "C"
[3,] "C" "T" "T" "C" "A" "T" "G" "T" "C" "A"
[4,] "C" "A" "C" "T" "C" "A" "T" "A" "T" "C"
[5,] "A" "G" "A" "G" "A" "A" "A" "G" "A" "G"
[6,] "G" "T" "G" "T" "C" "C" "G" "G" "G" "C"
[7,] "C" "A" "A" "G" "A" "T" "A" "A" "T" "G"
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A 89191 92082 75796 66864 58251 60832 59048 57840 58947 59089 61149 65433
C 22497 44721 59982 63857 76992 75497 72628 75823 79133 75208 72788 70599
G 98027 64220 67537 67201 69874 60639 66245 57260 61527 67086 62131 55959
T 55685 64377 62085 67478 60283 68432 67479 74477 65793 64017 69332 73409
other 0 0 0 0 0 0 0 0 0 0 0 0
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
A 72617 68041 56147 59231 61046 58223 59769 57775 61656 63262 71305 88914
C 54975 66917 73696 71279 70130 74022 74487 76839 75608 73229 70606 54117
G 50535 51712 67555 67205 62632 66451 66206 65394 63533 65623 60266 58448
T 87273 78730 68002 67685 71592 66704 64938 65392 64603 63286 63223 63921
other 0 0 0 0 0 0 0 0 0 0 0 0
[,25]
A 93671
C 45521
G 51180
T 75028
other 0
[1] "AAAAAACARSCYYMRGSMSGYTYRW"
[1] "AAAAAACANNCNCNAGNAGNCNCNN"
[1] "ACTG"
[1] "AMTG"
[1] "ACAR"
[1] "ACAG"
[1] 1 2 3 4 5 6 7 7 7 7
[1] 1 2 3 4 5 6 7 7 7 7