letterFrequency: Calculate the frequency of letters in a biological sequence,...
In Biostrings: Efficient manipulation of biological strings

Given a biological sequence (or a set of biological sequences), the alphabetFrequency function computes the frequency of each letter of the relevant alphabet.

letterFrequency is similar, but more compact if one is only interested in certain letters. It can also tabulate letters "in common".

letterFrequencyInSlidingView is a more specialized version of letterFrequency for (non-masked) XString objects. It tallies the requested letter frequencies for a fixed-width view, or window, that is conceptually slid along the entire input sequence.

The consensusMatrix function computes the consensus matrix of a set of sequences, and the consensusString function creates the consensus sequence from the consensus matrix based upon specified criteria.

In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).

alphabetFrequency(x, as.prob=FALSE, ...)
hasOnlyBaseLetters(x)
uniqueLetters(x)

letterFrequency(x, letters, OR="|", as.prob=FALSE, ...)
letterFrequencyInSlidingView(x, view.width, letters, OR="|", as.prob=FALSE)

consensusMatrix(x, as.prob=FALSE, shift=0L, width=NULL, ...)

## S4 method for signature 'matrix'
consensusString(x, ambiguityMap="?", threshold=0.5)
## S4 method for signature 'DNAStringSet'
consensusString(x, ambiguityMap=IUPAC_CODE_MAP,
             threshold=0.25, shift=0L, width=NULL)
## S4 method for signature 'RNAStringSet'
consensusString(x, 
             ambiguityMap=
             structure(as.character(RNAStringSet(DNAStringSet(IUPAC_CODE_MAP))),
                       names=
                       as.character(RNAStringSet(DNAStringSet(names(IUPAC_CODE_MAP))))),
             threshold=0.25, shift=0L, width=NULL)

`x`	An XString, XStringSet, XStringViews or MaskedXString object for `alphabetFrequency`, `letterFrequency`, or `uniqueLetters`. DNA or RNA input for `hasOnlyBaseLetters`. An XString object for `letterFrequencyInSlidingView`. A character vector, or an XStringSet or XStringViews object for `consensusMatrix`. A consensus matrix (as returned by `consensusMatrix`), or an XStringSet or XStringViews object for `consensusString`.
`as.prob`	If `TRUE` then probabilities are reported, otherwise counts (the default).
`view.width`	For `letterFrequencyInSlidingView`, the constant (e.g. 35, 48, 1000) size of the "window" to slide along `x`. The specified `letters` are tabulated in each window of length `view.width`. The rows of the result (see value) correspond to the various windows.
`letters`	For `letterFrequency` or `letterFrequencyInSlidingView`, a character vector (e.g. "C", "CG", c("C", "G")) giving the letters to tabulate. When `x` is DNA or RNA input, `letters` must come from `alphabet(x)`. Except with `OR=0`, multi-character elements of `letters` ('nchar' > 1) are taken as groupings of letters into subsets, to be tabulated in common ("or"'d), as if their `alphabetFrequency` counts were added (see Arithmetic). The columns of the result (see Value) correspond to the individual letters and sets of letters that are counted separately. Unrelated (and, with some post-processing, related) counts may of course be obtained in separate calls.
`OR`	For `letterFrequency` or `letterFrequencyInSlidingView`, the string (default `"|"`) to use as a separator in forming names for the "grouped" columns, e.g. "C|G". The otherwise exceptional value `0` (zero) disables or'ing and is provided for convenience, allowing a single multi-character string (or several strings) of letters that should be counted separately. If some but not all letters are to be counted separately, they must reside in separate elements of letters (with 'nchar' 1 unless they are to be grouped with other letters), and `OR` cannot be 0.
`ambiguityMap`	Either a single character to use when agreement is not reached or a named character vector where the names are the ambiguity characters and the values are the combinations of letters that comprise the ambiguity (e.g. `IUPAC_CODE_MAP`). When `ambiguityMap` is a named character vector, occurrences of ambiguous letters in `x` are replaced with their base alphabet letters that have been equally weighted to sum to 1. (See Details for some examples.)
`threshold`	The minimum probability threshold for an agreement to be declared. When `ambiguityMap` is a single character, `threshold` is a single number in (0, 1]. When `ambiguityMap` is a named character vector (e.g. `IUPAC_CODE_MAP`), `threshold` is a single number in (0, 1/sum(nchar(ambiguityMap) == 1)].
`...`	Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods, the `collapse` argument is accepted. Except for `letterFrequency` or `letterFrequencyInSlidingView`, and with DNA or RNA input, the `baseOnly` argument is accepted. If `baseOnly` is `TRUE`, the returned vector (or matrix) only contains the frequencies of the letters that belong to the "base" alphabet of `x` i.e. to the alphabet returned by `alphabet(x, baseOnly=TRUE)`.
`shift`	An integer vector (recycled to the length of `x`) specifying how each sequence in `x` should be (horizontally) shifted with respect to the first column of the consensus matrix to be returned. By default (`shift=0`), each sequence in `x` has its first letter aligned with the first column of the matrix. A positive `shift` value means that the corresponding sequence must be shifted to the right, and a negative `shift` value that it must be shifted to the left. For example, a shift of 5 means that it must be shifted 5 positions to the right (i.e. the first letter in the sequence must be aligned with the 6th column of the matrix), and a shift of -3 means that it must be shifted 3 positions to the left (i.e. the 4th letter in the sequence must be aligned with the first column of the matrix).
`width`	For the `consensusMatrix` method for XStringSet objects, the number of columns of the returned matrix. When `width=NULL` (the default), this method returns a matrix that has just enough columns to have its last column aligned with the rightmost letter of all the sequences in `x` after those sequences have been shifted (see the `shift` argument above). This ensures that any wider consensus matrix would be a "padded with zeros" version of the matrix returned when `width=NULL`. For the `consensusString` method for XStringSet objects, `width` is the length of the returned sequence.
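
The `shift` and `width` rules above can be made concrete with a small base-R sketch. It is not the Biostrings implementation: `consensus_counts()` is a hypothetical helper written only for illustration, it handles plain character vectors, and it counts only the letters A, C, G and T.

```r
# Hypothetical helper (not part of Biostrings): per-column letter counts
# for character strings placed according to `shift` and `width`.
consensus_counts <- function(seqs, shift = 0L, width = NULL,
                             alphabet = c("A", "C", "G", "T")) {
  # Recycle shift to one value per sequence.
  shift <- rep_len(as.integer(shift), length(seqs))
  if (is.null(width)) {
    # Just enough columns to reach the rightmost shifted letter.
    width <- max(nchar(seqs) + shift)
  }
  cm <- matrix(0L, nrow = length(alphabet), ncol = width,
               dimnames = list(alphabet, NULL))
  for (k in seq_along(seqs)) {
    chars <- strsplit(seqs[k], "", fixed = TRUE)[[1]]
    cols <- seq_along(chars) + shift[k]
    # Letters shifted outside columns 1..width are dropped.
    keep <- cols >= 1L & cols <= width & chars %in% alphabet
    for (j in which(keep)) {
      cm[chars[j], cols[j]] <- cm[chars[j], cols[j]] + 1L
    }
  }
  cm
}

seqs <- c("ACGT", "ACGT", "ACGT")
cm <- consensus_counts(seqs, shift = c(0L, 2L, -1L))
cm
colSums(cm)  # 2 2 3 2 1 1
```

With a shift of 2 the second sequence starts in column 3, and with a shift of -1 the third sequence has its second letter in column 1. The column sums give the number of sequences piled up at each position, which is the coverage relationship illustrated at the end of the Examples.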

alphabetFrequency, letterFrequency, and letterFrequencyInSlidingView are generic functions defined in the Biostrings package.

letterFrequency is similar to alphabetFrequency but specific to the letters of interest, hence more compact, especially with OR non-zero.
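
As a rough illustration of how `letters` and `OR` interact, the following base-R sketch mimics the grouping and column naming described in the Arguments. `count_letters()` is a hypothetical helper that works on a single character string; it only approximates the documented behaviour and is not how Biostrings computes the result.

```r
# Hypothetical helper (not part of Biostrings): count letters of one
# string, either grouped (OR is a separator) or individually (OR = 0).
count_letters <- function(x, letters, OR = "|") {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  or_off <- is.numeric(OR) && OR == 0
  if (or_off) {
    # OR = 0: every character of every element is counted on its own.
    groups <- as.list(unlist(strsplit(letters, "", fixed = TRUE)))
  } else {
    # Otherwise each element of `letters` is one group counted in common.
    groups <- strsplit(letters, "", fixed = TRUE)
  }
  sep <- if (or_off) "" else OR
  counts <- vapply(groups, function(g) sum(chars %in% g), integer(1))
  names(counts) <- vapply(groups, paste, character(1), collapse = sep)
  counts
}

x <- "AACCCGGT"
count_letters(x, c("A", "CG"))      # A = 2, C|G = 5
count_letters(x, "ACGT", OR = 0)    # A = 2, C = 3, G = 2, T = 1
```

The grouped call returns fewer columns than the per-letter call, which is the sense in which the result is "more compact".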

letterFrequencyInSlidingView yields the same result, on the sequence x, that letterFrequency would, if applied to the hypothetical (and possibly huge) XStringViews object consisting of all the intervals of length view.width on x. Taking advantage of the knowledge that successive "views" are nearly identical, for letter counting purposes, it is both lighter and faster.
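
One way to exploit the overlap between successive windows is a prefix-sum (cumulative count) pass, sketched below in base R. This is an assumption about how such a computation could be organised, not a description of the package internals; `sliding_count()` is a hypothetical helper for a single group of letters in a plain character string.

```r
# Hypothetical helper (not part of Biostrings): count the letters in
# `group` inside every window of length view.width, in one pass.
sliding_count <- function(x, view.width, group) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  L <- length(chars)
  stopifnot(view.width >= 1L, view.width <= L)
  # Prefix sums with a leading 0: csum[i + 1] = matches in chars[1..i].
  csum <- c(0L, cumsum(chars %in% group))
  starts <- seq_len(L - view.width + 1L)
  # Matches in window i = matches up to its end minus matches before it.
  csum[starts + view.width] - csum[starts]
}

x <- "AACGCGTT"
sliding_count(x, 4L, c("C", "G"))  # 2 3 4 3 2
```

The result has `nchar(x) - view.width + 1` entries, one per window, and no window is ever materialised as a separate string.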

For letterFrequencyInSlidingView, a masked (MaskedXString) object x is only supported through a cast to an (ordinary) XString, for example with unmasked (the result then includes the masked regions).

When consensusString is executed with a named character ambiguityMap argument, it weights each input string equally and assigns an equal probability to each of the base letters represented by an ambiguity letter. So for DNA and a threshold of 0.25, a "G" and an "R" would result in an "R" since 1/2 "G" + 1/2 "R" = 3/4 "G" + 1/4 "A" => "R"; two "G"'s and one "R" would result in a "G" since 2/3 "G" + 1/3 "R" = 5/6 "G" + 1/6 "A" => "G"; and one "A" and one "N" would result in an "A" since 1/2 "A" + 1/2 "N" = 5/8 "A" + 1/8 "C" + 1/8 "G" + 1/8 "T" => "A", because only "A" reaches the threshold (compare the "NNNN"/"ACTG" case in the Examples, which returns "ACTG").
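
The weighting arithmetic can be checked with a short base-R sketch. The `iupac` vector below is typed out by hand so the block is self-contained (it is meant to mirror `IUPAC_CODE_MAP`), and `consensus_letter()` is a hypothetical helper for a single alignment column, not the package's code.

```r
# Hand-written copy of the IUPAC nucleotide ambiguity codes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Upper bound on `threshold` for a named ambiguityMap (see Arguments).
1 / sum(nchar(iupac) == 1)  # 0.25

# Hypothetical helper: base probabilities and consensus for one column.
consensus_letter <- function(column, threshold = 0.25) {
  bases <- c("A", "C", "G", "T")
  prob <- setNames(numeric(4), bases)
  for (letter in column) {
    members <- strsplit(iupac[[letter]], "", fixed = TRUE)[[1]]
    # Each input letter has weight 1/length(column), split equally
    # among the base letters it stands for.
    prob[members] <- prob[members] + 1 / (length(column) * length(members))
  }
  # Bases reaching the threshold are mapped back to a single code.
  kept <- paste(bases[prob >= threshold], collapse = "")
  list(prob = prob, letter = names(iupac)[match(kept, iupac)])
}

consensus_letter(c("G", "R"))$letter       # "R"
consensus_letter(c("G", "G", "R"))$letter  # "G"
consensus_letter(c("A", "N"))$prob         # A = 5/8, C = G = T = 1/8
```

The first two calls agree with the last two lines of the Examples output, where "ACAG" with "ACAR" gives "ACAR" and adding a second "ACAG" gives "ACAG".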

alphabetFrequency returns an integer vector when x is an XString or MaskedXString object. When x is an XStringSet or XStringViews object, then it returns an integer matrix with length(x) rows where the i-th row contains the frequencies for x[[i]]. If x is a DNA or RNA input, then the returned vector is named with the letters in the alphabet. If the baseOnly argument is TRUE, then the returned vector has only 5 elements: 4 elements corresponding to the 4 nucleotides + the 'other' element.
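
The shape of the `baseOnly=TRUE` result, and the effect of `as.prob=TRUE`, can be pictured with a few lines of base R. The counts below are made up for illustration, and the pooling is a sketch of the documented result, not the package's code.

```r
# Hypothetical full-alphabet counts for one DNA sequence (made-up numbers).
full <- c(A = 6L, C = 8L, G = 4L, T = 7L, R = 1L, N = 2L)
bases <- c("A", "C", "G", "T")

# baseOnly = TRUE: keep the four nucleotides, pool everything else.
base_only <- c(full[bases], other = sum(full[setdiff(names(full), bases)]))
base_only  # A = 6, C = 8, G = 4, T = 7, other = 3

# as.prob = TRUE: the same vector expressed as proportions of the total.
base_only / sum(base_only)
```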

letterFrequency returns, similarly, an integer vector or matrix, but restricted and/or collated according to letters and OR.

letterFrequencyInSlidingView returns, for an XString object x of length (nchar) L, an integer matrix with L-view.width+1 rows, the i-th of which holds the letter frequencies of substring(x, i, i+view.width-1).

hasOnlyBaseLetters returns TRUE or FALSE indicating whether or not x contains only base letters (i.e. As, Cs, Gs and Ts for DNA input and As, Cs, Gs and Us for RNA input).

uniqueLetters returns a vector of 1-letter or empty strings. The empty string is used to represent the nul character if x happens to contain any. Note that this can only happen if the base class of x is BString.

consensusMatrix returns an integer matrix with letters as row names.

consensusString returns a standard character string.

H. Pagès and P. Aboyoun; H. Jaffee for letterFrequency and letterFrequencyInSlidingView

alphabet, coverage, oligonucleotideFrequency, countPDict, XString-class, XStringSet-class, XStringViews-class, MaskedXString-class, strsplit

## ---------------------------------------------------------------------
## alphabetFrequency()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)

alphabetFrequency(yeast1)
alphabetFrequency(yeast1, baseOnly=TRUE)

hasOnlyBaseLetters(yeast1)
uniqueLetters(yeast1)

## With input made of multiple sequences:
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
alphabetFrequency(probes[1:50], baseOnly=TRUE)
alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)

## ---------------------------------------------------------------------
## letterFrequency()
## ---------------------------------------------------------------------
letterFrequency(probes[[1]], letters="ACGT", OR=0)
base_letters <- alphabet(probes, baseOnly=TRUE)
base_letters
letterFrequency(probes[[1]], letters=base_letters, OR=0)
base_letter_freqs <- letterFrequency(probes, letters=base_letters, OR=0)
head(base_letter_freqs)
GC_content <- letterFrequency(probes, letters="CG")
head(GC_content)
letterFrequency(probes, letters="CG", collapse=TRUE)

## ---------------------------------------------------------------------
## letterFrequencyInSlidingView()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
x <- DNAString(yeastSEQCHR1)
view.width <- 48
letters <- c("A", "CG")
two_columns <- letterFrequencyInSlidingView(x, view.width, letters)
head(two_columns)
tail(two_columns)
three_columns <- letterFrequencyInSlidingView(x, view.width, letters, OR=0)
head(three_columns)
tail(three_columns)
stopifnot(identical(two_columns[ , "C|G"],
                    three_columns[ , "C"] + three_columns[ , "G"]))

## Note that, alternatively, 'three_columns' can also be obtained by
## creating the views on 'x' (as a Views object) and by calling
## alphabetFrequency() on it. But, of course, that is *much* less
## efficient (both in terms of memory and speed) than using
## letterFrequencyInSlidingView():
v <- Views(x, start=seq_len(length(x) - view.width + 1), width=view.width)
v
three_columns2 <- alphabetFrequency(v, baseOnly=TRUE)[ , c("A", "C", "G")]
stopifnot(identical(three_columns2, three_columns))

## Set the width of the view to length(x) to get the global frequencies:
letterFrequencyInSlidingView(x, letters="ACGTN", view.width=length(x), OR=0)

## ---------------------------------------------------------------------
## consensus*()
## ---------------------------------------------------------------------
## Read in ORF data:
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- readDNAStringSet(file)

## To illustrate, the following example assumes the ORF data
## to be aligned for the first 10 positions (patently false):
orf10 <- DNAStringSet(orf, end=10)
consensusMatrix(orf10, baseOnly=TRUE)

## The following example assumes the first 10 positions to be aligned
## after some incremental shifting to the right (patently false):
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)

## For the character matrix containing the "exploded" representation
## of the strings, do:
as.matrix(orf10, use.names=FALSE)

## consensusMatrix() can be used to just compute the alphabet frequency
## for each position in the input sequences:
consensusMatrix(probes, baseOnly=TRUE)

## After sorting, the first 5 probes might look similar (at least on
## their first bases):
consensusString(sort(probes)[1:5])
consensusString(sort(probes)[1:5], ambiguityMap = "N", threshold = 0.5)

## Consensus involving ambiguity letters in the input strings
consensusString(DNAStringSet(c("NNNN","ACTG")))
consensusString(DNAStringSet(c("AANN","ACTG")))
consensusString(DNAStringSet(c("ACAG","ACAR"))) 
consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) 

## ---------------------------------------------------------------------
## Relationship between consensusMatrix() and coverage()
## ---------------------------------------------------------------------
## Applying colSums() on a consensus matrix gives the coverage that
## would be obtained by piling up (after shifting) the input sequences
## on top of an (imaginary) reference sequence:
cm <- consensusMatrix(orf10, shift=0:6, width=10)
colSums(cm)

## Note that this coverage can also be obtained with:
as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))

    A     C     G     T     M     R     W     S     Y     K     V     H     D 
69830 44643 45765 69970     0     0     0     0     0     0     0     0     0 
    B     N     -     +     . 
    0     0     0     0     0 
    A     C     G     T other 
69830 44643 45765 69970     0 
[1] TRUE
[1] "A" "C" "G" "T"
       A C  G  T other
 [1,]  6 8  4  7     0
 [2,]  7 6  5  7     0
 [3,]  7 6  6  6     0
 [4,]  5 7  7  6     0
 [5,]  5 7  8  5     0
 [6,]  8 5  7  5     0
 [7,]  9 7  3  6     0
 [8,]  8 6  7  4     0
 [9,] 10 5  3  7     0
[10,] 10 6  3  6     0
[11,]  7 6  7  5     0
[12,] 11 7  3  4     0
[13,]  7 8  2  8     0
[14,]  2 7  5 11     0
[15,]  6 6  7  6     0
[16,]  5 7  6  7     0
[17,]  6 6  8  5     0
[18,]  6 6  8  5     0
[19,]  7 7  7  4     0
[20,]  6 8  6  5     0
[21,]  6 7  7  5     0
[22,] 10 5  7  3     0
[23,] 11 5  6  3     0
[24,] 10 5  4  6     0
[25,]  8 6  1 10     0
[26,]  8 9  1  7     0
[27,]  6 9  4  6     0
[28,]  5 4 11  5     0
[29,]  7 7  5  6     0
[30,]  3 7  4 11     0
[31,]  9 6  2  8     0
[32,]  6 6  5  8     0
[33,]  4 6  8  7     0
[34,]  4 7  6  8     0
[35,]  8 8  4  5     0
[36,]  8 4  7  6     0
[37,]  8 6  5  6     0
[38,]  6 6  5  8     0
[39,]  5 7  8  5     0
[40,]  8 6  6  5     0
[41,]  7 7  7  4     0
[42,]  6 4  5 10     0
[43,]  8 8  5  4     0
[44,]  4 5  7  9     0
[45,]  5 6  8  6     0
[46,]  7 8  5  5     0
[47,]  7 8  4  6     0
[48,]  6 6  7  6     0
[49,]  4 6  8  7     0
[50,]  4 5  8  8     0
      A       C       G       T   other 
1676179 1671151 1594446 1693224       0 
A C G T 
6 8 4 7 
[1] "A" "C" "G" "T"
A C G T 
6 8 4 7 
     A C G T
[1,] 6 8 4 7
[2,] 7 6 5 7
[3,] 7 6 6 6
[4,] 5 7 7 6
[5,] 5 7 8 5
[6,] 8 5 7 5
     C|G
[1,]  12
[2,]  11
[3,]  12
[4,]  14
[5,]  15
[6,]  12
    C|G 
3265597 
      A C|G
[1,] 19  29
[2,] 19  29
[3,] 19  29
[4,] 18  30
[5,] 19  29
[6,] 18  30
          A C|G
[230156,] 0  30
[230157,] 0  30
[230158,] 0  29
[230159,] 0  29
[230160,] 0  30
[230161,] 0  30
      A  C G
[1,] 19 29 0
[2,] 19 29 0
[3,] 19 29 0
[4,] 18 30 0
[5,] 19 29 0
[6,] 18 30 0
          A C  G
[230156,] 0 0 30
[230157,] 0 0 30
[230158,] 0 0 29
[230159,] 0 0 29
[230160,] 0 0 30
[230161,] 0 0 30
Views on a 230208-letter DNAString subject
subject: CCACACCACACCCACACACCCACACACCACACCA...TGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
views:
            start    end width
       [1]      1     48    48 [CCACACCACACCCACACACCCA...CCACACCACACACCACACCACA]
       [2]      2     49    48 [CACACCACACCCACACACCCAC...CACACCACACACCACACCACAC]
       [3]      3     50    48 [ACACCACACCCACACACCCACA...ACACCACACACCACACCACACC]
       [4]      4     51    48 [CACCACACCCACACACCCACAC...CACCACACACCACACCACACCC]
       [5]      5     52    48 [ACCACACCCACACACCCACACA...ACCACACACCACACCACACCCA]
       ...    ...    ...   ... ...
  [230157] 230157 230204    48 [GGTGTGGGTGTGGTGTGGTGTG...TGTGGTGTGGGTGTGGTGTGTG]
  [230158] 230158 230205    48 [GTGTGGGTGTGGTGTGGTGTGT...GTGGTGTGGGTGTGGTGTGTGT]
  [230159] 230159 230206    48 [TGTGGGTGTGGTGTGGTGTGTG...TGGTGTGGGTGTGGTGTGTGTG]
  [230160] 230160 230207    48 [GTGGGTGTGGTGTGGTGTGTGG...GGTGTGGGTGTGGTGTGTGTGG]
  [230161] 230161 230208    48 [TGGGTGTGGTGTGGTGTGTGGG...GTGTGGGTGTGGTGTGTGTGGG]
         A     C     G     T N
[1,] 69830 44643 45765 69970 0
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A        2    2    2    0    4    3    3    3    2     1
C        3    1    2    2    2    1    0    0    2     3
G        1    1    1    2    1    0    3    3    1     2
T        1    3    2    3    0    3    1    1    2     1
other    0    0    0    0    0    0    0    0    0     0
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
A        1    0    0    0    2    1    4    2    4     1     3     1     2
C        0    1    1    2    1    2    1    1    0     2     3     0     1
G        0    0    0    0    1    2    0    3    2     1     0     2     1
T        0    1    2    2    1    1    2    1    1     3     0     2     0
other    0    0    0    0    0    0    0    0    0     0     0     0     0
      [,14] [,15] [,16]
A         1     0     0
C         0     1     0
G         2     0     1
T         0     1     0
other     0     0     0
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A        1    0    0    0    2    1    4    2    4     1
C        0    1    1    2    1    2    1    1    0     2
G        0    0    0    0    1    2    0    3    2     1
T        0    1    2    2    1    1    2    1    1     3
other    0    0    0    0    0    0    0    0    0     0
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "C"  "T"  "T"  "G"  "T"  "A"  "A"  "A"  "T"  
[2,] "T"  "T"  "C"  "C"  "A"  "A"  "G"  "G"  "C"  "C"  
[3,] "C"  "T"  "T"  "C"  "A"  "T"  "G"  "T"  "C"  "A"  
[4,] "C"  "A"  "C"  "T"  "C"  "A"  "T"  "A"  "T"  "C"  
[5,] "A"  "G"  "A"  "G"  "A"  "A"  "A"  "G"  "A"  "G"  
[6,] "G"  "T"  "G"  "T"  "C"  "C"  "G"  "G"  "G"  "C"  
[7,] "C"  "A"  "A"  "G"  "A"  "T"  "A"  "A"  "T"  "G"  
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
A     89191 92082 75796 66864 58251 60832 59048 57840 58947 59089 61149 65433
C     22497 44721 59982 63857 76992 75497 72628 75823 79133 75208 72788 70599
G     98027 64220 67537 67201 69874 60639 66245 57260 61527 67086 62131 55959
T     55685 64377 62085 67478 60283 68432 67479 74477 65793 64017 69332 73409
other     0     0     0     0     0     0     0     0     0     0     0     0
      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
A     72617 68041 56147 59231 61046 58223 59769 57775 61656 63262 71305 88914
C     54975 66917 73696 71279 70130 74022 74487 76839 75608 73229 70606 54117
G     50535 51712 67555 67205 62632 66451 66206 65394 63533 65623 60266 58448
T     87273 78730 68002 67685 71592 66704 64938 65392 64603 63286 63223 63921
other     0     0     0     0     0     0     0     0     0     0     0     0
      [,25]
A     93671
C     45521
G     51180
T     75028
other     0
[1] "AAAAAACARSCYYMRGSMSGYTYRW"
[1] "AAAAAACANNCNCNAGNAGNCNCNN"
[1] "ACTG"
[1] "AMTG"
[1] "ACAR"
[1] "ACAG"
 [1] 1 2 3 4 5 6 7 7 7 7
 [1] 1 2 3 4 5 6 7 7 7 7