convert_type | R Documentation |
Switch between position count matrix (PCM), position probability matrix
(PPM), position weight matrix (PWM), and information content matrix (ICM)
types. See the "Introduction to sequence motifs" vignette for details.
Note that type conversion occurs implicitly throughout the universalmotif
package, so there is generally no need to perform this conversion
manually. The message concerning pseudocount-adjusted motifs can be
disabled via options(pseudocount.warning=FALSE).
convert_type(motifs, type, pseudocount, nsize_correction = FALSE,
relative_entropy = FALSE)
motifs |
See |
type |
|
pseudocount |
|
nsize_correction |
|
relative_entropy |
|
Position count matrix (PCM), also known as position frequency matrix (PFM). For n sequences from which the motif was built, each position is represented by the count of each letter at that position. In theory all positions should have sums equal to n, but not all databases are this consistent. If converting from another type to PCM, column sums will be equal to the 'nsites' slot. If that slot is empty, 100 is used.
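The conversion back to counts described above can be illustrated with a minimal base R sketch. The helper name ppm_to_pcm and its nsites argument are hypothetical and only mirror the description (scale each column to 'nsites', falling back to 100); this is not the package's internal code, and rounding may leave column sums slightly off for less convenient probabilities.

```r
# Hypothetical helper: scale PPM columns to counts.
# Falls back to 100 sites when 'nsites' is missing, as described in the text.
ppm_to_pcm <- function(ppm, nsites = NULL) {
  if (is.null(nsites) || length(nsites) == 0) nsites <- 100
  round(ppm * nsites)
}

ppm <- matrix(c(0.70, 0.10, 0.10, 0.10,
                0.25, 0.25, 0.25, 0.25),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

pcm_default <- ppm_to_pcm(ppm)               # no 'nsites': 100 is used
pcm_20      <- ppm_to_pcm(ppm, nsites = 20)  # explicit number of sites

colSums(pcm_default)
colSums(pcm_20)
```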
Position probability matrix (PPM), also known as position frequency
matrix (PFM). At each position, the probability of individual letters
is calculated by dividing the count for that letter by the total sum of
counts at that position (letter_count / position_total).
As a result, each position will sum to 1. Letters with counts of 0 will
thus have a probability of 0, which can be undesirable when searching for
motifs in a set of sequences. To avoid this a pseudocount can be added
((letter_count + pseudocount) / (position_total + pseudocount)).
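As a rough illustration of the PCM to PPM step, the following base R sketch uses a hypothetical helper, pcm_to_ppm. One assumption is made loudly here: the pseudocount is spread evenly over the letters in the numerator so that each position still sums to 1. That choice belongs to this example and is not a statement about how the package handles pseudocounts internally.

```r
# Hypothetical helper: convert a count matrix (letters in rows,
# positions in columns) to probabilities.
# ASSUMPTION: the pseudocount is divided evenly among the letters so that
# every column still sums to 1.
pcm_to_ppm <- function(pcm, pseudocount = 0) {
  apply(pcm, 2, function(counts) {
    (counts + pseudocount / length(counts)) / (sum(counts) + pseudocount)
  })
}

pcm <- matrix(c(8, 0, 1, 1,
                3, 3, 2, 2),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

ppm_raw    <- pcm_to_ppm(pcm)                   # "C" at position 1 stays 0
ppm_pseudo <- pcm_to_ppm(pcm, pseudocount = 1)  # no zero probabilities left

colSums(ppm_raw)
colSums(ppm_pseudo)
```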
Position weight matrix (PWM; Stormo et al. (1982)),
also known as position-specific weight
matrix (PSWM), position-specific scoring matrix (PSSM), or
log-odds matrix. At each position, each letter is represented by its
log-likelihood (log2(letter_probability / background_probability)),
which is normalized using the background letter frequencies. A PWM
is constructed from a PPM. If any position has 0-probability letters to
which pseudocounts were not added, then the final log-likelihood of these
letters will be -Inf.
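The log-odds formula above, including the -Inf outcome for zero probabilities, can be seen in a short base R sketch. The helper ppm_to_pwm is hypothetical, and a uniform background is assumed by default purely for illustration.

```r
# Hypothetical helper: log2 odds of each letter against a background.
# 'bkg' holds one background probability per letter (row); it is recycled
# down each column of 'ppm'. A uniform background is an assumption here.
ppm_to_pwm <- function(ppm, bkg = rep(1 / nrow(ppm), nrow(ppm))) {
  log2(ppm / bkg)
}

ppm <- matrix(c(0.80, 0.00, 0.10, 0.10,
                0.25, 0.25, 0.25, 0.25),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

pwm <- ppm_to_pwm(ppm)
pwm  # "C" at position 1 has probability 0, so its score is -Inf
```

Letters that match the background frequency exactly score 0, which is why the second position in this example is all zeros.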
Information content matrix (ICM; Schneider and Stephens 1990).
An ICM is a PPM where each letter probability is multiplied by the total
information content at that position. The information content of each
position is determined as totalIC - Hi, where the total information is
totalIC <- log2(alphabet_length)
and the Shannon entropy (Shannon 1948) for a specific position (Hi) is
Hi <- -sum(sapply(alphabet_frequencies, function(x) x * log2(x)))
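The calculation above can be written out as a small base R sketch. The helper names position_ic and ppm_to_icm are hypothetical; the sketch also adopts the usual convention that 0 * log2(0) counts as 0, which has to be handled explicitly because R evaluates that product to NaN.

```r
# Hypothetical helper: Shannon-based information content of one position.
# Zero frequencies are dropped, following the convention 0 * log2(0) = 0.
position_ic <- function(freqs) {
  nonzero  <- freqs[freqs > 0]
  total_ic <- log2(length(freqs))          # maximum possible content
  h        <- -sum(nonzero * log2(nonzero))  # Shannon entropy
  total_ic - h
}

# Hypothetical helper: scale each PPM column by its information content.
ppm_to_icm <- function(ppm) {
  ic <- apply(ppm, 2, position_ic)
  sweep(ppm, 2, ic, "*")
}

ppm <- matrix(c(1.00, 0.00, 0.00, 0.00,
                0.25, 0.25, 0.25, 0.25,
                0.50, 0.50, 0.00, 0.00),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

icm <- ppm_to_icm(ppm)
colSums(icm)  # column heights equal the per-position information content
```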
As a result, the total sum or height of each position is representative of
its sequence conservation, measured in the unit 'bits', which is a unit
of energy (Schneider 1991; see
https://fr-s-schneider.ncifcrf.gov/logorecommendations.html
for more information). However, not all programs calculate
information content the same way. Some will 'correct' the total information
content at each position using a correction factor as described by
Schneider et al. (1986). This correction can be
applied by setting nsize_correction = TRUE; however, it will only
be applied if the 'nsites' slot is not empty. This is done using
TFBSTools:::schneider_correction
(Tan and Lenhard 2016). As such, converting from an ICM to
which some form of correction has been applied will result in a
PCM/PPM/PWM with slight inaccuracies.
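To see why a correction affects the reverse conversion, consider a minimal base R sketch of recovering a PPM from an uncorrected, Shannon-based ICM. The helper icm_to_ppm is hypothetical and is not the package's implementation. Because each ICM column is a PPM column multiplied by a single number, dividing by the column sum returns the probabilities; positions with zero height carry no recoverable signal, and the sketch assumes they were uniform.

```r
# Hypothetical helper: undo the per-position scaling of an ICM.
# ASSUMPTION: the ICM was built from Shannon entropy without any
# small-sample correction, and zero-height positions were uniform.
icm_to_ppm <- function(icm) {
  heights <- colSums(icm)
  ppm <- sweep(icm, 2, heights, "/")
  ppm[, heights < 1e-12] <- 1 / nrow(icm)
  ppm
}

# Columns: a fully conserved A (2 bits), a uniform position (0 bits),
# and an even A/C split (1 bit).
icm <- matrix(c(2.0, 0.0, 0.0, 0.0,
                0.0, 0.0, 0.0, 0.0,
                0.5, 0.5, 0.0, 0.0),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

ppm <- icm_to_ppm(icm)
ppm
```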
Another method of calculating information content is calculating the
relative entropy, also known as Kullback-Leibler divergence
(Kullback and Leibler 1951). This accounts for background
frequencies, which
can be useful for genomes with a heavy imbalance in letter frequencies.
For each position, the contribution of each letter is calculated as
letter_freq * log2(letter_freq / bkg_freq). When calculating
information content using Shannon entropy, the maximum content for
each position will always be log2(alphabet_length). This does
not hold for information content calculated as relative entropy.
Please note that conversion from ICM assumes the information content
was not calculated as relative entropy.
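The difference between the two measures can be checked numerically with a short base R sketch. The helper relative_entropy and the AT-rich background used below are hypothetical choices made for illustration only; letters with zero frequency are given a contribution of 0.

```r
# Hypothetical helper: per-letter relative entropy contributions for one
# position, given background frequencies 'bkg' in the same letter order.
relative_entropy <- function(freqs, bkg) {
  heights <- freqs * log2(freqs / bkg)
  heights[freqs == 0] <- 0  # convention: zero-frequency letters add nothing
  heights
}

uniform_bkg <- rep(0.25, 4)
at_rich_bkg <- c(A = 0.4, C = 0.1, G = 0.1, T = 0.4)  # illustrative values

even_split  <- c(A = 0.5, C = 0.5, G = 0, T = 0)
conserved_c <- c(A = 0, C = 1, G = 0, T = 0)

# With a uniform background, the total matches the Shannon-based content
# (log2(4) minus an entropy of 1 bit).
ic_uniform <- sum(relative_entropy(even_split, uniform_bkg))

# With a skewed background, a rare conserved letter can exceed log2(4).
ic_skewed <- sum(relative_entropy(conserved_c, at_rich_bkg))

ic_uniform
ic_skewed
```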
See convert_motifs() for possible output motif objects.
Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca
Kullback S, Leibler RA (1951). “On information and sufficiency.” The Annals of Mathematical Statistics, 22, 79-86.
Nishida K, Frith MC, Nakai K (2009). “Pseudocounts for transcription factor binding sites.” Nucleic Acids Research, 37, 939-944.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986). “Information content of binding sites on nucleotide sequences.” Journal of Molecular Biology, 188, 415-431.
Schneider TD, Stephens RM (1990). “Sequence Logos: A New Way to Display Consensus Sequences.” Nucleic Acids Research, 18, 6097-6100.
Schneider TD (1991). “Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines.” Journal of Theoretical Biology, 148, 125-137.
Shannon CE (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27, 379-423.
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982). “Use of the Perceptron algorithm to distinguish translational initiation sites in E. coli.” Nucleic Acids Research, 10, 2997-3011.
Tan G, Lenhard B (2016). “TFBSTools: an R/Bioconductor package for transcription factor binding site analysis.” Bioinformatics, 32, 1555-1556. doi: 10.1093/bioinformatics/btw024.
convert_motifs()
library(universalmotif)
jaspar.pcm <- read_jaspar(system.file("extdata", "jaspar.txt",
                                      package = "universalmotif"))
## The motifs' pseudocounts are 1: these will be used in the PCM->PPM
## calculation
jaspar.ppm <- convert_type(jaspar.pcm, type = "PPM")
## Setting pseudocount to 0 will prevent any correction from being
## applied to PPM/PWM matrices, overriding the motifs' own pseudocounts
jaspar.pwm <- convert_type(jaspar.pcm, type = "PWM", pseudocount = 0)