Switch between position count matrix (PCM), position probability matrix (PPM), position weight matrix (PWM), and information content matrix (ICM) types. See the "Introduction to sequence motifs" vignette for details.
convert_type(motifs, type, pseudocount, nsize_correction = FALSE,
  relative_entropy = FALSE)
motifs |
See |
type |
|
pseudocount |
|
nsize_correction |
|
relative_entropy |
|
Position count matrix (PCM), also known as position frequency matrix (PFM). For n sequences from which the motif was built, each position is represented by the count of each letter at that position. In theory all positions should have sums equal to n, but not all databases are this consistent. If converting from another type to PCM, column sums will be equal to the 'nsites' slot. If that slot is empty, 100 is used.
Position probability matrix (PPM), also known as position frequency
matrix (PFM). At each position, the probability of individual letters
is calculated by dividing the count for that letter by the total sum of
counts at that position (letter_count / position_total).
As a result, each position will sum to 1. Letters with counts of 0 will
thus have a probability of 0, which can be undesirable when searching for
motifs in a set of sequences. To avoid this, a pseudocount can be added
((letter_count + pseudocount) / (position_total + pseudocount)).
Position weight matrix (PWM; \insertCite{pwm;textual}{universalmotif}),
also known as position-specific weight
matrix (PSWM), position-specific scoring matrix (PSSM), or
log-odds matrix. At each position, each letter is represented by its
log-likelihood (log2(letter_probability / background_probability)),
which is normalized using the background letter frequencies. A PWM
is constructed from a PPM. If any position has 0-probability letters to
which pseudocounts were not added, then the final log-likelihood of these
letters will be -Inf.
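As a rough illustration of the log-odds calculation, the sketch below converts a toy PPM to a PWM in base R. The helper `ppm_to_pwm_sketch` and the uniform default background are assumptions made for this example, not the package's own code.

```r
# Hypothetical helper (not part of universalmotif): PPM -> PWM as
# log2(letter_probability / background_probability).
ppm_to_pwm_sketch <- function(ppm, bkg = rep(1 / nrow(ppm), nrow(ppm))) {
  # Divide each row (letter) by that letter's background probability
  log2(sweep(ppm, 1, bkg, "/"))
}

# Made-up PPM: position 1 is always A, position 2 is uniform
ppm <- matrix(c(1.00, 0.00, 0.00, 0.00,
                0.25, 0.25, 0.25, 0.25),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

pwm <- ppm_to_pwm_sketch(ppm)
pwm[, 1]   # A scores log2(1 / 0.25) = 2; zero-probability letters score -Inf
pwm[, 2]   # letters that match the background score 0
```

The `-Inf` entries in the first position are what a pseudocount applied at the PPM step is meant to avoid.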
Information content matrix (ICM; \insertCite{icm;textual}{universalmotif}).
An ICM is a PPM where each letter probability is multiplied by the total
information content at that position. The information content of each
position is determined as totalIC - Hi, where the total information is
totalIC <- log2(alphabet_length)
and the Shannon entropy \insertCite{shannon}{universalmotif} for a specific
position (Hi) is
Hi <- -sum(sapply(alphabet_frequencies, function(x) x * log2(x)))
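A minimal base R sketch of this calculation is given below. The helper name `ppm_to_icm_sketch` is hypothetical, and treating 0 * log2(0) as 0 is a convention assumed here so that zero-probability letters do not produce NaN.

```r
# Hypothetical helper (not part of universalmotif): PPM -> ICM using
# Shannon entropy, with no small-sample correction.
ppm_to_icm_sketch <- function(ppm) {
  total_ic <- log2(nrow(ppm))
  # Shannon entropy per position; zero-probability letters are dropped,
  # which is equivalent to treating 0 * log2(0) as 0
  h <- apply(ppm, 2, function(p) {
    p <- p[p > 0]
    -sum(p * log2(p))
  })
  # Scale each column of the PPM by that position's information content
  sweep(ppm, 2, total_ic - h, "*")
}

# Made-up PPM: fully conserved, uniform, and two equally likely letters
ppm <- matrix(c(1.00, 0.00, 0.00, 0.00,
                0.25, 0.25, 0.25, 0.25,
                0.50, 0.50, 0.00, 0.00),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

icm <- ppm_to_icm_sketch(ppm)
colSums(icm)   # 2, 0 and 1 bits for the three positions
```

The column sums show the range for DNA: a fully conserved position reaches log2(4) = 2 bits, and a uniform position carries no information.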
As a result, the total sum or height of each position is representative of
its sequence conservation, measured in the unit 'bits', which is a unit
of energy (\insertCite{bits;textual}{universalmotif}; see
https://fr-s-schneider.ncifcrf.gov/logorecommendations.html
for more information). However, not all programs calculate
information content the same way. Some will 'correct' the total information
content at each position using a correction factor as described by
\insertCite{correction;textual}{universalmotif}. This correction can be
applied by setting nsize_correction = TRUE; however, it will only
be applied if the 'nsites' slot is not empty. This is done using
TFBSTools:::schneider_correction
\insertCite{tfbstools}{universalmotif}. As such, converting from an ICM to
which some form of correction has been applied will result in a
PCM/PPM/PWM with slight inaccuracies.
Another method of calculating information content is calculating the
relative entropy, also known as Kullback-Leibler divergence
\insertCite{kl}{universalmotif}. This accounts for background
frequencies, which
can be useful for genomes with a heavy imbalance in letter frequencies.
For each position, the contribution of each individual letter is calculated as
letter_freq * log2(letter_freq / bkg_freq). When calculating
information content using Shannon entropy, the maximum content for
each position will always be log2(alphabet_length). This does
not hold for information content calculated as relative entropy.
Please note that conversion from ICM assumes the information content
was not calculated as relative entropy.
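The relative entropy formula can be sketched in base R as shown below. The helper `ppm_to_relent_sketch`, the AT-rich background, and the toy PPM are all assumptions made for illustration; the package's own handling of edge cases may differ.

```r
# Hypothetical helper (not part of universalmotif): per-letter relative
# entropy, letter_freq * log2(letter_freq / bkg_freq).
ppm_to_relent_sketch <- function(ppm, bkg) {
  heights <- ppm * log2(sweep(ppm, 1, bkg, "/"))
  heights[ppm == 0] <- 0   # assumed convention: 0 * log2(0) counts as 0
  heights
}

# Made-up AT-rich background and a PPM with two positions:
# position 1 is always C, position 2 matches the background exactly
bkg <- c(A = 0.4, C = 0.1, G = 0.1, T = 0.4)
ppm <- matrix(c(0.0, 1.0, 0.0, 0.0,
                0.4, 0.1, 0.1, 0.4),
              nrow = 4,
              dimnames = list(names(bkg), NULL))

relent <- ppm_to_relent_sketch(ppm, bkg)
colSums(relent)   # position 1 exceeds log2(4) = 2 bits; position 2 is 0
```

Position 1 sums to log2(1 / 0.1), about 3.32 bits, which is above the Shannon maximum of 2 bits for a four-letter alphabet.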
See convert_motifs() for possible output motif objects.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
\insertRef{kl}{universalmotif}
\insertRef{pseudo}{universalmotif}
\insertRef{correction}{universalmotif}
\insertRef{icm}{universalmotif}
\insertRef{bits}{universalmotif}
\insertRef{shannon}{universalmotif}
\insertRef{pwm}{universalmotif}
\insertRef{tfbstools}{universalmotif}
jaspar.pcm <- read_jaspar(system.file("extdata", "jaspar.txt",
                                      package = "universalmotif"))

## The motifs' pseudocounts are 1: these will be used in the PCM->PPM
## calculation
jaspar.ppm <- convert_type(jaspar.pcm, type = "PPM")

## Setting pseudocount to 0 will prevent any correction from being
## applied to PPM/PWM matrices, overriding the motifs' own pseudocounts
jaspar.pwm <- convert_type(jaspar.pcm, type = "PWM", pseudocount = 0)