convert_type: Convert universalmotif type.


View source: R/convert_type.R

Description

Switch between position count matrix (PCM), position probability matrix (PPM), position weight matrix (PWM), and information content matrix (ICM) types. See the "Introduction to sequence motifs" vignette for details.

Usage

convert_type(motifs, type, pseudocount, nsize_correction = FALSE,
  relative_entropy = FALSE)

Arguments

motifs

See convert_motifs() for acceptable formats.

type

character(1) One of c('PCM', 'PPM', 'PWM', 'ICM').

pseudocount

numeric(1) Correction to be applied to prevent -Inf from appearing in PWM matrices. If missing, the pseudocount stored in the universalmotif 'pseudocount' slot will be used.

nsize_correction

logical(1) If true, the ICM at each position will be corrected to account for small sample sizes. Only used if relative_entropy = FALSE.

relative_entropy

logical(1) If true, the ICM will be calculated as relative entropy. See details.

Details

Position count matrix (PCM), also known as position frequency matrix (PFM). For n sequences from which the motif was built, each position is represented by the count of each letter at that position. In theory all positions should have sums equal to n, but not all databases are this consistent. If converting from another type to PCM, column sums will be equal to the 'nsites' slot. If that slot is empty, 100 is used.
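The conversion back to counts described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the helper name ppm_to_pcm and the example matrix are hypothetical, and simple rounding is an assumption (for other inputs, rounding can make column sums drift slightly from nsites).

```r
# Hypothetical PPM: rows are letters, columns are motif positions
ppm <- matrix(c(0.70, 0.10, 0.10, 0.10,
                0.25, 0.25, 0.25, 0.25),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Scale probabilities to counts; 100 mirrors the documented fallback
# used when the 'nsites' slot is empty
ppm_to_pcm <- function(ppm, nsites = 100) {
  round(ppm * nsites)
}

pcm <- ppm_to_pcm(ppm)
colSums(pcm)
```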

Position probability matrix (PPM), also known as position frequency matrix (PFM). At each position, the probability of individual letters is calculated by dividing the count for that letter by the total sum of counts at that position (letter_count / position_total). As a result, each position will sum to 1. Letters with counts of 0 will thus have a probability of 0, which can be undesirable when searching for motifs in a set of sequences. To avoid this, a pseudocount can be added ((letter_count + pseudocount) / (position_total + pseudocount)).
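A small base R sketch of the PCM to PPM step may help. The helper pcm_to_ppm and the example counts are hypothetical. One assumption is labeled here and in the code: the pseudocount is split evenly across the letters, so that each column still sums to 1 after correction. The package's exact internal formula may differ.

```r
# Hypothetical PCM: rows are letters, columns are motif positions
pcm <- matrix(c(8, 0, 1, 1,
                2, 2, 3, 3),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

pcm_to_ppm <- function(pcm, pseudocount = 0) {
  n_letters <- nrow(pcm)
  # Assumption: each letter receives pseudocount / n_letters, and the
  # column total grows by the full pseudocount
  sweep(pcm + pseudocount / n_letters, 2,
        colSums(pcm) + pseudocount, "/")
}

ppm_raw <- pcm_to_ppm(pcm)                   # "C" at position 1 stays at 0
ppm_adj <- pcm_to_ppm(pcm, pseudocount = 1)  # "C" at position 1 becomes > 0
```

With pseudocount = 0 the result is plain letter_count / position_total; any positive pseudocount removes the zero probabilities.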

Position weight matrix (PWM; \insertCite{pwm;textual}{universalmotif}), also known as position-specific weight matrix (PSWM), position-specific scoring matrix (PSSM), or log-odds matrix. At each position, each letter is represented by its log-likelihood (log2(letter_probability / background_probability)), which is normalized using the background letter frequencies. A PWM is constructed from a PPM. If any position has 0-probability letters to which pseudocounts were not added, then the final log-likelihood of these letters will be -Inf.
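The log-odds calculation can be illustrated with a few lines of base R. The helper ppm_to_pwm, the example PPM, and the uniform background are hypothetical values chosen for illustration; the sketch only applies the log2(letter_probability / background_probability) formula given above.

```r
# Hypothetical PPM; position 3 contains zero-probability letters
ppm <- matrix(c(0.70, 0.10, 0.10, 0.10,
                0.25, 0.25, 0.25, 0.25,
                1.00, 0.00, 0.00, 0.00),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Assumed uniform background, one value per letter (row)
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)

ppm_to_pwm <- function(ppm, bkg) {
  # bkg has one entry per row, so it recycles down each column
  log2(ppm / bkg)
}

pwm <- ppm_to_pwm(ppm, bkg)
pwm
```

Position 2 matches the background exactly, so all of its scores are 0, while the zero-probability letters at position 3 produce -Inf, which is the case a pseudocount is meant to prevent.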

Information content matrix (ICM; \insertCite{icm;textual}{universalmotif}). An ICM is a PPM where each letter probability is multiplied by the total information content at that position. The information content of each position is determined as totalIC - Hi, where the total information content is

totalIC <- log2(alphabet_length)

and the Shannon entropy \insertCite{shannon}{universalmotif} for a specific position (Hi) is

Hi <- -sum(sapply(alphabet_frequencies, function(x) x * log2(x)))
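The Shannon-entropy calculation can be sketched in base R as follows. The helper ppm_to_icm and the example matrix are hypothetical, and no small-sample correction is applied. Zero probabilities are dropped before taking logs, using the usual convention that 0 * log2(0) counts as 0.

```r
# Hypothetical PPM: rows are letters, columns are motif positions
ppm <- matrix(c(0.70, 0.10, 0.10, 0.10,
                0.25, 0.25, 0.25, 0.25,
                1.00, 0.00, 0.00, 0.00),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

ppm_to_icm <- function(ppm) {
  total_ic <- log2(nrow(ppm))            # maximum information per position
  h <- apply(ppm, 2, function(p) {
    p <- p[p > 0]                        # skip zeros to avoid 0 * -Inf = NaN
    -sum(p * log2(p))                    # Shannon entropy of the position
  })
  # Scale each column of the PPM by that position's information content
  sweep(ppm, 2, total_ic - h, "*")
}

icm <- ppm_to_icm(ppm)
colSums(icm)
```

The uniform position carries no information (height 0), and the fully conserved position reaches the maximum of log2(4) = 2 bits.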

As a result, the total sum or height of each position is representative of its sequence conservation, measured in the unit 'bits', which is a unit of information (\insertCite{bits;textual}{universalmotif}; see https://fr-s-schneider.ncifcrf.gov/logorecommendations.html for more information). However, not all programs calculate information content the same way. Some will 'correct' the total information content at each position using a correction factor as described by \insertCite{correction;textual}{universalmotif}. This correction can be applied by setting nsize_correction = TRUE; however, it will only be applied if the 'nsites' slot is not empty. This is done using TFBSTools:::schneider_correction \insertCite{tfbstools}{universalmotif}. As such, converting from an ICM to which some form of correction has been applied will result in a PCM/PPM/PWM with slight inaccuracies.

Another method of calculating information content is calculating the relative entropy, also known as Kullback-Leibler divergence \insertCite{kl}{universalmotif}. This accounts for background frequencies, which can be useful for genomes with a heavy imbalance in letter frequencies. For each position, the individual letter heights are calculated as letter_freq * log2(letter_freq / bkg_freq). When calculating information content using Shannon entropy, the maximum content for each position will always be log2(alphabet_length). This does not hold for information content calculated as relative entropy. Please note that conversion from ICM assumes the information content was not calculated as relative entropy.
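To make the difference from the Shannon-entropy approach concrete, here is a minimal base R sketch of the relative entropy formula above. The helper ppm_to_icm_relative, the example PPM, and both background vectors are hypothetical. Treating the contribution of a zero-probability letter as 0 is an assumption of this sketch, since 0 * log2(0) evaluates to NaN in R.

```r
# Hypothetical PPM: position 1 is fully conserved, position 2 is uniform
ppm <- matrix(c(1.00, 0.00, 0.00, 0.00,
                0.25, 0.25, 0.25, 0.25),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

ppm_to_icm_relative <- function(ppm, bkg) {
  # letter_freq * log2(letter_freq / bkg_freq); bkg recycles down each column
  heights <- ppm * log2(ppm / bkg)
  heights[ppm == 0] <- 0   # assumption: zero-probability letters contribute 0
  heights
}

bkg_uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
bkg_skewed  <- c(A = 0.10, C = 0.40, G = 0.40, T = 0.10)

colSums(ppm_to_icm_relative(ppm, bkg_uniform))
colSums(ppm_to_icm_relative(ppm, bkg_skewed))
```

With the uniform background the conserved position totals log2(4) = 2, matching the Shannon-entropy result. With the skewed background it totals log2(10), which exceeds log2(alphabet_length), and individual letter heights can be negative.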

Value

See convert_motifs() for possible output motif objects.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

\insertRef{kl}{universalmotif}

\insertRef{pseudo}{universalmotif}

\insertRef{correction}{universalmotif}

\insertRef{icm}{universalmotif}

\insertRef{bits}{universalmotif}

\insertRef{shannon}{universalmotif}

\insertRef{pwm}{universalmotif}

\insertRef{tfbstools}{universalmotif}

See Also

convert_motifs()

Examples

jaspar.pcm <- read_jaspar(system.file("extdata", "jaspar.txt",
                                      package = "universalmotif"))

## The motifs' pseudocounts are 1: these will be used in the PCM->PPM
## calculation
jaspar.ppm <- convert_type(jaspar.pcm, type = "PPM")

## Setting pseudocount to 0 will prevent any correction from being
## applied to PPM/PWM matrices, overriding the motifs' own pseudocounts
jaspar.pwm <- convert_type(jaspar.pcm, type = "PWM", pseudocount = 0)

universalmotif documentation built on April 8, 2021, 6 p.m.