motif_pvalue: Motif P-value and scoring utility

Description

Calculate p-values or logodds scores for any number of motifs.

Usage

motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 8,
  nthreads = 1, rand.tries = 10, rng.seed = sample.int(10000, 1),
  allow.nonfinite = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

score

numeric Get a p-value for a motif from a logodds score.

pvalue

numeric Get a logodds score for a motif from a p-value.

bkg.probs

numeric, list If supplying individual background probabilities for each motif, a list. If missing, retrieves the background from the motif bkg slot. Note that this only influences calculating p-values from an input score; calculating a score from an input p-value currently assumes a uniform background.

use.freq

numeric(1) The default, 1, uses the regular motif matrix; higher values use the corresponding multifreq matrix. The maximum is 3.

k

numeric(1) For speed, scores/p-values can be approximated after subsetting the motif every k columns. If k is equal to or greater than the size of the input motif(s), then the calculations are (nearly) exact. The default, 8, is recommended as a good tradeoff between speed and accuracy for jobs requiring repeated calculations.

nthreads

numeric(1) Run motif_pvalue() in parallel with nthreads threads. nthreads = 0 uses all available threads.

rand.tries

numeric(1) When k < ncol(motif), an approximation is used. This involves randomly approximating the overall motif score distribution. To increase accuracy, the distribution is approximated rand.tries times and the final scores are averaged.

rng.seed

numeric(1) To allow motif_pvalue() to perform C++ level parallelisation, it must work independently from R. This means it cannot communicate with R to get or set the R RNG state. To get around this, the RNG seed used by the C++ function can be set with rng.seed. To make sure each thread gets a different seed, however, the seed is multiplied by the iteration count. For example, when working with two motifs, the second motif gets the seed rng.seed * 2. The default is to pick a random number with sample.int(), which effectively makes motif_pvalue() dependent on the R RNG state.

allow.nonfinite

logical(1) If FALSE, then apply a pseudocount if non-finite values are found in the PWM. Note that if the motif has a pseudocount greater than zero and the motif is not currently of type PWM, then this parameter has no effect as the pseudocount will be applied automatically when the motif is converted to a PWM internally. This value is set to FALSE by default in order to stay consistent with pre-version 1.8.0 behaviour.
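
The effect of a pseudocount on non-finite PWM values can be illustrated with a small base R sketch. The count matrix, the helper to_pwm() and the pseudocount formula (pseudocount spread evenly over the four letters) are assumptions made for illustration and may differ from what the package does internally.

```r
# Toy count matrix (rows: A, C, G, T; columns: positions); values are made up.
counts <- matrix(c(10, 0, 0, 0,
                    0, 0, 5, 5),
                 nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Hypothetical helper: counts -> log-odds PWM with an optional pseudocount.
to_pwm <- function(counts, pseudocount = 0, bkg = 0.25) {
  nsites <- colSums(counts)
  # Add pseudocount / 4 to every cell, then renormalise each column.
  ppm <- sweep(counts + pseudocount / nrow(counts), 2, nsites + pseudocount, "/")
  log2(ppm / bkg)
}

# Zero counts give log2(0) = -Inf without a pseudocount.
has_nonfinite_raw <- any(!is.finite(to_pwm(counts)))
# With a pseudocount every probability is positive, so all scores are finite.
has_nonfinite_pc <- any(!is.finite(to_pwm(counts, pseudocount = 1)))
```

Here has_nonfinite_raw is TRUE and has_nonfinite_pc is FALSE, which is the situation allow.nonfinite = FALSE is meant to guard against.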

Details

Calculating p-values for motifs can be very computationally intensive. This is due to how p-values must be calculated: for a given score, all possible sequences which score equal to or higher than it must be found, and the probability of each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences scoring higher than a given score can be done efficiently with a branch-and-bound algorithm, but as the motif length increases even this calculation becomes impractical. To get around this, the p-value calculation can be approximated.
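
The definition above can be written out as a brute-force sketch in base R. This enumerates every sequence rather than using branch-and-bound, so it is only practical for very short motifs; the toy matrix and the helper brute_force_pvalue() are hypothetical and not part of the package.

```r
# Toy position probability matrix (rows: A, C, G, T); values are made up.
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7,
                0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
pwm <- log2(ppm / bkg)  # log-odds scores; bkg is recycled down each column

brute_force_pvalue <- function(pwm, bkg, score) {
  n <- ncol(pwm)
  # One row per possible sequence, stored as row indices into the PWM.
  seqs <- as.matrix(expand.grid(rep(list(seq_len(nrow(pwm))), n)))
  idx <- cbind(as.vector(seqs), as.vector(col(seqs)))
  # Score of each sequence: sum of the selected PWM cells.
  seq_scores <- rowSums(matrix(pwm[idx], ncol = n))
  # Probability of each sequence under the background model.
  seq_probs <- apply(matrix(bkg[as.vector(seqs)], ncol = n), 1, prod)
  sum(seq_probs[seq_scores >= score])
}

# Only the consensus ATAT reaches the maximum score, so the p-value is 0.25^4.
max_score <- sum(apply(pwm, 2, max))
p_max <- brute_force_pvalue(pwm, bkg, max_score - 1e-9)
```

The number of rows in seqs grows as 4^n, which is why this approach stops being feasible well before typical motif lengths are reached.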

To calculate p-values for longer motifs, this function uses the approximation proposed by \insertCite{pvalues;textual}{universalmotif}: the motif is split into subsets, p-values are calculated for each subset, and the results are combined into a total p-value. The smaller the subsets, the faster the calculation, but the coarser the approximation. The subset size is controlled by k. For smaller motifs (< 13 positions), exact p-values can be calculated in reasonable time by setting k = 12.
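
The general idea of splitting a motif into subsets and combining the results can be sketched as follows. This is not the algorithm of the cited reference nor the package's C++ implementation; it is a simplified base R illustration in which each block's score distribution is computed exactly and blocks are combined after rounding scores into bins (the rounding being the source of approximation). All function names and the bin width are assumptions.

```r
# Toy position probability matrix (rows: A, C, G, T); values are made up.
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7,
                0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
pwm <- log2(ppm / bkg)

# Exact score distribution (score, probability) for one block of columns.
block_distribution <- function(block, bkg) {
  n <- ncol(block)
  idx <- as.matrix(expand.grid(rep(list(seq_len(nrow(block))), n)))
  cells <- cbind(as.vector(idx), as.vector(col(idx)))
  data.frame(score = rowSums(matrix(block[cells], ncol = n)),
             prob = apply(matrix(bkg[as.vector(idx)], ncol = n), 1, prod))
}

# Combine two blocks: scores add, probabilities multiply, then equal
# (binned) scores are merged so the table stays small.
combine <- function(d1, d2, bin = 0.01) {
  s <- round(as.vector(outer(d1$score, d2$score, "+")) / bin) * bin
  p <- as.vector(outer(d1$prob, d2$prob, "*"))
  agg <- tapply(p, s, sum)
  data.frame(score = as.numeric(names(agg)), prob = as.vector(agg))
}

approx_pvalue <- function(pwm, bkg, score, k = 2, bin = 0.01) {
  blocks <- split(seq_len(ncol(pwm)), ceiling(seq_len(ncol(pwm)) / k))
  dists <- lapply(blocks, function(cols) {
    block_distribution(pwm[, cols, drop = FALSE], bkg)
  })
  total <- Reduce(function(a, b) combine(a, b, bin), dists)
  sum(total$prob[total$score >= score])
}

p_approx <- approx_pvalue(pwm, bkg, sum(apply(pwm, 2, max)) - 0.1, k = 2)
```

Smaller k means smaller per-block enumerations (4^k sequences each) at the cost of more combination steps, which mirrors the speed versus accuracy tradeoff described above.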

To calculate a score from a p-value, all possible scores are calculated and the score at the (1 - pvalue) * 100 percentile is returned. When k < ncol(motif), the complete set of scores is instead approximated by randomly summing scores drawn from each subset. Keep in mind that the score calculator gives no consideration to background frequencies. Note that this approximation can be quite expensive and at times even slower than the exact version; for jobs requiring many repeated calculations, benchmarking beforehand can help find the optimal settings.
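
A rough base R sketch of the random approximation described above: scores are enumerated per subset, random draws from each subset are added together, and the requested percentile is taken and averaged over several tries. The helper approx_score(), the number of draws n_samples and the toy matrix are assumptions for illustration; the package's internal procedure may differ in its details.

```r
# Toy position probability matrix (rows: A, C, G, T); values are made up.
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7,
                0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
pwm <- log2(ppm / 0.25)

approx_score <- function(pwm, pvalue, k = 2, rand.tries = 10,
                         n_samples = 10000) {
  blocks <- split(seq_len(ncol(pwm)), ceiling(seq_len(ncol(pwm)) / k))
  # All possible scores within each block of k columns.
  block_scores <- lapply(blocks, function(cols) {
    unname(rowSums(expand.grid(lapply(cols, function(j) unname(pwm[, j])))))
  })
  one_try <- function() {
    # Draw scores from every block uniformly (no background weighting)
    # and add them up to approximate the full score distribution.
    draws <- lapply(block_scores, sample, size = n_samples, replace = TRUE)
    unname(quantile(Reduce(`+`, draws), probs = 1 - pvalue))
  }
  mean(replicate(rand.tries, one_try()))
}

set.seed(1)  # the sketch uses the R RNG, unlike the C++ code described above
score_approx <- approx_score(pwm, pvalue = 0.01)
```

Because every try involves n_samples draws per block, the cost scales with rand.tries and n_samples rather than with 4^ncol(motif), which is why the approximation can end up slower than exact enumeration for short motifs.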

To get an idea as to how the score calculator works (without approximation), try the following code with your motif (be careful with longer motifs):

quantile(get_scores(motif), probs = 0.99)
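
For readers without the package at hand, the same calculation can be sketched in base R. Here all_scores is a hypothetical stand-in for the output of get_scores(), assuming that function returns the score of every possible sequence; the toy matrix is made up.

```r
# Toy position probability matrix (rows: A, C, G, T); values are made up.
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7,
                0.7, 0.1, 0.1, 0.1,
                0.1, 0.1, 0.1, 0.7),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
pwm <- log2(ppm / 0.25)

# One score per possible sequence: every combination of one cell per column.
columns <- lapply(seq_len(ncol(pwm)), function(j) unname(pwm[, j]))
all_scores <- rowSums(expand.grid(columns))

n_sequences <- length(all_scores)  # 4^4 = 256 sequences
# Score threshold corresponding to a p-value of 0.01 (99th percentile).
threshold <- quantile(all_scores, probs = 0.99)
```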

Value

numeric A vector of scores/p-values.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

\insertRef{pvalues}{universalmotif}

See Also

motif_score()

Examples

if (R.Version()$arch != "i386") {

## P-value/score calculations are performed using the PWM version of the
## motif
data(examplemotif)

## Get a minimum score based on a p-value
motif_pvalue(examplemotif, pvalue = 0.001)

## Get the probability of a particular sequence hit
motif_pvalue(examplemotif, score = 0)

## The calculations can be performed for multiple motifs
motif_pvalue(list(examplemotif, examplemotif), pvalue = c(0.001, 0.0001))

## Compare score thresholds and P-value:
scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9))
motif_pvalue(examplemotif, scores)

## Calculate the probability of getting a certain match or better:
TATATAT <- score_match(examplemotif, "TATATAT")
TATATAG <- score_match(examplemotif, "TATATAG")
motif_pvalue(examplemotif, TATATAT)
motif_pvalue(examplemotif, TATATAG)

## Get all possible matches by P-value:
get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001))
}

universalmotif documentation built on April 8, 2021, 6 p.m.