motif_pvalue | R Documentation |
Calculates P-values from logodds scores, or logodds scores from P-values, for any number of motifs.
motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 8,
  nthreads = 1, rand.tries = 10, rng.seed = sample.int(10000, 1),
  allow.nonfinite = FALSE, method = c("dynamic", "exhaustive"))
motifs |
See |
score |
|
pvalue |
|
bkg.probs |
|
use.freq |
|
k |
|
nthreads |
|
rand.tries |
|
rng.seed |
|
allow.nonfinite |
|
method |
|
A note regarding vectorizing the calculation when method = "dynamic" (no vectorization is possible with method = "exhaustive"): to avoid performing the P-value/score calculation repeatedly for individual motifs, provide the score/pvalue arguments as a list, with each entry corresponding to the scores/P-values to be calculated for the respective motifs provided to motifs. If you simply provide a list of repeating motifs and a single numeric vector of corresponding input scores/P-values, then motif_pvalue() will not vectorize. See the Examples section.
One of the algorithms available to motif_pvalue() to calculate scores or P-values is the dynamic programming algorithm used by FIMO (Grant et al., 2011). In this method, a small range of possible scores between the possible minimum and maximum is created, and the cumulative probability of each score in this distribution is incrementally calculated using the logodds scores and the background probabilities. This distribution of scores and associated P-values can then be used to calculate P-values or scores for any input, any number of times. This method scales well with large motifs and multifreq representations. The only downside is that it is incompatible with allow.nonfinite = TRUE, as this would not allow for the creation of the initial range of scores. Although described for a different purpose, the basic premise of the dynamic programming algorithm is also described in Gupta et al. (2007).
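The dynamic programming idea described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used by motif_pvalue(): the function name dp_score_distribution(), its step argument, and the toy matrix are all hypothetical. Scores are discretized to integer steps, and the probability of each total score is built up one motif position at a time.

```r
# Hypothetical sketch (not the package implementation): score distribution of
# a logodds matrix via dynamic programming over discretized scores.
dp_score_distribution <- function(pwm, bkg, step = 0.001) {
  # pwm: logodds matrix, one row per letter, one column per position
  # bkg: background probability of each letter, in row order
  stopifnot(all(is.finite(pwm)), nrow(pwm) == length(bkg))
  ints <- round(pwm / step)              # discretize scores to integer steps
  col_min <- apply(ints, 2, min)
  shifted <- sweep(ints, 2, col_min)     # every column now has a minimum of 0
  max_total <- sum(apply(shifted, 2, max))
  pdf <- numeric(max_total + 1)          # pdf[i] = P(shifted score == i - 1)
  pdf[1] <- 1
  for (j in seq_len(ncol(shifted))) {
    new_pdf <- numeric(max_total + 1)
    for (a in seq_len(nrow(shifted))) {
      s <- shifted[a, j]
      idx <- seq_len(max_total + 1 - s)
      # adding letter a at position j shifts the distribution up by s
      new_pdf[idx + s] <- new_pdf[idx + s] + pdf[idx] * bkg[a]
    }
    pdf <- new_pdf
  }
  list(
    score = (seq_along(pdf) - 1 + sum(col_min)) * step,
    pvalue = rev(cumsum(rev(pdf)))       # P(score >= each tabulated score)
  )
}

# Toy 2-position motif: "A" is favoured at position 1, "C" at position 2
pwm <- matrix(c( 1, -1, -1, -1,
                -1,  1, -1, -1), nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- rep(0.25, 4)
tab <- dp_score_distribution(pwm, bkg, step = 1)
tab$pvalue[tab$score == 2]  # only "AC" scores 2, so this is 1/16
```

Once the table is built, any number of score or P-value lookups can reuse it, which is why this approach suits repeated queries.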
Calculating P-values exhaustively for motifs can be very computationally intensive. This is due to how P-values must be calculated: for a given score, all possible sequences which score equal or higher must be found, and the probability for each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences that score higher than a given score can be done efficiently with a branch-and-bound algorithm, but as the motif length increases even this calculation becomes impractical. To get around this, the P-value calculation can be approximated.
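To make the definition above concrete, the following base R sketch enumerates every possible sequence for a tiny motif and sums the background probabilities of those scoring at or above a threshold. It is a brute-force illustration only: exhaustive_pvalue() and the toy matrix are hypothetical, and no branch-and-bound pruning is attempted.

```r
# Hypothetical brute-force sketch: P-value of a score by full enumeration.
exhaustive_pvalue <- function(pwm, bkg, score) {
  n_letters <- nrow(pwm)
  # one row per possible sequence, holding the letter index at each position
  seqs <- as.matrix(expand.grid(rep(list(seq_len(n_letters)), ncol(pwm))))
  seq_score <- numeric(nrow(seqs))
  seq_prob <- rep(1, nrow(seqs))
  for (j in seq_len(ncol(pwm))) {
    seq_score <- seq_score + pwm[seqs[, j], j]  # add the logodds score
    seq_prob <- seq_prob * bkg[seqs[, j]]       # multiply background probs
  }
  sum(seq_prob[seq_score >= score])
}

# Toy 2-position motif: "A" is favoured at position 1, "C" at position 2
pwm <- matrix(c( 1, -1, -1, -1,
                -1,  1, -1, -1), nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- rep(0.25, 4)
exhaustive_pvalue(pwm, bkg, score = 2)  # only "AC" qualifies: 1/16
exhaustive_pvalue(pwm, bkg, score = 0)  # 7 of 16 sequences qualify
```

The enumeration grows as 4^n rows for a DNA motif of length n, which is the cost the paragraph above refers to.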
In order to calculate P-values for longer motifs, this function uses the approximation proposed by Hartmann et al. (2013), where the motif is split into subsets, P-values are calculated for the subsets, and these are finally combined into a total P-value. The smaller the subsets, the faster the calculation, but also the coarser the approximation. This can be controlled by setting k. In fact, for smaller motifs (< 13 positions), exact P-values can be calculated in reasonable time by setting k = 12.
To calculate a score from a P-value, all possible scores are calculated and the score at the (1 - pvalue) * 100th percentile is returned. When k < ncol(motif), the complete set of scores is instead approximated by randomly adding up all possible scores from each subset. Note that this approximation can itself be quite expensive, at times even slower than the exact version; for jobs requiring many repeated calculations, a bit of benchmarking beforehand can be useful to find the optimal settings.
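The percentile lookup and its sampled approximation might look roughly like the base R sketch below. All names here (all_scores(), approx_scores(), score_from_pvalue(), the n argument) are hypothetical, the random matrix merely stands in for a real logodds matrix, and the plain percentile treats every sequence as equally likely (a uniform background); the actual function may differ on each of these points.

```r
# Hypothetical sketch: score thresholds from P-values via percentiles.
all_scores <- function(pwm) {
  # every combination of one score per position, summed across positions
  grid <- expand.grid(lapply(seq_len(ncol(pwm)), function(j) pwm[, j]))
  rowSums(grid)
}

approx_scores <- function(pwm, k, n = 10000) {
  # split the positions into consecutive subsets of at most k columns
  groups <- split(seq_len(ncol(pwm)), ceiling(seq_len(ncol(pwm)) / k))
  per_subset <- lapply(groups, function(cols) {
    all_scores(pwm[, cols, drop = FALSE])
  })
  # randomly pair up scores drawn from each subset and add them together
  sampled <- lapply(per_subset, function(s) {
    s[sample.int(length(s), n, replace = TRUE)]
  })
  Reduce(`+`, sampled)
}

score_from_pvalue <- function(scores, pvalue) {
  unname(quantile(scores, probs = 1 - pvalue))
}

set.seed(42)
pwm <- matrix(round(rnorm(4 * 6), 2), nrow = 4)  # stand-in 6-position matrix
exact <- all_scores(pwm)                  # all 4^6 possible scores
approximated <- approx_scores(pwm, k = 3) # sampled from two 3-column subsets
score_from_pvalue(exact, 0.01)
score_from_pvalue(approximated, 0.01)
```

Comparing the two thresholds, and timing both calls, is one way to do the benchmarking suggested above before committing to a value of k.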
Please note that bugs are more likely to occur when using the exhaustive method, as the algorithm contains several times more code than the dynamic method. Unless you have a strong need to use allow.nonfinite = TRUE, avoid using this method.
numeric, list: A vector or list of vectors of scores/P-values.
Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca
Grant CE, Bailey TL, Noble WS (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics, 27, 1017-1018.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS (2007). "Quantifying similarity between motifs." Genome Biology, 8, R24.
Hartmann H, Guthohrlein EW, Siebert M, Luehr S, Soding J (2013). "P-value-based regulatory motif discovery using positional weight matrices." Genome Research, 23, 181-194.
get_matches(), get_scores(), motif_range(), motif_score(), prob_match(), prob_match_bkg(), score_match()
if (R.Version()$arch != "i386") {

  ## P-value/score calculations are performed using the PWM version of the
  ## motif
  data(examplemotif)

  ## Get a minimum score based on a P-value
  motif_pvalue(examplemotif, pvalue = 0.001)

  ## Get the probability of a particular sequence hit
  motif_pvalue(examplemotif, score = 0)

  ## The calculations can be performed for multiple motifs
  motif_pvalue(c(examplemotif, examplemotif), pvalue = c(0.001, 0.0001))

  ## Compare score thresholds and P-value:
  scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9))
  motif_pvalue(examplemotif, scores)

  ## Calculate the probability of getting a certain match or better:
  TATATAT <- score_match(examplemotif, "TATATAT")
  TATATAG <- score_match(examplemotif, "TATATAG")
  motif_pvalue(examplemotif, TATATAT)
  motif_pvalue(examplemotif, TATATAG)

  ## Get all possible matches by P-value:
  get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001))

  ## Vectorize the calculation for multiple motifs and scores/P-values:
  m <- create_motif()
  motif_pvalue(c(examplemotif, m), list(1:5, 2:3))

  ## The non-vectorized equivalent:
  motif_pvalue(
    c(rep(list(examplemotif), 5), rep(list(m), 2)), c(1:5, 2:3)
  )

}