View source: R/scan_sequences.R
scan_sequences | R Documentation |
For sequences of any alphabet, scan them using the PWM matrices of a set of input motifs.
scan_sequences(motifs, sequences, threshold = 1e-04,
threshold.type = c("pvalue", "qvalue", "logodds", "logodds.abs"),
RC = FALSE, use.freq = 1, verbose = 0, nthreads = 1,
motif_pvalue.k = 8, use.gaps = TRUE, allow.nonfinite = FALSE,
warn.NA = TRUE, calc.pvals = TRUE, return.granges = FALSE,
no.overlaps = FALSE, no.overlaps.by.strand = FALSE,
no.overlaps.strat = c("score", "order"), respect.strand = FALSE,
motif_pvalue.method = c("dynamic", "exhaustive"),
calc.qvals = calc.pvals, calc.qvals.method = c("fdr", "BH",
"bonferroni"))
motifs |
See |
sequences |
|
threshold |
|
threshold.type |
|
RC |
|
use.freq |
|
verbose |
|
nthreads |
|
motif_pvalue.k |
|
use.gaps |
|
allow.nonfinite |
|
warn.NA |
|
calc.pvals |
|
return.granges |
|
no.overlaps |
|
no.overlaps.by.strand |
|
no.overlaps.strat |
|
respect.strand |
|
motif_pvalue.method |
|
calc.qvals |
|
calc.qvals.method |
|
Similar to Biostrings::matchPWM()
, the scanning method uses
logodds scoring. (To see the scoring matrix for any motif, simply
run convert_type(motif, "PWM")
. For a multifreq
scoring
matrix: apply(motif["multifreq"][["2"]], 2, ppm_to_pwm)
). In order
to score a sequence, at each position within a sequence of length equal
to the length of the motif, the scores for each base are summed. If the
score sum is above the desired threshold, it is kept.
If threshold.type = 'logodds'
, then the threshold
value is multiplied
by the maximum possible motif scores. To calculate the
maximum possible scores a motif (of type PWM) manually, run
motif_score(motif, 1)
. If threshold.type = 'pvalue'
,
then threshold logodds scores are generated using motif_pvalue()
.
Finally, if threshold.type = 'logodds.abs'
, then the exact values
provided will be used as thresholds. Finally, if threshold.type = 'qvalue'
,
then the threshold is calculated as if threshold.type = 'pvalue'
and the
final set of hits are filtered based on their calculated Q-value. (Note:
this means that the thresh.score
column will be incorrect!) This is done
since most Q-values cannot be calculated prior to scanning. If you are
running a very large job, it may be wise to use a P-value threshold
followed by manually filtering by Q-value; this will avoid the scanning
have to parse the larger number of hits from the internally-lowered threshold.
Non-standard letters (such as "N", "+", "-", ".", etc in DNAString
objects) will be safely ignored, resulting only in a warning and a very
minor performance cost. This can used to scan
masked sequences. See Biostrings::mask()
for masking sequences
(generating MaskedXString
objects), and Biostrings::injectHardMask()
to recover masked XStringSet
objects for use with scan_sequences()
.
There is also a provided wrapper function which performs both steps: mask_seqs()
.
DataFrame
, GRanges
with each row representing one hit. If the input
sequences are DNAStringSet
or RNAStringSet
,
then an additional column with the strand is included. Function args are
stored in the metadata
slot. If return.granges = TRUE
then a GRanges
object is returned.
Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca
Grant CE, Bailey TL, Noble WS (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics, 27, 1017-1018.
Hartmann H, Guthohrlein EW, Siebert M, Soding SLJ (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181-194.
Noble WS (2009). "How does multiple testing work?" Nature Biotechnology, 27, 1135-1137.
add_multifreq()
, Biostrings::matchPWM()
,
enrich_motifs()
, motif_pvalue()
## any alphabet can be used
## Not run:
set.seed(1)
alphabet <- paste(c(letters), collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)
## End(Not run)
## Sequence masking:
if (R.Version()$arch != "i386") {
library(Biostrings)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
seq <- mask_seqs(ArabidopsisPromoters, "AAAAA")
scan_sequences(ArabidopsisMotif, seq)
# A warning regarding the presence of non-standard letters will be given,
# but can be safely ignored in this case.
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.