Scan sequences of any alphabet using the PWMs of a set of input motifs.
motifs
sequences
threshold
threshold.type
RC
use.freq
verbose
nthreads
motif_pvalue.k
use.gaps
allow.nonfinite
warn.NA
calc.pvals
Similar to Biostrings::matchPWM(), the scanning method uses logodds scoring. (To see the scoring matrix for any motif, simply run convert_type(motif, "PWM"). For a multifreq scoring matrix: apply(motif["multifreq"][["2"]], 2, ppm_to_pwm).) To score a sequence, the scores of the letters in each window of length equal to the length of the motif are summed, one score per motif position. If the score sum is above the desired threshold, the match is kept.
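The window-scoring idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy `pwm` values, the helper `score_windows()`, and the choice to keep ties (`>=`) are all hypothetical.

```r
# Hypothetical toy PWM (logodds scores): rows are letters, columns are motif positions.
pwm <- matrix(
  c( 1.5, -2.0, -2.0,
    -2.0,  1.5, -2.0,
    -2.0, -2.0,  1.5,
    -2.0, -2.0, -2.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Hypothetical helper: sum the per-position scores for every window
# that is as wide as the motif.
score_windows <- function(pwm, sequence) {
  chars <- strsplit(sequence, "")[[1]]
  width <- ncol(pwm)
  starts <- seq_len(max(0, length(chars) - width + 1))
  vapply(starts, function(i) {
    window <- chars[i:(i + width - 1)]
    # Pick one score per motif position: (row of the letter, column of the position).
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(width))])
  }, numeric(1))
}

scores <- score_windows(pwm, "TACGT")
# Keep windows at or above a threshold (keeping ties is an assumption here).
hits <- which(scores >= 4)
```

Here `scores` holds one summed score per window start, and `hits` holds the start positions of the windows that pass the threshold.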
If threshold.type = 'logodds', then the threshold value is multiplied by the maximum possible motif score. To calculate the maximum possible score of a motif (of type PWM) manually, run motif_score(motif, 1). If threshold.type = 'pvalue', then threshold logodds scores are generated using motif_pvalue(). Finally, if threshold.type = 'logodds.abs', then the exact values provided will be used as thresholds.
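As a rough illustration of how a relative 'logodds' threshold turns into an absolute score, the following base R sketch assumes that the maximum possible score of a PWM is the sum of its per-column maxima. The toy `pwm` and the variable names are hypothetical, and the package functions themselves are not called.

```r
# Hypothetical toy PWM (logodds scores): rows are letters, columns are motif positions.
pwm <- matrix(
  c( 1.5, -2.0, -2.0,
    -2.0,  1.5, -2.0,
    -2.0, -2.0,  1.5,
    -2.0, -2.0, -2.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Assumption: the best achievable score takes the highest-scoring letter at each position.
max_score <- sum(apply(pwm, 2, max))

# threshold.type = 'logodds': the threshold is a fraction of the maximum score.
rel_threshold <- 0.8
abs_threshold <- rel_threshold * max_score

# threshold.type = 'logodds.abs': the provided value is used unchanged.
abs_threshold_direct <- 3.6
```

Under these assumptions both settings describe the same cutoff for this toy motif; with real motifs, motif_score(motif, 1) is the documented way to obtain the maximum.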
Non-standard letters (such as "N", "+", "-", ".", etc. in DNAString objects) will be safely ignored, resulting only in a warning and a very minor performance cost. This can be used to scan masked sequences. See Biostrings::mask() for masking sequences (generating MaskedXString objects), and Biostrings::injectHardMask() to recover masked XStringSet objects for use with scan_sequences(). A wrapper function that performs both steps is also provided: mask_seqs().
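One way to picture how masked or non-standard letters can be ignored is to treat any window containing a letter absent from the scoring matrix as unscorable. The base R sketch below does exactly that; it is an assumption-laden illustration (the toy `pwm`, the helper `score_windows_masked()`, and the use of `NA` are all hypothetical), not a description of the package internals.

```r
# Hypothetical two-letter PWM: rows are letters, columns are motif positions.
pwm <- matrix(c(1, -1, -1, 1), nrow = 2,
              dimnames = list(c("A", "C"), NULL))

# Hypothetical helper: windows containing a letter that is not in the
# PWM (for example a "-" left behind by hard masking) get NA instead of a score.
score_windows_masked <- function(pwm, sequence) {
  chars <- strsplit(sequence, "")[[1]]
  width <- ncol(pwm)
  starts <- seq_len(max(0, length(chars) - width + 1))
  vapply(starts, function(i) {
    rows <- match(chars[i:(i + width - 1)], rownames(pwm))
    if (anyNA(rows)) return(NA_real_)
    sum(pwm[cbind(rows, seq_len(width))])
  }, numeric(1))
}

scores <- score_windows_masked(pwm, "AC-AC")
# which() drops NA entries, so masked windows can never be reported as hits.
hits <- which(scores >= 2)
```

The two windows overlapping the "-" are skipped, while the unmasked windows on either side are still scored.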
When calc.pvals = TRUE, motif_pvalue() will calculate the probabilities of getting the input scores or higher, which is why it can take time to calculate the P-values. If you simply wish to calculate the probabilities of getting individual matches based on background frequencies, the following code can be used (using the list of input motifs and the scan_sequences() results): mapply(prob_match, motifs[scanRes$motif.i], scanRes$match). Of course, this only matters if you do not have uniform background frequencies; otherwise the probability of each match is simply (1 / nrow(motif))^ncol(motif).
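The relationship between background frequencies and match probability can be checked with a small base R sketch. It assumes that letters are drawn independently, so the probability of a match is the product of the background frequencies of its letters; the helper `match_prob()` and the frequency vectors are hypothetical stand-ins and do not call prob_match().

```r
# Hypothetical helper: probability of seeing an exact match by chance,
# assuming each letter is drawn independently from the background.
match_prob <- function(match, bkg) {
  chars <- strsplit(match, "")[[1]]
  prod(bkg[chars])
}

# Non-uniform background: the probability depends on which letters are in the match.
bkg_skewed <- c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)
p_skewed <- match_prob("ACGT", bkg_skewed)

# Uniform background: every match of the same length has the same probability,
# which for a 4-letter alphabet and a 4-letter match is (1 / 4)^4.
bkg_uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
p_uniform <- match_prob("ACGT", bkg_uniform)
```

With the uniform background, `p_uniform` agrees with the (1 / nrow(motif))^ncol(motif) shortcut, which is why the per-match calculation is only informative for non-uniform backgrounds.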
DataFrame with each row representing one hit. If the input sequences are DNAStringSet or RNAStringSet, then an additional column with the strand is included. Function args are stored in the metadata slot.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
add_multifreq(), Biostrings::matchPWM(), enrich_motifs(), motif_pvalue()
## any alphabet can be used
## Not run:
set.seed(1)
alphabet <- paste(c(letters), collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)
## End(Not run)
## Sequence masking:
if (R.Version()$arch != "i386") {
  library(Biostrings)
  data(ArabidopsisMotif)
  data(ArabidopsisPromoters)
  seq <- mask_seqs(ArabidopsisPromoters, "AAAAA")
  scan_sequences(ArabidopsisMotif, seq)
  # A warning regarding the presence of non-standard letters will be given,
  # but can be safely ignored in this case.
}
## Converting results to a GRanges object:
## Not run:
res <- scan_sequences(ArabidopsisMotif, seq)
library(GenomicRanges)
makeGRangesFromDataFrame(res, seqnames.field = "sequence",
                         keep.extra.columns = TRUE)
## End(Not run)