scan_sequences: Scan sequences for matches to input motifs.
In universalmotif: Import, Modify, and Export Motifs with R


Scan sequences of any alphabet using the PWMs of a set of input motifs.

scan_sequences(motifs, sequences, threshold = 0.001,
  threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 0,
  nthreads = 1, motif_pvalue.k = 8, use.gaps = TRUE,
  allow.nonfinite = FALSE, warn.NA = TRUE, calc.pvals = FALSE)

`motifs`	See `convert_motifs()` for acceptable motif formats.
`sequences`	`XStringSet` Sequences to scan. Alphabet should match motif.
`threshold`	`numeric(1)` See details.
`threshold.type`	`character(1)` One of `c('logodds', 'logodds.abs', 'pvalue')`. See details.
`RC`	`logical(1)` If `TRUE`, check reverse complement of input sequences.
`use.freq`	`numeric(1)` The default, 1, uses the motif matrix (from the `motif['motif']` slot) to search for sequences. If a higher number is used, then the matching k-let matrix from the `motif['multifreq']` slot is used. See `add_multifreq()`.
`verbose`	`numeric(1)` Describe progress, from none (`0`) to verbose (`3`).
`nthreads`	`numeric(1)` Run `scan_sequences()` in parallel with `nthreads` threads. `nthreads = 0` uses all available threads. Note that no speed up will occur for jobs with only a single motif and sequence.
`motif_pvalue.k`	`numeric(1)` Control `motif_pvalue()` approximation. See `motif_pvalue()`.
`use.gaps`	`logical(1)` Set this to `FALSE` to ignore motif gaps, if present.
`allow.nonfinite`	`logical(1)` If `FALSE`, then apply a pseudocount if non-finite values are found in the PWM. Note that if the motif has a pseudocount greater than zero and the motif is not currently of type PWM, then this parameter has no effect as the pseudocount will be applied automatically when the motif is converted to a PWM internally. This value is set to `FALSE` by default in order to stay consistent with pre-version 1.8.0 behaviour.
`warn.NA`	`logical(1)` Whether to warn about the presence of non-standard letters in the input sequence, such as those in masked sequences.
`calc.pvals`	`logical(1)` Calculate P-values for each hit. This is a convenience option which simply gives `motif_pvalue()` the input motifs and the scores of each hit. Be careful about setting this to `TRUE` if you anticipate getting thousands of hits: expect to wait a few seconds or minutes for the calculations to finish. Increasing the `nthreads` value can help greatly here. See Details for more information on P-value calculation.

Similar to Biostrings::matchPWM(), the scanning method uses logodds scoring. To see the scoring matrix for any motif, run convert_type(motif, "PWM"); for a multifreq scoring matrix, run apply(motif["multifreq"][["2"]], 2, ppm_to_pwm). To score a sequence, a window equal in length to the motif is moved along it one position at a time, and at each window position the scores of the letters found at each motif position are summed. If the summed score is above the desired threshold, the hit is kept.
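
The windowed summation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy matrix, the uniform background, and the helper name score_windows() are all assumptions made for the example, and pseudocounts are ignored.

```r
# Toy position probability matrix: rows are letters, columns are motif positions.
# (Hypothetical values; the consensus is "ACG".)
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.7, 0.1, 0.1,
                0.1, 0.1, 0.7, 0.1),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)

# Logodds scoring matrix: log2(probability / background) for each letter.
pwm <- log2(ppm / bkg)

# Hypothetical helper: score every motif-length window of one sequence.
score_windows <- function(pwm, sequence) {
  chars <- strsplit(sequence, "")[[1]]
  w <- ncol(pwm)
  starts <- seq_len(max(0, length(chars) - w + 1))
  vapply(starts, function(i) {
    rows <- match(chars[i:(i + w - 1)], rownames(pwm))
    # Sum the score of the observed letter at each motif position.
    sum(pwm[cbind(rows, seq_len(w))])
  }, numeric(1))
}

scores <- score_windows(pwm, "TTACGTT")
hits <- which(scores > 4)  # start positions of windows scoring above the threshold
```

Here only the window starting at position 3 ("ACG") scores above the chosen cutoff of 4.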

If threshold.type = 'logodds', then the threshold value is multiplied by the maximum possible motif score to obtain the score cutoff. To calculate the maximum possible score of a motif (of type PWM) manually, run motif_score(motif, 1). If threshold.type = 'pvalue', then threshold logodds scores are generated using motif_pvalue(). Finally, if threshold.type = 'logodds.abs', then the exact values provided will be used as thresholds.
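
As a rough base R sketch of how the three threshold types relate to a score cutoff, consider the toy motif below. The matrix and variable names are assumptions for illustration. In particular, the P-value cutoff is found here by exhaustively enumerating every possible match, which is only feasible for very short motifs and is not how motif_pvalue() performs its approximation.

```r
# Toy logodds matrix (hypothetical values, uniform background, no pseudocount).
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.7, 0.1, 0.1,
                0.1, 0.1, 0.7, 0.1),
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
pwm <- log2(ppm / bkg)

# Maximum possible score: the best letter at every position.
max_score <- sum(apply(pwm, 2, max))

# 'logodds': the threshold is a fraction of the maximum score.
cutoff_relative <- 0.8 * max_score

# 'logodds.abs': the threshold is used as given.
cutoff_absolute <- 4

# 'pvalue': find the lowest score whose probability of being reached or
# exceeded under the background is at most p (brute-force enumeration).
words <- as.matrix(expand.grid(rep(list(seq_len(nrow(pwm))), ncol(pwm))))
positions <- seq_len(ncol(pwm))
word_scores <- round(apply(words, 1, function(idx) sum(pwm[cbind(idx, positions)])), 8)
word_probs <- apply(words, 1, function(idx) prod(bkg[idx]))

score_for_pvalue <- function(p) {
  candidates <- sort(unique(word_scores), decreasing = TRUE)
  tail_prob <- vapply(candidates,
                      function(s) sum(word_probs[word_scores >= s]),
                      numeric(1))
  passing <- candidates[tail_prob <= p]
  if (length(passing) == 0) NA_real_ else min(passing)
}

cutoff_pvalue <- score_for_pvalue(0.02)
```

With this three-position motif, only the perfect match (probability 1/64) satisfies p = 0.02, so the P-value cutoff coincides with the maximum score.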

Non-standard letters (such as "N", "+", "-", ".", etc. in DNAString objects) will be safely ignored, resulting only in a warning and a very minor performance cost. This can be used to scan masked sequences. See Biostrings::mask() for masking sequences (generating MaskedXString objects), and Biostrings::injectHardMask() to recover masked XStringSet objects for use with scan_sequences(). There is also a provided wrapper function which performs both steps: mask_seqs().

When calc.pvals = TRUE, motif_pvalue() will calculate the probabilities of getting the input scores or higher, which is why it can take time to calculate the P-values. If you simply wish to calculate the probabilities of getting individual matches based on background frequencies, then the following code can be used to achieve this (using the list of input motifs and scan_sequences() results): mapply(prob_match, motifs[scanRes$motif.i], scanRes$match). Of course, this only matters if the background frequencies are not uniform; otherwise the probability of each match is simply (1 / nrow(motif))^ncol(motif).
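
To make the distinction concrete, the probability of an individual match under the background can be written out by hand. The sketch below is a base R stand-in, not prob_match() itself: the helper name match_prob() and the background values are assumptions, and positions are treated as independent draws from the background frequencies.

```r
# Hypothetical helper: probability of observing one exact match string,
# assuming each letter is drawn independently from the background.
match_prob <- function(match, bkg) {
  chars <- strsplit(match, "")[[1]]
  prod(bkg[chars])
}

# Non-uniform background: the probability depends on the letters in the match.
skewed <- c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)
p_skewed <- match_prob("ACG", skewed)    # 0.3 * 0.2 * 0.2

# Uniform background: every match of the same length is equally likely,
# which reduces to (1 / number of letters)^(motif length).
uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
p_uniform <- match_prob("ACG", uniform)  # (1 / 4)^3
```

Note that this is the probability of one specific match, whereas the P-value from motif_pvalue() is the probability of getting that match's score or higher.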

DataFrame with each row representing one hit. If the input sequences are DNAStringSet or RNAStringSet, then an additional column with the strand is included. Function arguments are stored in the metadata slot.

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

add_multifreq(), Biostrings::matchPWM(), enrich_motifs(), motif_pvalue()

## any alphabet can be used
## Not run: 
set.seed(1)
alphabet <- paste(letters, collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)

## End(Not run)

## Sequence masking:
if (R.Version()$arch != "i386") {
  library(Biostrings)
  data(ArabidopsisMotif)
  data(ArabidopsisPromoters)
  seq <- mask_seqs(ArabidopsisPromoters, "AAAAA")
  scan_sequences(ArabidopsisMotif, seq)
  # A warning regarding the presence of non-standard letters will be given,
  # but can be safely ignored in this case.
}

## Converting results to a GRanges object:
## Not run: 
res <- scan_sequences(ArabidopsisMotif, seq)
library(GenomicRanges)
makeGRangesFromDataFrame(res, seqnames.field = "sequence",
  keep.extra.columns = TRUE)

## End(Not run)