scan_sequences: Scan sequences for matches to input motifs.

View source: R/scan_sequences.R

Description

Scan sequences of any alphabet using the PWMs (position weight matrices) of a set of input motifs.

Usage

scan_sequences(motifs, sequences, threshold = 0.001,
  threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 0,
  nthreads = 1, motif_pvalue.k = 8, use.gaps = TRUE,
  allow.nonfinite = FALSE, warn.NA = TRUE, calc.pvals = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

sequences

XStringSet Sequences to scan. Alphabet should match motif.

threshold

numeric(1) See details.

threshold.type

character(1) One of c('logodds', 'logodds.abs', 'pvalue'). See details.

RC

logical(1) If TRUE, check reverse complement of input sequences.

use.freq

numeric(1) The default, 1, uses the motif matrix (from the motif['motif'] slot) to search for sequences. If a higher number is used, then the matching k-let matrix from the motif['multifreq'] slot is used. See add_multifreq().

verbose

numeric(1) Describe progress, from none (0) to verbose (3).

nthreads

numeric(1) Run scan_sequences() in parallel with nthreads threads. nthreads = 0 uses all available threads. Note that no speed-up will occur for jobs with only a single motif and a single sequence.

motif_pvalue.k

numeric(1) Control motif_pvalue() approximation. See motif_pvalue().

use.gaps

logical(1) Set this to FALSE to ignore motif gaps, if present.

allow.nonfinite

logical(1) If FALSE, then apply a pseudocount if non-finite values are found in the PWM. Note that if the motif has a pseudocount greater than zero and the motif is not currently of type PWM, then this parameter has no effect as the pseudocount will be applied automatically when the motif is converted to a PWM internally. This value is set to FALSE by default in order to stay consistent with pre-version 1.8.0 behaviour.

warn.NA

logical(1) Whether to warn about the presence of non-standard letters in the input sequence, such as those in masked sequences.

calc.pvals

logical(1) Calculate P-values for each hit. This is a convenience option which simply gives motif_pvalue() the input motifs and the scores of each hit. Be careful about setting this to TRUE if you anticipate getting thousands of hits: expect to wait a few seconds or minutes for the calculations to finish. Increasing the nthreads value can help greatly here. See Details for more information on P-value calculation.

Details

Similar to Biostrings::matchPWM(), the scanning method uses logodds scoring. To see the scoring matrix for any motif, run convert_type(motif, "PWM"); for a multifreq scoring matrix, run apply(motif["multifreq"][["2"]], 2, ppm_to_pwm). To score a sequence, a window as wide as the motif is placed at each position in the sequence, and the scores of the letters within that window are summed. If the summed score is above the desired threshold, the hit is kept.
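
The windowed scoring described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the matrix `ppm`, the uniform background `bkg`, the helper `score_windows()`, and the cutoff of 0 are all hypothetical.

```r
# Illustration only: a hypothetical 3-position DNA motif, not package code.
ppm <- matrix(c(0.7, 0.1, 0.1, 0.1,   # position 1 favours A
                0.1, 0.7, 0.1, 0.1,   # position 2 favours C
                0.1, 0.1, 0.7, 0.1),  # position 3 favours G
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
bkg <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)

# Log-odds scoring matrix: log2(probability / background frequency).
pwm <- log2(ppm / bkg)

# Sum the per-position scores of every motif-width window in the sequence.
score_windows <- function(pwm, sequence) {
  chars <- strsplit(sequence, "")[[1]]
  width <- ncol(pwm)
  starts <- seq_len(length(chars) - width + 1)
  vapply(starts, function(i) {
    rows <- match(chars[i:(i + width - 1)], rownames(pwm))
    sum(pwm[cbind(rows, seq_len(width))])
  }, numeric(1))
}

scores <- score_windows(pwm, "TTACGTT")
hits <- which(scores > 0)  # keep windows scoring above an assumed cutoff of 0
```

Here `scores` holds one value per window, and `hits` gives the start positions of the windows that pass the cutoff.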

If threshold.type = 'logodds', then the threshold value is multiplied by the maximum possible motif score. To calculate the maximum possible score of a motif (of type PWM) manually, run motif_score(motif, 1). If threshold.type = 'pvalue', then threshold logodds scores are generated using motif_pvalue(). Finally, if threshold.type = 'logodds.abs', then the exact values provided will be used as thresholds.
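
As a rough base R sketch of the difference between the 'logodds' and 'logodds.abs' threshold types: the matrix `pwm` and the values 0.8 and 3.5 are hypothetical, and the maximum score is computed by hand here rather than with motif_score().

```r
# Illustration only: hypothetical log-odds matrix for a 3-position DNA motif,
# built from probabilities and a uniform background of 0.25 per letter.
pwm <- log2(matrix(c(0.7, 0.1, 0.1, 0.1,
                     0.1, 0.7, 0.1, 0.1,
                     0.1, 0.1, 0.7, 0.1),
                   nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL)) / 0.25)

# Best achievable score: the highest-scoring letter at every position.
max_score <- sum(apply(pwm, 2, max))

# threshold.type = "logodds": the threshold is a fraction of the maximum score.
cutoff_relative <- 0.8 * max_score

# threshold.type = "logodds.abs": the threshold is used exactly as given.
cutoff_absolute <- 3.5
```

A relative threshold therefore adapts to each motif's length and information content, while an absolute threshold applies the same cutoff to every motif.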

Non-standard letters (such as "N", "+", "-", ".", etc. in DNAString objects) will be safely ignored, resulting only in a warning and a very minor performance cost. This can be used to scan masked sequences. See Biostrings::mask() for masking sequences (generating MaskedXString objects), and Biostrings::injectHardMask() to recover masked XStringSet objects for use with scan_sequences(). There is also a provided wrapper function which performs both steps: mask_seqs().

When calc.pvals = TRUE, motif_pvalue() calculates the probability of getting each input score or higher, which is why calculating the P-values can take time. If you simply wish to calculate the probabilities of individual matches based on background frequencies, use the following code with the list of input motifs and the scan_sequences() results: mapply(prob_match, motifs[scanRes$motif.i], scanRes$match). This only matters if the background frequencies are not uniform; otherwise the probability of each match is simply (1 / nrow(motif))^ncol(motif).
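
To make the background-frequency point concrete, here is a minimal base R sketch. The helper `match_probability()` is hypothetical and simply multiplies the background frequency of each letter in a match; it stands in for the idea only and is not claimed to reproduce prob_match() exactly.

```r
# Illustration only: probability of observing a given match by chance,
# assuming independent positions drawn from the background frequencies.
match_probability <- function(match, bkg) {
  prod(bkg[strsplit(match, "")[[1]]])
}

uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
skewed  <- c(A = 0.40, C = 0.10, G = 0.10, T = 0.40)

p_uniform <- match_probability("ACG", uniform)  # same as (1 / 4)^3
p_skewed  <- match_probability("ACG", skewed)   # depends on the letters
```

With a uniform background every 3-letter match has the same probability, whereas a skewed background makes matches rich in rare letters less likely.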

Value

DataFrame with each row representing one hit. If the input sequences are DNAStringSet or RNAStringSet, then an additional column with the strand is included. Function arguments are stored in the metadata slot.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

See Also

add_multifreq(), Biostrings::matchPWM(), enrich_motifs(), motif_pvalue()

Examples

## any alphabet can be used
## Not run: 
set.seed(1)
alphabet <- paste(c(letters), collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)

## End(Not run)

## Sequence masking:
if (R.Version()$arch != "i386") {
  library(Biostrings)
  data(ArabidopsisMotif)
  data(ArabidopsisPromoters)
  seq <- mask_seqs(ArabidopsisPromoters, "AAAAA")
  scan_sequences(ArabidopsisMotif, seq)
  # A warning regarding the presence of non-standard letters will be given,
  # but can be safely ignored in this case.
}

## Converting results to a GRanges object:
## Not run: 
res <- scan_sequences(ArabidopsisMotif, seq)
library(GenomicRanges)
makeGRangesFromDataFrame(res, seqnames.field = "sequence",
  keep.extra.columns = TRUE)

## End(Not run)

universalmotif documentation built on April 8, 2021, 6 p.m.