Low-level matching functions
Description
In this man page we define precisely and illustrate what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, most of them will adhere to the definition provided here.
hasLetterAt checks whether a sequence or set of sequences has the specified letters at the specified positions.
neditAt, isMatchingAt and which.isMatchingAt are low-level matching functions that only look for matches at the specified positions in the subject.
Usage
hasLetterAt(x, letter, at, fixed=TRUE)

## neditAt() and related utils:
neditAt(pattern, subject, at=1,
        with.indels=FALSE, fixed=TRUE)
neditStartingAt(pattern, subject, starting.at=1,
        with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1,
        with.indels=FALSE, fixed=TRUE)

## isMatchingAt() and related utils:
isMatchingAt(pattern, subject, at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingStartingAt(pattern, subject, starting.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)

## which.isMatchingAt() and related utils:
which.isMatchingAt(pattern, subject, at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingStartingAt(pattern, subject, starting.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingEndingAt(pattern, subject, ending.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)

Arguments
x 
A character vector, or an XString or XStringSet object. 
letter 
A character string or an XString object containing the letters to check. 
at, starting.at, ending.at 
An integer vector specifying the starting (for For the 
pattern 
The pattern string (but see 
subject 
A character vector, or an XString or XStringSet object containing the subject sequence(s). 
max.mismatch, min.mismatch 
Integer vectors of length >= 1 recycled to the length of the

with.indels 
See details below. 
fixed 
Only with a DNAString or RNAString-based subject can a
If

follow.index 
Whether the single integer returned by 
auto.reduce.pattern 
Whether 
Details
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (also known as the Levenshtein distance): the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
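The two distances can be illustrated with plain character strings in base R. This is only a sketch of the definitions, not the code Biostrings runs: mismatch_count and edit_distance are hypothetical helper names, and the edit distance is delegated to utils::adist, which computes a Levenshtein distance with unit costs by default.

```r
# Hypothetical helpers illustrating the two distances on plain strings.

# Distance 1: number of mismatching letters (same-length strings only).
mismatch_count <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

# Distance 2: edit (Levenshtein) distance, via base R's utils::adist().
edit_distance <- function(p, s) {
  as.integer(utils::adist(p, s))
}

mismatch_count("AT", "GT")          # 1: only the first letter differs
mismatch_count("ABCDEF", "xCDEFx")  # 6: every position differs
edit_distance("ABCDEF", "xCDEF")    # 2: delete "A", substitute "B" -> "x"
```

The last two calls compare the same pattern against overlapping regions of a subject: allowing indels lets a shorter substring count as much closer to the pattern.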
The neditAt function implements these two distances. If with.indels is FALSE (the default), then the first distance is used, i.e. neditAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at each position specified in at (note that neditAt is vectorized, so a long vector of integers can be passed through the at argument).
If with.indels is TRUE, then the "edit distance" is used: for each position specified in at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P), so in practice P only needs to be compared to a small number of substrings for every starting position.
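The behavior described above can be mimicked on plain strings with a short base R sketch. nedit_at_sketch is a hypothetical name, not a Biostrings function. Two assumptions are made: positions that fall outside the subject are counted as mismatches when with.indels is FALSE, and starting positions lie inside the subject when with.indels is TRUE.

```r
# Hypothetical re-implementation of the neditAt() semantics on plain strings.
nedit_at_sketch <- function(pattern, subject, at = 1L, with.indels = FALSE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  vapply(at, function(start) {
    if (!with.indels) {
      # Compare P to the window of length(P) letters starting at 'start'.
      idx <- start + seq_along(p) - 1L
      in_bounds <- idx >= 1L & idx <= length(s)
      window <- rep(NA_character_, length(p))
      window[in_bounds] <- s[idx[in_bounds]]
      # Assumption: out-of-bounds positions count as mismatches.
      sum(is.na(window) | window != p)
    } else {
      stopifnot(start >= 1L, start <= length(s))
      # Only substrings shorter than 2 * length(P) need to be tried.
      max_width <- min(2L * length(p) - 1L, length(s) - start + 1L)
      candidates <- substring(subject, start, start + seq_len(max_width) - 1L)
      as.integer(min(utils::adist(pattern, candidates)))
    }
  }, integer(1))
}

nedit_at_sketch("AT", "GTATA", at = 0:5)
nedit_at_sketch("ABCDEF", "ABCDEFxxxCDEFxxxABBCDE", at = 9)
nedit_at_sketch("ABCDEF", "ABCDEFxxxCDEFxxxABBCDE", at = 9, with.indels = TRUE)
```

The with.indels branch simply takes the minimum edit distance over every candidate substring, which is adequate for illustration but not tuned for speed.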
Value
hasLetterAt: A logical matrix with one row per element in x and one column per letter/position to check. When a specified position is invalid with respect to an element in x, the corresponding matrix element is set to NA.
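The shape of this result can be sketched in base R on a plain character vector. has_letter_at_sketch is a hypothetical helper that only mirrors the description above (one row per sequence, one column per letter/position pair, NA for invalid positions). It ignores the fixed argument and is not the Biostrings implementation.

```r
# Hypothetical sketch of the hasLetterAt() return value on plain strings.
has_letter_at_sketch <- function(x, letter, at) {
  letters_to_check <- strsplit(letter, "")[[1]]
  stopifnot(length(letters_to_check) == length(at))
  rows <- lapply(x, function(s) {
    found <- substring(s, at, at)
    valid <- at >= 1 & at <= nchar(s)
    # NA where the position does not exist in this sequence.
    ifelse(valid, found == letters_to_check, NA)
  })
  do.call(rbind, rows)
}

x <- c("AAACGT", "AACGT", "ACGT", "TAGGA")
m <- has_letter_at_sketch(x, "AAAAAA", 1:6)
m[2, 6]  # NA: "AACGT" has no position 6

# Which elements have an A at position 2 and a G at position 4?
q1 <- has_letter_at_sketch(x, "AG", c(2, 4))
which(rowSums(q1) == 2)  # 2 4
```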
neditAt: If subject is an XString object, then return an integer vector of the same length as at. If subject is an XStringSet object, then return an integer matrix with length(at) rows and length(subject) columns, one column per element of subject.
neditStartingAt is identical to neditAt except that the at argument is now called starting.at.
neditEndingAt is similar to neditAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.
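When indels are not allowed, the compared window always has the same length as the pattern, so an ending position maps to a starting position by simple arithmetic. The base R sketch below works on plain strings and assumes this fixed-width correspondence. It is an illustration of the relationship, not a description of how Biostrings computes it.

```r
# Relating ending.at to starting.at for a fixed-width (no indels) comparison.
pattern <- "ABCDEF"
subject <- "ABCDEFxxxCDEFxxxABBCDE"

ending_at <- 22L
starting_at <- ending_at - nchar(pattern) + 1L        # 17
window <- substring(subject, starting_at, ending_at)  # "ABBCDE"

# Number of mismatching letters between the pattern and that window.
n_mismatch <- sum(strsplit(window, "")[[1]] != strsplit(pattern, "")[[1]])
n_mismatch  # 4
```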
isMatchingAt: If subject is an XString object, then return the logical vector defined by:
  min.mismatch <= neditAt(...) <= max.mismatch

If subject is an XStringSet object, then return a logical matrix with length(at) rows and length(subject) columns, one column per element of subject.
isMatchingStartingAt is identical to isMatchingAt except that the at argument is now called starting.at.
isMatchingEndingAt is similar to isMatchingAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.
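The double inequality used to define the isMatchingAt result is mathematical notation. Chained comparisons are not valid R, so in code it has to be written as two comparisons joined with &. The sketch below uses a hard-coded, hypothetical vector of edit counts standing in for the output of neditAt at six positions.

```r
# Hypothetical edit counts for six starting positions (stand-in for neditAt()).
nedit <- c(2L, 1L, 2L, 0L, 2L, 1L)

min.mismatch <- 0L
max.mismatch <- 1L

# "min.mismatch <= nedit <= max.mismatch", written as valid R.
is_matching <- min.mismatch <= nedit & nedit <= max.mismatch
is_matching       # FALSE TRUE FALSE TRUE FALSE TRUE
sum(is_matching)  # 3 positions qualify as matches
```

Raising min.mismatch above 0 excludes exact matches, which is one way to select only inexact hits.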
which.isMatchingAt: The default behavior (follow.index=FALSE) is as follows. If subject is an XString object, then return a single integer. If subject is an XStringSet object, then return an integer vector.
If follow.index=TRUE
, then the returned value is defined by:
1 2 
which.isMatchingStartingAt is identical to which.isMatchingAt except that the at argument is now called starting.at.
which.isMatchingEndingAt is similar to which.isMatchingAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.
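The difference between an index into at and a value taken from at can be shown with plain vectors. The sketch below rests on an assumption about the semantics: that the default result is the index in at of the first matching position, and that follow.index=TRUE returns the corresponding value of at instead. The at and is_matching vectors are hypothetical inputs, not Biostrings output.

```r
# Hypothetical positions and match flags (stand-in for isMatchingAt()).
at <- c(5L, 3L, 1L)
is_matching <- c(FALSE, TRUE, TRUE)

# Assumed default (follow.index=FALSE): index in 'at' of the first match.
first_index <- which(is_matching)[1]
first_index      # 2

# Assumed follow.index=TRUE: the value of 'at' found at that index.
at[first_index]  # 3
```

The two results coincide only when at is 1, 2, 3, ..., which is why the distinction matters for arbitrary position vectors.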
See Also
nucleotideFrequencyAt, matchPattern, matchPDict, matchLRPatterns, trimLRPatterns, IUPAC_CODE_MAP, XString-class, align-utils
Examples
##
## hasLetterAt()
##
x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
hasLetterAt(x, "AAAAAA", 1:6)

## hasLetterAt() can be used to answer questions like: "which elements
## in 'x' have an A at position 2 and a G at position 4?"
q1 <- hasLetterAt(x, "AG", c(2, 4))
which(rowSums(q1) == 2)

## or "how many probes in the drosophila2 chip have T, G, T, A at
## position 2, 4, 13 and 20, respectively?"
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
sum(rowSums(q2) == 4)

## or "what's the probability to have an A at position 25 if there is
## one at position 13?"
q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])

## Probabilities to have other bases at position 25 if there is an A
## at position 13:
sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1])  # C
sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1])  # G
sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1])  # T
## See ?nucleotideFrequencyAt for another way to get those results.
##
## neditAt() / isMatchingAt() / which.isMatchingAt()
##
subject <- DNAString("GTATA")

## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)

## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)

## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)

## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)

## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

some_starts <- c(10:10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]

which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1,
                   follow.index=TRUE)
##
## WITH INDELS
##
subject <- BString("ABCDEFxxxCDEFxxxABBCDE")

neditAt("ABCDEF", subject, at=9)
neditAt("ABCDEF", subject, at=9, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=1, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=2, with.indels=TRUE)
neditAt("ABCDEF", subject, at=17)
neditAt("ABCDEF", subject, at=17, with.indels=TRUE)
neditEndingAt("ABCDEF", subject, ending.at=22)
neditEndingAt("ABCDEF", subject, ending.at=22, with.indels=TRUE)
