The `matchPDict`

, `countPDict`

and `whichPDict`

functions
efficiently find the occurrences in a text (the subject) of all patterns
stored in a preprocessed dictionary.

This man page shows how to use these functions for inexact (or fuzzy) matching or when the original dictionary has a variable width.

See `?matchPDict`

for how to use these functions for exact
matching of a constant width dictionary i.e. a dictionary where all the
patterns have the same length (same number of nucleotides).

In this man page, we assume that you know how to preprocess
a dictionary of DNA patterns that can then be used with
`matchPDict`

, `countPDict`

or `whichPDict`

.
Please see `?PDict`

if you don't.

`matchPDict`

and family support different kinds of inexact
matching but with some restrictions. Inexact matching is controlled
via the definition of a Trusted Band during the preprocessing step
and/or via the `max.mismatch`

, `min.mismatch`

and `fixed`

arguments.
Defining a Trusted Band is also required when the original dictionary
is not rectangular (variable width), even for exact matching.
See `?PDict`

for how to define a Trusted Band.

Here is how `matchPDict`

and family handle the Trusted Band
defined on `pdict`

:

(1) Find all the exact matches of all the elements in the Trusted Band.

(2) For each element in the Trusted Band that has at least one exact match, compare the head and the tail of this element with the flanking sequences of the matches found in (1).

Note that the number of exact matches found in (1) will decrease
exponentially with the width of the Trusted Band.
Here is a simple guideline in order to get reasonably good
performance: if TBW is the width of the Trusted Band
(`TBW <- tb.width(pdict)`

) and L the number of letters in the
subject (`L <- nchar(subject)`

), then `L / (4^TBW)`

should
be kept as small as possible, typically < 10 or 20.

In addition, when a Trusted Band has been defined during preprocessing,
then `matchPDict`

and family can be called with `fixed=FALSE`

.
In this case, IUPAC ambiguity codes in the head or the tail of the
PDict object are treated as ambiguities.

Finally, `fixed="pattern"`

can be used to indicate that IUPAC
ambiguity codes in the subject should be treated as ambiguities.
It only works if the density of codes is not too high.
It works whether or not a Trusted Band has been defined on `pdict`

.

H. Pagès

Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.

PDict-class, MIndex-class, matchPDict

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | ```
## ---------------------------------------------------------------------
## A. USING AN EXPLICIT TRUSTED BAND
## ---------------------------------------------------------------------
library(drosophila2probe)
dict0 <- DNAStringSet(drosophila2probe)
dict0 # the original dictionary
## Preprocess the original dictionary by defining a Trusted Band that
## spans nucleotides 1 to 9 of each pattern.
pdict9 <- PDict(dict0, tb.end=9)
pdict9
tail(pdict9)
sum(duplicated(pdict9))
table(patternFrequency(pdict9))
library(BSgenome.Dmelanogaster.UCSC.dm3)
chr3R <- Dmelanogaster$chr3R
chr3R
table(countPDict(pdict9, chr3R, max.mismatch=1))
table(countPDict(pdict9, chr3R, max.mismatch=3))
table(countPDict(pdict9, chr3R, max.mismatch=5))
## ---------------------------------------------------------------------
## B. COMPARISON WITH EXACT MATCHING
## ---------------------------------------------------------------------
## When the original dictionary is of constant width, exact matching
## (i.e. 'max.mismatch=0' and 'fixed=TRUE) will be more efficient with
## a full-width Trusted Band (i.e. a Trusted Band that covers the entire
## dictionary) than with a Trusted Band of width < width(dict0).
pdict0 <- PDict(dict0)
count0 <- countPDict(pdict0, chr3R)
count0b <- countPDict(pdict9, chr3R, max.mismatch=0)
identical(count0b, count0) # TRUE
## ---------------------------------------------------------------------
## C. USING AN EXPLICIT TRUSTED BAND ON A VARIABLE WIDTH DICTIONARY
## ---------------------------------------------------------------------
## Here is a small variable width dictionary that contains IUPAC
## ambiguities (pattern 1 and 3 contain an N):
dict0 <- DNAStringSet(c("TACCNG", "TAGT", "CGGNT", "AGTAG", "TAGT"))
## (Note that pattern 2 and 5 are identical.)
## If we only want to do exact matching, then it is recommended to use
## the widest possible Trusted Band i.e. to set its width to
## 'min(width(dict0))' because this is what will give the best
## performance. However, when 'dict0' contains IUPAC ambiguities (like
## in our case), it could be that one of them is falling into the
## Trusted Band so we get an error (only base letters can go in the
## Trusted Band for now):
## Not run:
PDict(dict0, tb.end=min(width(dict0))) # Error!
## End(Not run)
## In our case, the Trusted Band cannot be wider than 3:
pdict <- PDict(dict0, tb.end=3)
tail(pdict)
subject <- DNAString("TAGTACCAGTTTCGGG")
m <- matchPDict(pdict, subject)
elementNROWS(m) # pattern 2 and 5 have 1 exact match
m[[2]]
## We can take advantage of the fact that our Trusted Band doesn't cover
## the entire dictionary to allow inexact matching on the uncovered parts
## (the tail in our case):
m <- matchPDict(pdict, subject, fixed=FALSE)
elementNROWS(m) # now pattern 1 has 1 match too
m[[1]]
m <- matchPDict(pdict, subject, max.mismatch=1)
elementNROWS(m) # now pattern 4 has 1 match too
m[[4]]
m <- matchPDict(pdict, subject, max.mismatch=1, fixed=FALSE)
elementNROWS(m) # now pattern 3 has 1 match too
m[[3]] # note that this match is "out of limit"
Views(subject, m[[3]])
m <- matchPDict(pdict, subject, max.mismatch=2)
elementNROWS(m) # pattern 4 gets 1 additional match
m[[4]]
## Unlist all matches:
unlist(m)
## ---------------------------------------------------------------------
## D. WITH IUPAC AMBIGUITY CODES IN THE PATTERNS
## ---------------------------------------------------------------------
## The Trusted Band cannot contain IUPAC ambiguity codes so patterns
## with ambiguity codes can only be preprocessed if we can define a
## Trusted Band with no ambiguity codes in it.
dict <- DNAStringSet(c("AAACAAKS", "GGGAAA", "TNCCGGG"))
pdict <- PDict(dict, tb.start=3, tb.width=4)
subject <- DNAString("AAACAATCCCGGGAAACAAGG")
matchPDict(pdict, subject)
matchPDict(pdict, subject, fixed="subject")
## Sanity checks:
res1 <- as.list(matchPDict(pdict, subject))
res2 <- as.list(matchPDict(dict, subject))
res3 <- lapply(dict,
function(pattern)
as(matchPattern(pattern, subject), "IRanges"))
stopifnot(identical(res1, res2))
stopifnot(identical(res1, res3))
res1 <- as.list(matchPDict(pdict, subject, fixed="subject"))
res2 <- as.list(matchPDict(dict, subject, fixed="subject"))
res3 <- lapply(dict,
function(pattern)
as(matchPattern(pattern, subject, fixed="subject"), "IRanges"))
stopifnot(identical(res1, res2))
stopifnot(identical(res1, res3))
## ---------------------------------------------------------------------
## E. WITH IUPAC AMBIGUITY CODES IN THE SUBJECT
## ---------------------------------------------------------------------
## 'fixed="pattern"' (or 'fixed=FALSE') can be used to indicate that
## IUPAC ambiguity codes in the subject should be treated as ambiguities.
pdict <- PDict(c("ACAC", "TCCG"))
matchPDict(pdict, DNAString("ACNCCGT"))
matchPDict(pdict, DNAString("ACNCCGT"), fixed="pattern")
matchPDict(pdict, DNAString("ACWCCGT"), fixed="pattern")
matchPDict(pdict, DNAString("ACRCCGT"), fixed="pattern")
matchPDict(pdict, DNAString("ACKCCGT"), fixed="pattern")
dict <- DNAStringSet(c("TTC", "CTT"))
pdict <- PDict(dict)
subject <- DNAString("CYTCACTTC")
mi1 <- matchPDict(pdict, subject, fixed="pattern")
mi2 <- matchPDict(dict, subject, fixed="pattern")
stopifnot(identical(as.list(mi1), as.list(mi2)))
``` |

Questions? Problems? Suggestions? Tweet to @rdrrHQ or email at ian@mutexlabs.com.

All documentation is copyright its authors; we didn't write any of that.