| afind | R Documentation |
afind slides a window of fixed width over a string x and
computes the distance between each window and the sought-after
pattern. The location, content, and distance corresponding to the
window with the best match are returned.
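The sliding-window idea can be sketched in base R. This is only an illustration and not the package's implementation: slide_find is a hypothetical helper, and it uses utils::adist (Levenshtein distance) in place of the stringdist distance functions.

```r
# Hypothetical helper illustrating a sliding-window search in base R.
# It is NOT how afind is implemented; it only mirrors the idea.
slide_find <- function(x, pattern, window = nchar(pattern)) {
  n <- nchar(x)
  # Start positions of all windows; at least one window if x is short.
  starts <- seq_len(max(n - window + 1L, 1L))
  windows <- substring(x, starts, starts + window - 1L)
  # Levenshtein distance between every window and the pattern.
  d <- drop(utils::adist(windows, pattern))
  best <- which.min(d)  # first window with the smallest distance
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- slide_find("I want to be a fisherman", "fish")
res$location  # start of the best matching window
res$match     # content of that window
```

The helper returns the same three pieces of information described above (location, content, distance), but for a single string and a single pattern only.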
afind(
  x,
  pattern,
  window = NULL,
  value = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
    "jaccard", "jw", "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)
grab(x, pattern, maxDist = Inf, value = FALSE, ...)
grabl(x, pattern, maxDist = Inf, ...)
extract(x, pattern, maxDist = Inf, ...)
x |
strings to search in |
pattern |
strings to find (not a regular expression). For |
window |
width of moving window. |
value |
toggle return matrix with matched strings. |
method |
Matching algorithm to use. See |
useBytes |
Perform byte-wise comparison. See |
weight |
For |
q |
q-gram size, only when method is |
p |
Winkler's 'prefix' parameter for Jaro-Winkler distance, with
|
bt |
Winkler's boost threshold. Winkler's prefix factor is
only applied when the Jaro distance is larger than |
nthread |
Number of threads used by the underlying C-code. A sensible
default is chosen, see |
maxDist |
Only windows with distance |
... |
passed to |
Matching is case-sensitive. Both x and pattern are converted
to UTF-8 prior to search, unless useBytes=TRUE, in which case
the distances are measured bytewise.
Code is parallelized over the x variable: each value of x
is scanned for every element in pattern using a separate thread (when nthread
is larger than 1).
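Because matching is case-sensitive, a case-insensitive search needs the input to be case-folded first. A minimal sketch, assuming the stringdist package is installed and that tolower is an acceptable way to fold case for your data:

```r
library(stringdist)

x <- "I want to be a Fisherman"

# Case-sensitive: no window equals "fish", so the best distance is positive.
d_sensitive <- afind(x, "fish")$distance

# Case-folded input: the window "fish" now matches exactly.
d_folded <- afind(tolower(x), "fish")$distance
```

Note that location values found in the folded string only carry over to the original string when case folding does not change the number of characters.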
The functions grab and grabl are approximate string matching
functions that somewhat resemble base R's grep and
grepl. They are implemented as convenience wrappers
of afind.
For afind: a list of three matrices, each with
length(x) rows and length(pattern) columns. In each matrix,
element (i,j) corresponds to x[i] and pattern[j]. The
names and descriptions of the matrices are as follows.
location. [integer], location of the start of best matching window.
When useBytes=FALSE, this corresponds to the location of a UTF code point
in x, possibly after conversion from its original encoding.
distance. [numeric], the string distance between pattern and
the best matching window.
match. [character], the first, best matching window.
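The layout of the returned list can be inspected as follows. This is a small sketch, assuming the stringdist package is installed; the input strings are chosen only for illustration.

```r
library(stringdist)

texts <- c("I think before my days are gone", "I want to be a fisherman")
patterns <- c("fish", "gone")

res <- afind(texts, patterns)

names(res)          # the three matrices described above
dim(res$location)   # length(texts) rows, length(patterns) columns
res$match[2, 1]     # best matching window in texts[2] for patterns[1]
res$distance[2, 1]  # distance between that window and patterns[1]
```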
For grab, an integer vector, indicating in which elements of
x a match was found with a distance <= maxDist. When value=TRUE,
the matched values are returned instead (equivalent to grep).
For grabl, a logical vector, indicating in which elements of
x a match was found with a distance <= maxDist. (equivalent
to grepl).
For extract, a character matrix with length(x) rows and
length(pattern) columns. If a match was found, element (i,j)
contains the match, otherwise it is set to NA.
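The correspondence with grep and grepl can be sketched as follows, assuming the stringdist package is installed. Note that the argument order differs: base R takes the pattern first, while grab, grabl and extract take x first.

```r
library(stringdist)

texts <- c("When I grow up, I want to be", "I want to be a fisherman")

# Exact matching with base R: "grew" does not occur literally.
exact <- grepl("grew", texts, fixed = TRUE)

# Approximate matching: windows within distance 1 of "grew" count as hits.
hits <- grabl(texts, "grew", maxDist = 1)           # logical, like grepl
idx  <- grab(texts, "grew", maxDist = 1)            # indices, like grep
vals <- grab(texts, "grew", maxDist = 1, value = TRUE)
ex   <- extract(texts, "grew", maxDist = 1)         # matrix, NA where no match
```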
The "running_cosine" method gains efficiency by exploiting the fact that
two consecutive windows have a large overlap in their q-gram profiles. It
gives the same result as the "cosine" distance, but is much faster.
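One way to check the stated equivalence on your own data is to run both methods and compare the results. A minimal sketch, assuming the stringdist package is installed; the comparison calls are shown without expected output, since the outcome should be verified rather than assumed.

```r
library(stringdist)

texts <- c("one of the harvesters of the sea", "I want to be a fisherman")
patterns <- c("fish", "to be")

a <- afind(texts, patterns, method = "cosine", q = 3)
b <- afind(texts, patterns, method = "running_cosine", q = 3)

# Compare distances up to numerical tolerance, and the reported locations.
all.equal(a$distance, b$distance)
all.equal(a$location, b$location)
```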
Other matching:
amatch()
texts = c("When I grow up, I want to be"
, "one of the harvesters of the sea"
, "I think before my days are gone"
, "I want to be a fisherman")
patterns = c("fish", "gone","to be")
afind(texts, patterns, method="running_cosine", q=3)
grabl(texts,"grew", maxDist=1)
extract(texts, "harvested", maxDist=3)