afind | R Documentation |
afind slides a window of fixed width over a string x and computes the distance between each window and the sought-after pattern. The location, content, and distance corresponding to the window with the best match are returned.
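The sliding-window idea described above can be sketched in base R. This is only an illustration, not the package's implementation: `naive_afind` is a hypothetical helper, it uses `utils::adist` (generalized Levenshtein distance) in place of the stringdist metrics, and the default window width equal to the pattern length is an assumption.

```r
# Hypothetical helper: naive sliding-window search using base R only.
naive_afind <- function(x, pattern, window = nchar(pattern)) {
  n <- nchar(x)
  # Start positions of all windows (at least one, even for short strings)
  starts <- seq_len(max(n - window + 1L, 1L))
  windows <- substring(x, starts, starts + window - 1L)
  # Levenshtein distance between every window and the pattern
  d <- as.vector(utils::adist(windows, pattern))
  best <- which.min(d)  # first minimum, i.e. the first best window
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- naive_afind("I want to be a fisherman", "fish")
res$location  # start position of the best window
res$match     # content of the best window
```

The sketch recomputes every distance from scratch, so it is only meant to show what "location, content, and distance of the best window" refers to.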
afind(
  x,
  pattern,
  window = NULL,
  value = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine",
             "running_cosine", "jaccard", "jw", "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

grab(x, pattern, maxDist = Inf, value = FALSE, ...)

grabl(x, pattern, maxDist = Inf, ...)

extract(x, pattern, maxDist = Inf, ...)
x | strings to search in |
pattern | strings to find (not a regular expression). For |
window | width of moving window. |
value | toggle return matrix with matched strings. |
method | Matching algorithm to use. See |
useBytes | Perform byte-wise comparison. See |
weight | For |
q | q-gram size, only when method is |
p | Winkler's 'prefix' parameter for Jaro-Winkler distance, with 0 ≤ p ≤ 0.25. Only when method is |
bt | Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than |
nthread | Number of threads used by the underlying C-code. A sensible default is chosen, see |
maxDist | Only windows with distance |
... | passed to |
Matching is case-sensitive. Both x and pattern are converted to UTF-8 prior to search, unless useBytes=TRUE, in which case the distances are measured bytewise.
Code is parallelized over the x variable: each value of x is scanned for every element in pattern using a separate thread (when nthread is larger than 1).
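To see why character-wise and byte-wise measurement can differ, and what case sensitivity means for a distance, here is a small base R illustration. It does not call afind; `utils::adist` stands in for a string distance, and the example string is made up for this purpose.

```r
# A string with one non-ASCII character, written as a Unicode escape
s <- enc2utf8("Caf\u00e9")

n_chars <- nchar(s, type = "chars")  # 4 code points
n_bytes <- nchar(s, type = "bytes")  # 5 bytes: the accented e takes two bytes in UTF-8

# Case-sensitive comparison: "F" and "f" count as different symbols
d_case   <- utils::adist("Fish", "fish")[1, 1]
d_nocase <- utils::adist("Fish", "fish", ignore.case = TRUE)[1, 1]
```

With a byte-wise comparison a single accented character contributes more than one symbol, so both window positions and distances can shift relative to the character-wise result.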
The functions grab and grabl are approximate string matching functions that somewhat resemble base R's grep and grepl. They are implemented as convenience wrappers of afind.
For afind: a list of three matrices, each with length(x) rows and length(pattern) columns. In each matrix, element (i,j) corresponds to x[i] and pattern[j]. The names and descriptions of the matrices are as follows.
location. [integer], location of the start of the best matching window. When useBytes=FALSE, this corresponds to the location of a UTF code point in x, possibly after conversion from its original encoding.
distance. [numeric], the string distance between pattern and the best matching window.
match. [character], the first, best matching window.
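A short sketch of how the returned list might be inspected. It assumes the stringdist package is installed and uses the default method; the two input strings are taken from the examples on this page.

```r
library(stringdist)

texts <- c("When I grow up, I want to be", "I want to be a fisherman")
patterns <- c("fish", "to be")

out <- afind(texts, patterns)

names(out)          # names of the matrices in the list
dim(out$location)   # length(texts) rows, length(patterns) columns

# Element (2, 1) relates texts[2] to patterns[1]
out$location[2, 1]
out$distance[2, 1]
out$match[2, 1]
```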
For grab, an integer vector, indicating in which elements of x a match was found with a distance <= maxDist. When value=TRUE, the matched values are returned instead (equivalent to grep).
For grabl, a logical vector, indicating in which elements of x a match was found with a distance <= maxDist (equivalent to grepl).
For extract, a character matrix with length(x) rows and length(pattern) columns. If a match was found, element (i,j) contains the match; otherwise it is set to NA.
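The three wrappers can be compared side by side. This sketch assumes the stringdist package is installed and reuses the strings from the examples on this page; the variable names are illustrative only.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")

# Approximate counterparts of grep() and grepl()
hits_idx <- grab(texts, "grew", maxDist = 1)    # positions in texts
hits_lgl <- grabl(texts, "grew", maxDist = 1)   # one logical per text

# Best matching window per text, NA where the distance exceeds maxDist
found <- extract(texts, "harvested", maxDist = 3)
```

Since grab and grabl apply the same threshold to the same distances, the positions in `hits_idx` should coincide with the TRUE entries of `hits_lgl`.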
The "running_cosine" method gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the "cosine" distance, but is much faster.
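The overlap argument can be made concrete with a base R sketch. This is not the package's C implementation: `qgram_seq`, `cosine_naive`, and `running_cosine_sketch` are hypothetical helpers, and the sketch assumes that both strings are at least q characters long and that the window is at least q wide. When the window moves one position, exactly one q-gram leaves and one enters, so the dot product and the squared norm of the window profile can be updated instead of recomputed.

```r
# All q-grams of a string, in order of appearance (hypothetical helper)
qgram_seq <- function(s, q) {
  n <- nchar(s)
  substring(s, seq_len(n - q + 1L), q:n)
}

# Reference: cosine distance between q-gram profiles, computed from scratch
cosine_naive <- function(a, b, q) {
  ga <- qgram_seq(a, q); gb <- qgram_seq(b, q)
  keys <- union(ga, gb)
  va <- as.vector(table(factor(ga, levels = keys)))
  vb <- as.vector(table(factor(gb, levels = keys)))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# Cosine distance for every window, updating the profile incrementally
running_cosine_sketch <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  gx <- qgram_seq(x, q)
  gp <- qgram_seq(pattern, q)
  keys <- union(gp, gx)
  p <- setNames(as.vector(table(factor(gp, levels = keys))), keys)
  k <- window - q + 1L  # number of q-grams per window
  w <- setNames(as.vector(table(factor(gx[seq_len(k)], levels = keys))), keys)
  dot <- sum(w * p); w2 <- sum(w^2); p2 <- sum(p^2)
  n_win <- nchar(x) - window + 1L
  d <- numeric(n_win)
  d[1] <- 1 - dot / sqrt(w2 * p2)
  for (i in seq_len(n_win)[-1]) {
    out <- gx[i - 1L]       # q-gram leaving the window
    inn <- gx[i + k - 1L]   # q-gram entering the window
    w2 <- w2 - 2 * w[[out]] + 1; dot <- dot - p[[out]]; w[[out]] <- w[[out]] - 1
    w2 <- w2 + 2 * w[[inn]] + 1; dot <- dot + p[[inn]]; w[[inn]] <- w[[inn]] + 1
    d[i] <- 1 - dot / sqrt(w2 * p2)
  }
  d
}

x <- "I want to be a fisherman"
pat <- "fish"
d_run <- running_cosine_sketch(x, pat, q = 2L)
d_ref <- vapply(seq_len(nchar(x) - nchar(pat) + 1L),
                function(i) cosine_naive(substr(x, i, i + nchar(pat) - 1L), pat, 2L),
                numeric(1))
all.equal(d_run, d_ref)  # both approaches agree
```

Each step of the loop touches only two entries of the window profile, whereas the reference version rebuilds both profiles for every window.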
Other matching: amatch()
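texts = c("When I grow up, I want to be",
          "one of the harvesters of the sea",
          "I think before my days are gone",
          "I want to be a fisherman")
patterns = c("fish", "gone", "to be")

# best matching window for every combination of text and pattern
afind(texts, patterns, method = "running_cosine", q = 3)

# which texts contain a window within distance 1 of "grew"?
grabl(texts, "grew", maxDist = 1)

# best match for "harvested" in each text, NA if its distance exceeds 3
extract(texts, "harvested", maxDist = 3)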