amatch | R Documentation |
Approximate string matching equivalents of R's native match and %in%.
amatch(
x,
table,
nomatch = NA_integer_,
matchNA = TRUE,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
maxDist = 0.1,
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
ain(x, table, ...)
x |
elements to be approximately matched: will be coerced to
|
table |
lookup table for matching. Will be coerced to |
nomatch |
The value to be returned when no match is found. This is coerced to integer. |
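matchNA |
Should NA be matched? The default mimics the behaviour of match, so that NA matches NA (see the note on NA handling below). |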
method |
Matching algorithm to use. See |
useBytes |
Perform byte-wise comparison. See |
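weight |
For method "osa" or "dl": the penalty for deletion, insertion, substitution and transposition, in that order. |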
maxDist |
Elements in x will not be matched with elements of table if their distance is larger than maxDist. |
q |
q-gram size, only when method is "qgram", "jaccard" or "cosine". |
p |
Winkler's 'prefix' parameter for Jaro-Winkler distance, with
|
bt |
Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. |
nthread |
Number of threads used by the underlying C-code. A sensible
default is chosen, see |
... |
parameters to pass to amatch (except nomatch, which ain sets itself). |
ain is currently defined as
ain <- function(x,table,...) amatch(x, table, nomatch=0,...) > 0
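The same idiom underlies base R's %in%, which is built on match with nomatch = 0. The following is a minimal sketch using only base R (exact matching, with illustrative vectors that are not part of this page) to show why comparing the positions with 0 yields a logical vector, and why nomatch cannot also be passed through the ... argument:

```r
x <- c("leia", "uhura")
lookup <- c("uhura", "leela")

# Positions of exact matches; 0 marks "no match" instead of NA
pos <- match(x, lookup, nomatch = 0L)
pos

# Comparing with 0 turns the positions into a logical vector
found <- pos > 0L
found

# This is the same result that %in% gives
identical(found, x %in% lookup)
```

Because nomatch is already fixed inside the wrapper, supplying it a second time would clash with the value the wrapper sets.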
amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned. ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
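The selection rule described here (closest element, first one on ties, nomatch beyond maxDist) can be sketched in base R. Note the assumptions: amatch_sketch is a hypothetical helper and not part of stringdist, and it uses utils::adist (generalized Levenshtein distance) rather than the default OSA metric or the package's C code, so it only illustrates the rule:

```r
# Hypothetical helper, for illustration only (not the stringdist implementation)
amatch_sketch <- function(x, table, maxDist, nomatch = NA_integer_) {
  # length(x) by length(table) matrix of Levenshtein distances
  d <- utils::adist(x, table)
  vapply(seq_along(x), function(i) {
    # which.min() returns the index of the first minimum, so ties go to
    # the earliest element of table
    j <- which.min(d[i, ])
    if (length(j) == 1L && d[i, j] <= maxDist) j else as.integer(nomatch)
  }, integer(1))
}

near <- amatch_sketch("leia", c("uhura", "leela"), maxDist = 5)
strict <- amatch_sketch("leia", c("uhura", "leela"), maxDist = 1)
tied <- amatch_sketch("ab", c("ac", "ad"), maxDist = 1)
near    # 2: "leela" is closest (distance 2)
strict  # NA: no element lies within distance 1
tied    # 1: both candidates are at distance 1, the first wins
```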
NA handling
R's native match function matches NA with NA. This may feel inconsistent with R's usual NA handling, since for example NA==NA yields NA rather than TRUE. In most cases, one may reason about the behaviour under NA along the lines of “if one of the arguments is NA, the result shall be NA”, simply because not all information necessary to execute the function is available. One uses special functions such as is.na, is.null etc. to handle special values.
The amatch function mimics the behaviour of match by default: NA is matched with NA and with nothing else. Note that this is inconsistent with the behaviour of stringdist, since stringdist yields NA when at least one of the arguments is NA. The same inconsistency exists between match and adist. In amatch this behaviour can be controlled by setting matchNA=FALSE. In that case, if any element of x is NA, the nomatch value is returned for that element, regardless of whether NA is present in table. In match the behaviour can be controlled by setting the incomparables argument.
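The base R side of this contrast can be checked directly. The sketch below uses only base R and utils (the input values are illustrative); amatch with matchNA=TRUE and matchNA=FALSE is described above as behaving like the second and third calls, respectively:

```r
# Comparison propagates NA ...
eq <- NA == NA
eq                                              # NA

# ... but match() treats NA as an ordinary value by default
m_default <- match(NA, c("a", NA))
m_default                                       # 2

# incomparables switches that off: NA in x now yields the nomatch value
m_incomp <- match(NA, c("a", NA), incomparables = NA)
m_incomp                                        # NA

# adist() follows the "NA in, NA out" convention
d <- utils::adist(NA, "a")
d                                               # 1 x 1 matrix containing NA
```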
Other matching:
afind()
# let's see which sci-fi heroes are stringdistantly nearest
amatch("leia",c("uhura","leela"),maxDist=5)
# we can restrict the search
amatch("leia",c("uhura","leela"),maxDist=1)
# we can match each value in the find vector against values in the lookup table:
amatch(c("leia","uhura"),c("ripley","leela","scully","trinity"),maxDist=2)
# setting nomatch returns a different value when no match is found
amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0)
# this is always true if maxDist is Inf
ain("leia",c("uhura","leela"),maxDist=Inf)
# Let's look in a neighbourhood of maximum 2 typos (by default, the OSA algorithm is used)
ain("leia",c("uhura","leela"), maxDist=2)