amatch: Approximate string matching

amatch    R Documentation

Description

Approximate string matching equivalents of R's native match and %in%.

Usage

amatch(
  x,
  table,
  nomatch = NA_integer_,
  matchNA = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0.1,
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

ain(x, table, ...)

Arguments

x

Elements to be approximately matched. Will be coerced to character unless it is a list consisting of integer vectors.

table

Lookup table for matching. Will be coerced to character unless it is a list consisting of integer vectors.

nomatch

The value to be returned when no match is found. This is coerced to integer.

matchNA

Should NAs be matched? The default mimics the behaviour of base match, meaning that NA matches NA (see also the note on NA handling below).

method

Matching algorithm to use. See stringdist-metrics.

useBytes

Perform byte-wise comparison. See stringdist-encoding.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

maxDist

Elements in x will not be matched with elements of table if their distance is larger than maxDist. Because the range of possible distances between strings depends on the method, maxDist should always be specified explicitly.

q

q-gram size. Only used when method is 'qgram', 'jaccard', or 'cosine'.

p

Winkler's 'prefix' parameter for the Jaro-Winkler distance, with 0 <= p <= 0.25. Only used when method is 'jw'.

bt

Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.

nthread

Number of threads used by the underlying C code. A sensible default is chosen; see stringdist-parallelization.

...

Parameters to pass to amatch (except nomatch).
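
The interplay of method, maxDist, weight and p described above can be illustrated with a small sketch. It assumes the stringdist package is installed and uses its stringdist() function (the same one mentioned in the note on NA handling) to inspect the raw distances; the example strings and thresholds are chosen purely for illustration.

```r
library(stringdist)

# Edit-based distances such as OSA count (weighted) edit operations,
# so they are not confined to the interval [0, 1].
d_osa <- stringdist("leia", "leela", method = "osa")

# Jaro-Winkler distances, in contrast, lie between 0 and 1.
d_jw <- stringdist("leia", "leela", method = "jw")

# With the default maxDist = 0.1, the OSA distance above is too large,
# so no match is reported.
default_match <- amatch("leia", "leela")

# A maxDist chosen on the scale of the method does yield a match.
osa_match <- amatch("leia", "leela", method = "osa", maxDist = 2)
jw_match  <- amatch("leia", "leela", method = "jw",  maxDist = 0.5)

# Lowering the transposition penalty makes swapped characters cheaper
# (order of weights: deletion, insertion, substitution, transposition).
d_swap       <- stringdist("ab", "ba", method = "osa")
d_swap_cheap <- stringdist("ab", "ba", method = "osa",
                           weight = c(d = 1, i = 1, s = 1, t = 0.5))

# A positive prefix parameter p reduces the Jaro-Winkler distance
# for strings that share a common prefix (here "le").
d_jw_prefix <- stringdist("leia", "leela", method = "jw", p = 0.1)
```

Inspecting distances with stringdist() before settling on a maxDist is one way to pick a threshold on the right scale for the chosen method.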

Details

ain is currently defined as

ain <- function(x,table,...) amatch(x, table, nomatch=0,...) > 0
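
As a quick check of this definition, the following sketch (assuming the stringdist package is installed; the vectors are taken from the Examples section) compares ain with the equivalent amatch call.

```r
library(stringdist)

x     <- c("leia", "uhura")
table <- c("ripley", "leela", "scully", "trinity")

# Positions of the closest matches, with 0 meaning "no match found".
pos <- amatch(x, table, maxDist = 2, nomatch = 0)

# ain() should agree with testing those positions against 0.
same <- identical(ain(x, table, maxDist = 2), pos > 0)
```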

Value

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance exist, the first one is returned. ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
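
The following sketch illustrates these return values. It assumes the stringdist package is installed; the lookup strings are made up for illustration.

```r
library(stringdist)

table <- c("lela", "leix", "leia")

# "leia" is at distance 1 from the first two entries and at distance 0
# from the third, so the position of the closest entry is returned.
closest <- amatch("leia", table, maxDist = 1)

# With only the two distance-1 candidates left, the first one is returned.
tie <- amatch("leia", table[1:2], maxDist = 1)

# ain() yields one logical value per element of x.
present <- ain(c("leia", "spock"), table, maxDist = 1)
```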

Note on NA handling

R's native match function matches NA with NA. This may feel inconsistent with R's usual NA handling, since for example NA==NA yields NA rather than TRUE. In most cases, one may reason about the behaviour under NA along the lines of “if one of the arguments is NA, the result shall be NA”, simply because not all information necessary to execute the function is available. One uses special functions such as is.na, is.null etc. to handle special values.

The amatch function mimics the behaviour of match by default: NA is matched with NA and with nothing else. Note that this is inconsistent with the behaviour of stringdist, since stringdist yields NA when at least one of the arguments is NA. The same inconsistency exists between match and adist. In amatch this behaviour can be controlled by setting matchNA=FALSE. In that case, the nomatch value is returned for every element of x that is NA, regardless of whether NA is present in table. In match the behaviour can be controlled by setting the incomparables argument.
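
A short sketch of the behaviour described in this note, assuming the stringdist package is installed (the input vectors are illustrative only):

```r
library(stringdist)

x     <- c("leia", NA)
table <- c("leela", NA)

# Default (matchNA = TRUE): NA in x matches NA in table, as in base match().
with_na <- amatch(x, table, maxDist = 2)

# matchNA = FALSE: the NA element of x gets the nomatch value instead.
without_na <- amatch(x, table, maxDist = 2, matchNA = FALSE)

# Base R counterparts: match() pairs NA with NA unless NA is declared
# incomparable.
base_default      <- match(NA, c("a", NA))
base_incomparable <- match(NA, c("a", NA), incomparables = NA)

# stringdist() itself propagates NA.
dist_na <- stringdist("leia", NA)
```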

See Also

Other matching: afind()

Examples


# let's see which sci-fi heroes are stringdistantly nearest
amatch("leia",c("uhura","leela"),maxDist=5)

# we can restrict the search
amatch("leia",c("uhura","leela"),maxDist=1)

# we can match each value in the find vector against values in the lookup table
amatch(c("leia","uhura"),c("ripley","leela","scully","trinity"),maxDist=2)

# setting nomatch returns a different value when no match is found
amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0)

# this is always true if maxDist is Inf
ain("leia",c("uhura","leela"),maxDist=Inf)

# let's look in a neighbourhood of at most 2 typos (by default, the OSA algorithm is used)
ain("leia",c("uhura","leela"), maxDist=2)



stringdist documentation built on May 29, 2024, 11:13 a.m.