amatch: Approximate string matching

View source: R/amatch.R

amatch R Documentation

Description

Approximate string matching equivalents of R's native match and %in%.


Usage

amatch(
  x,
  table,
  nomatch = NA_integer_,
  matchNA = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0.1,
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

ain(x, table, ...)



Arguments

x: Elements to be approximately matched. Will be coerced to character unless it is a list consisting of integer vectors.

table: Lookup table for matching. Will be coerced to character unless it is a list consisting of integer vectors.

nomatch: The value to be returned when no match is found. This is coerced to integer.

matchNA: Should NAs be matched? The default mimics the behaviour of base match, meaning that NA matches NA (see also the note on NA handling below).

method: Matching algorithm to use. See stringdist-metrics.

useBytes: Perform byte-wise comparison. See stringdist-encoding.

weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

maxDist: Elements in x will not be matched with elements of table if their distance is larger than maxDist. Note that the maximum distance between strings depends on the method, so maxDist should always be specified.

q: q-gram size. Used only when method is 'qgram', 'jaccard', or 'cosine'.

p: Winkler's 'prefix' parameter for Jaro-Winkler distance, with 0 ≤ p ≤ 0.25. Used only when method is 'jw'.

bt: Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.

nthread: Number of threads used by the underlying C-code. A sensible default is chosen, see stringdist-parallelization.

...: Parameters to pass to amatch (except nomatch).
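
The remark that maxDist should always be specified can be illustrated with a small sketch. It assumes the stringdist package is installed; the strings are arbitrary illustrative inputs. Edit-based distances such as 'osa' count edits, while 'jw' distances lie between 0 and 1, so the same maxDist means very different things under the two methods.

```r
library(stringdist)

x   <- "leia"
tab <- c("uhura", "leela")

# OSA distances count edits: 4 edits to "uhura", 2 edits to "leela".
stringdist(x, tab, method = "osa")

# Jaro-Winkler distances are scaled to the interval [0, 1].
stringdist(x, tab, method = "jw")

# With the default maxDist = 0.1, only exact matches pass under "osa".
amatch(x, tab, method = "osa")               # NA

# A bound on the scale of the chosen metric admits the nearest entry.
amatch(x, tab, method = "osa", maxDist = 2)  # 2
```

A bound such as maxDist = 2 is only meaningful for the edit-based methods; for 'jw' a bound between 0 and 1 is the natural choice.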


Details

ain is currently defined as

ain <- function(x, table, ...) amatch(x, table, nomatch = 0, ...) > 0
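
To see what this definition implies in practice, the following sketch writes it out as a stand-alone function and compares it with ain. The name my_ain is hypothetical and used only for illustration; the input vectors are arbitrary, and the stringdist package is assumed to be installed.

```r
library(stringdist)

x   <- c("leia", "uhura", "spock")
tab <- c("uhura", "leela")

# Hypothetical stand-in for ain(), written out from the definition above:
# nomatch = 0 turns "no match" into 0, so `> 0` yields a logical vector.
my_ain <- function(x, table, ...) amatch(x, table, nomatch = 0, ...) > 0

my_ain(x, tab, maxDist = 2)  # TRUE TRUE FALSE
ain(x, tab, maxDist = 2)     # TRUE TRUE FALSE
```

Because nomatch is fixed inside the definition, it cannot be passed through the ... argument of ain.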


Value

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned. ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
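
A short sketch of the two rules stated above (closest match wins, ties go to the first position). The strings are arbitrary illustrative inputs, and the stringdist package is assumed to be installed.

```r
library(stringdist)

# The exact match at position 3 (distance 0) beats the two entries at distance 1.
amatch("abc", c("abd", "abe", "abc"), maxDist = 1)  # 3

# Both table entries are at distance 1, so the first position is returned.
amatch("abx", c("aby", "abz"), maxDist = 1)         # 1

# ain returns one logical value per element of x.
ain(c("abx", "zzz"), c("aby", "abz"), maxDist = 1)  # TRUE FALSE
```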

Note on NA handling

R's native match function matches NA with NA. This may feel inconsistent with R's usual NA handling, since for example NA==NA yields NA rather than TRUE. In most cases, one may reason about the behaviour under NA along the lines of “if one of the arguments is NA, the result shall be NA”, simply because not all information necessary to execute the function is available. One uses special functions such as is.na, is.null, etc. to handle special values.

The amatch function mimics the behaviour of match by default: NA is matched with NA and with nothing else. Note that this is inconsistent with the behaviour of stringdist since stringdist yields NA when at least one of the arguments is NA. The same inconsistency exists between match and adist. In amatch this behaviour can be controlled by setting matchNA=FALSE. In that case, if any of the arguments in x is NA, the nomatch value is returned, regardless of whether NA is present in table. In match the behaviour can be controlled by setting the incomparables option.
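
The behaviours described in this note can be compared side by side. This is a minimal sketch with arbitrary inputs, assuming the stringdist package is installed.

```r
library(stringdist)

tab <- c("a", NA)

# Base R: comparison propagates NA, but match() pairs NA with NA.
NA == NA                                        # NA
match(NA_character_, tab)                       # 2

# amatch mimics match() by default ...
amatch(NA_character_, tab)                      # 2

# ... while matchNA = FALSE returns the nomatch value for NA in x,
# even though NA is present in the table.
amatch(NA_character_, tab, matchNA = FALSE)     # NA
amatch(NA_character_, tab, matchNA = FALSE, nomatch = 0)  # 0

# stringdist itself yields NA when an argument is NA.
stringdist(NA_character_, "a")                  # NA

# In match(), the incomparables argument gives the analogous control.
match(NA_character_, tab, incomparables = NA)   # NA
```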

See Also

Other matching: afind()


Examples

# let's see which sci-fi heroes are stringdistantly nearest

# we can restrict the search

# we can match each value in the find vector against values in the lookup table:

# setting nomatch returns a different value when no match is found

# this is always true if maxDist is Inf

# Let's look in a neighbourhood of at most 2 typos (by default, the OSA algorithm is used)
ain("leia",c("uhura","leela"), maxDist=2)
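
Most of the comments above appear without an accompanying call. The following is a minimal sketch of calls that would illustrate each of them, assuming the stringdist package is installed. The maxDist values and the longer lookup vector are illustrative assumptions, not taken from the text above.

```r
library(stringdist)

heroes <- c("uhura", "leela")

# Nearest entry within a generous bound: "leela" is 2 edits away, "uhura" 4.
amatch("leia", heroes, maxDist = 5)               # 2

# A tighter bound excludes both entries.
amatch("leia", heroes, maxDist = 1)               # NA

# Each element of x is matched against the lookup table separately.
amatch(c("leia", "uhura"),
       c("ripley", "leela", "scully", "trinity"),
       maxDist = 2)                               # 2 NA

# nomatch replaces the default NA when nothing is close enough.
amatch("leia", heroes, maxDist = 1, nomatch = 0)  # 0

# With an infinite bound, every element of x has a match in a non-empty table.
ain("leia", heroes, maxDist = Inf)                # TRUE
```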

stringdist documentation built on Nov. 10, 2022, 5:08 p.m.