afind: Stringdist-based fuzzy text search

View source: R/afind.R

afind    R Documentation



Description

afind slides a window of fixed width over a string x and computes the distance between each window and the sought-after pattern. The location, content, and distance of the best-matching window are returned.
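As a rough illustration of the sliding-window idea, the base-R sketch below scans one string for one pattern. It is not the package's implementation: slide_find is a hypothetical helper, it uses utils::adist (Levenshtein distance) instead of a stringdist method, and comparing strings shorter than the window as a whole is an assumption of the sketch.

```r
# Hypothetical helper: sliding-window search over a single string.
slide_find <- function(x, pattern, window = nchar(pattern)) {
  # Assumption of this sketch: strings shorter than the window are compared whole.
  window <- min(window, nchar(x))
  starts <- seq_len(nchar(x) - window + 1L)
  windows <- substring(x, starts, starts + window - 1L)
  # Levenshtein distance between every window and the pattern.
  d <- drop(utils::adist(windows, pattern))
  best <- which.min(d)  # first window with the smallest distance
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- slide_find("I want to be a fisherman", "fish")
print(res)
```

The default window width in the sketch equals the number of characters in the pattern, which keeps the example short; other widths can be passed through the window argument.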


Usage

afind(
  x,
  pattern,
  window = NULL,
  value = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
    "jaccard", "jw", "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

grab(x, pattern, maxDist = Inf, value = FALSE, ...)

grabl(x, pattern, maxDist = Inf, ...)

extract(x, pattern, maxDist = Inf, ...)



Arguments

x: strings to search in.

pattern: strings to find (not a regular expression). For grab, grabl, and extract this must be a single string.

window: width of the moving window.

value: toggle returning the matrix with matched strings.

method: matching algorithm to use. See stringdist-metrics.

useBytes: perform byte-wise comparison. See stringdist-encoding.

weight: for method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

q: q-gram size, only when method is 'qgram', 'jaccard', or 'cosine'.

p: Winkler's 'prefix' parameter for Jaro-Winkler distance, with 0 ≤ p ≤ 0.25. Only when method is 'jw'.

bt: Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.

nthread: number of threads used by the underlying C code. A sensible default is chosen, see stringdist-parallelization.

maxDist: only windows with distance <= maxDist are considered a match.

...: passed to afind.
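To make the effect of weight concrete, here is a minimal sketch that lowers the substitution penalty for method='osa'. The input strings are made up for illustration; only afind and its documented arguments are used.

```r
library(stringdist)

x <- "abcd efgh"
pattern <- "efgx"

# Default weights: the best window "efgh" differs from the pattern by one substitution.
d_default <- afind(x, pattern, method = "osa")$distance[1, 1]

# Halve the substitution penalty (weights must be positive and not exceed 1).
d_cheap_sub <- afind(x, pattern, method = "osa",
                     weight = c(d = 1, i = 1, s = 0.5, t = 1))$distance[1, 1]

print(c(default = d_default, cheap_substitution = d_cheap_sub))
```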


Details

Matching is case-sensitive. Both x and pattern are converted to UTF-8 prior to search, unless useBytes=TRUE, in which case the distances are measured bytewise.
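Because matching is case-sensitive, one possible workaround (a suggestion, not something the package prescribes) is to fold case with base R's tolower before searching. The input string below is made up for illustration.

```r
library(stringdist)

x <- "I want to be a FISHERMAN"

# Exact match required (maxDist = 0): upper-case text does not match the lower-case pattern.
strict <- grabl(x, "fish", maxDist = 0)

# Fold both sides to lower case first to get a case-insensitive search.
folded <- grabl(tolower(x), tolower("fish"), maxDist = 0)

print(c(strict = strict, folded = folded))
```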

Code is parallelized over the x variable: each value of x is scanned for every element in pattern using a separate thread (when nthread is larger than 1).

The functions grab and grabl are approximate string matching functions that somewhat resemble base R's grep and grepl. They are implemented as convenience wrappers of afind.
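The relation between the wrappers and afind can be sketched by thresholding the distance matrix by hand. This shows the idea only and assumes nothing about how the wrappers are written internally; by_hand is a name introduced here.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")

# Distance between "grew" and the best window in each text (one column, one pattern).
res <- afind(texts, "grew")
by_hand <- as.vector(res$distance <= 1)

# The convenience wrappers, called with the same threshold.
hits_logical <- grabl(texts, "grew", maxDist = 1)
hits_index <- grab(texts, "grew", maxDist = 1)

print(by_hand)
print(hits_logical)
print(hits_index)
```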


Value

For afind: a list of three matrices, each with length(x) rows and length(pattern) columns. In each matrix, element (i,j) corresponds to x[i] and pattern[j]. The names and descriptions of the matrices are as follows.

  • location. [integer], location of the start of best matching window. When useBytes=FALSE, this corresponds to the location of a UTF code point in x, possibly after conversion from its original encoding.

  • distance. [numeric], the string distance between pattern and the best matching window.

  • match. [character], the first, best matching window.

For grab, an integer vector indicating in which elements of x a match was found with a distance <= maxDist. When value=TRUE, the matched values are returned instead (analogous to grep).

For grabl, a logical vector indicating in which elements of x a match was found with a distance <= maxDist (analogous to grepl).

For extract, a character matrix with length(x) rows and length(pattern) columns. If a match was found, element (i,j) contains the match; otherwise it is set to NA.
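A short sketch of how the list returned by afind might be inspected, using the default method and the texts from the examples on this page. Rows correspond to elements of x and columns to elements of pattern.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
patterns <- c("fish", "gone", "to be")

res <- afind(texts, patterns)

# Every matrix has length(texts) rows and length(patterns) columns.
print(dim(res$distance))

# Best window for "fish" (column 1) in the fourth text (row 4).
print(res$location[4, 1])
print(res$distance[4, 1])
print(res$match[4, 1])
```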

Running cosine distance

This algorithm gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the "cosine" distance, but is much faster.
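The overlap argument can be sketched in base R. When the window moves one character to the right, only one q-gram leaves the profile and one enters, so the dot product and squared norm can be updated instead of recomputed. The code below is an illustrative sketch under that assumption, not the package's C implementation; all function names are hypothetical.

```r
# All q-grams of a string, in order of appearance.
qgrams_of <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1L)
  substring(s, starts, starts + q - 1L)
}

# q-gram counts over a fixed set of keys (zero where a key is absent).
qgram_profile <- function(grams, keys) {
  setNames(as.numeric(table(factor(grams, levels = keys))), keys)
}

# Cosine distance for every window, updating the profile incrementally.
running_cosine_sketch <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  gx <- qgrams_of(x, q)
  gp <- qgrams_of(pattern, q)
  per_win <- window - q + 1L          # q-grams per window
  n_win <- length(gx) - per_win + 1L  # number of windows
  stopifnot(length(gp) >= 1L, per_win >= 1L, n_win >= 1L)
  keys <- unique(c(gx, gp))
  p <- qgram_profile(gp, keys)
  v <- qgram_profile(gx[seq_len(per_win)], keys)
  dot <- sum(v * p); nv2 <- sum(v^2); np <- sqrt(sum(p^2))
  d <- numeric(n_win)
  d[1] <- 1 - dot / (sqrt(nv2) * np)
  for (i in seq_len(n_win - 1L)) {
    out <- gx[i]; inc <- gx[i + per_win]
    # Remove the q-gram that leaves the window: c^2 - (c - 1)^2 = 2c - 1.
    nv2 <- nv2 - (2 * v[[out]] - 1); dot <- dot - p[[out]]; v[[out]] <- v[[out]] - 1
    # Add the q-gram that enters the window: (c + 1)^2 - c^2 = 2c + 1.
    nv2 <- nv2 + (2 * v[[inc]] + 1); dot <- dot + p[[inc]]; v[[inc]] <- v[[inc]] + 1
    d[i + 1L] <- 1 - dot / (sqrt(nv2) * np)
  }
  d
}

# Reference version: rebuild both profiles from scratch for every window.
naive_cosine_sketch <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  gp <- qgrams_of(pattern, q)
  starts <- seq_len(nchar(x) - window + 1L)
  vapply(starts, function(s) {
    gw <- qgrams_of(substr(x, s, s + window - 1L), q)
    keys <- unique(c(gw, gp))
    a <- qgram_profile(gw, keys); b <- qgram_profile(gp, keys)
    1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
  }, numeric(1))
}

x <- "I want to be a fisherman"
d_running <- running_cosine_sketch(x, "fish", q = 2L)
d_naive <- naive_cosine_sketch(x, "fish", q = 2L)
print(all.equal(d_running, d_naive))
```

Each step of the running version touches two q-grams regardless of the window width, whereas the naive version rebuilds a full profile per window.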

See Also

Other matching: amatch()


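Examples

library(stringdist)

# four lines of text to search in
texts = c("When I grow up, I want to be"
       , "one of the harvesters of the sea"
       , "I think before my days are gone"
       , "I want to be a fisherman")
patterns = c("fish", "gone","to be")

# best-matching window for every text/pattern pair
afind(texts, patterns, method="running_cosine", q=3)

# which texts contain a window within distance 1 of "grew"?
grabl(texts,"grew", maxDist=1)
# extract the matched windows; NA where no window is within distance 3
extract(texts, "harvested", maxDist=3)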

stringdist documentation built on Nov. 10, 2022, 5:08 p.m.