afind: Stringdist-based fuzzy text search


View source: R/afind.R

Description

afind slides a window of fixed width over a string x and computes the distance between each window and the sought-after pattern. The location, content, and distance of the best-matching window are returned.
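The sliding-window idea can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: best_window is a hypothetical helper, utils::adist (Levenshtein distance) stands in for the stringdist metrics, and the window width is assumed to default to the number of characters in the pattern.

```r
# Hypothetical helper, not part of stringdist: brute-force sliding-window
# search with base R's adist() (Levenshtein distance) as the metric.
best_window <- function(x, pattern, window = nchar(pattern)) {
  n <- nchar(x)
  # Start positions of all full-width windows; a single window if x is short
  starts <- if (n <= window) 1L else seq_len(n - window + 1L)
  windows <- substring(x, starts, starts + window - 1L)
  # Distance between every window and the pattern
  d <- drop(utils::adist(windows, pattern))
  i <- which.min(d)  # first window with the smallest distance
  list(location = starts[i], distance = d[i], match = windows[i])
}

res <- best_window("I want to be a fisherman", "fish")
print(res)
```

The sketch recomputes the distance from scratch for every window, which is the simplest way to show the idea but not the fastest.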

Usage

afind(
  x,
  pattern,
  window = NULL,
  value = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
    "jaccard", "jw", "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

grab(x, pattern, maxDist = Inf, value = FALSE, ...)

grabl(x, pattern, maxDist = Inf, ...)

extract(x, pattern, maxDist = Inf, ...)

Arguments

x

strings to search in

pattern

strings to find (not a regular expression). For grab, grabl, and extract this must be a single string.

window

width of moving window.

value

Toggle: if TRUE, a matrix with the matched strings is included in the result.

method

Matching algorithm to use. See stringdist-metrics.

useBytes

Perform byte-wise comparison. See stringdist-encoding.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

q

q-gram size, only when method is 'qgram', 'jaccard', or 'cosine'.

p

Winkler's 'prefix' parameter for Jaro-Winkler distance, with 0 ≤ p ≤ 0.25. Only when method is 'jw'.

bt

Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.

nthread

Number of threads used by the underlying C-code. A sensible default is chosen, see stringdist-parallelization.

maxDist

Only windows with distance <= maxDist are considered a match.

...

passed to afind.

Details

Matching is case-sensitive. Both x and pattern are converted to UTF-8 prior to search, unless useBytes=TRUE, in which case the distances are measured bytewise.

Code is parallelized over the x variable: each value of x is scanned for every element in pattern using a separate thread (when nthread is larger than 1).

The functions grab and grabl are approximate string matching functions that somewhat resemble base R's grep and grepl. They are implemented as convenience wrappers of afind.
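As a rough illustration of how the wrappers differ from exact matching, the sketch below contrasts base R's grepl with grabl and grab on two of the example sentences. It assumes the stringdist package is installed and uses the default method; printed results are not shown here.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "I want to be a fisherman")

# Exact matching: no element contains the literal string "grew"
grepl("grew", texts, fixed = TRUE)

# Approximate matching: "grow" is one substitution away from "grew",
# so the first element counts as a match when maxDist = 1
hits <- grabl(texts, "grew", maxDist = 1)
hits

# Same search, returning the indices of matching elements instead
grab(texts, "grew", maxDist = 1)
```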

Value

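For afind: a list of three matrices, each with length(x) rows and length(pattern) columns. In each matrix, element (i,j) corresponds to x[i] and pattern[j]. The names and descriptions of the matrices are as follows.

location: [integer] location of the start of the best-matching window. When useBytes=FALSE, this corresponds to the location of a UTF code point in x, possibly not to the location of a byte.

distance: the string distance between pattern and the best-matching window.

match: [character] the first best-matching window.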

For grab, an integer vector indicating in which elements of x a match was found with a distance <= maxDist. When value=TRUE, the matched values are returned instead (analogous to grep).

For grabl, a logical vector indicating in which elements of x a match was found with a distance <= maxDist (analogous to grepl).

For extract, a character matrix with length(x) rows and length(pattern) columns. If a match was found, element (i,j) contains the match; otherwise it is set to NA.
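A short sketch of how the afind result might be inspected, assuming the stringdist package is installed. The texts and patterns are taken from the Examples section; the matrices are accessed by position so that the code does not depend on their names.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "I want to be a fisherman")
patterns <- c("fish", "to be")

res <- afind(texts, patterns)  # default method

names(res)     # names of the returned matrices
dim(res[[1]])  # length(texts) rows, length(patterns) columns

# Entries for texts[2] and patterns[1], one value from each matrix
lapply(res, function(m) m[2, 1])
```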

Running cosine distance

This algorithm gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the "cosine" distance, but is much faster.
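The overlap can be made concrete with a base R sketch. It is not the package's C implementation, and all function names below are hypothetical. It assumes nchar(x) >= window >= q and a pattern of at least q characters. Moving the window one position to the right removes exactly one q-gram from the profile and adds one, so the profile is updated in place instead of being rebuilt.

```r
# Illustrative sketch only; all names here are hypothetical.

# q-grams of a single string, in order of appearance
qgrams <- function(s, q) {
  stopifnot(nchar(s) >= q)
  starts <- seq_len(nchar(s) - q + 1L)
  substring(s, starts, starts + q - 1L)
}

# Count vector of q-grams over a fixed set of keys
profile_counts <- function(g, keys) tabulate(match(g, keys), nbins = length(keys))

cosine_dist <- function(u, v) 1 - sum(u * v) / sqrt(sum(u^2) * sum(v^2))

# Incremental version: update the window profile by one q-gram out, one in
running_cosine_sketch <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  gx <- qgrams(x, q)
  keys <- unique(c(gx, qgrams(pattern, q)))
  p <- profile_counts(qgrams(pattern, q), keys)
  idx <- match(gx, keys)            # key index of each q-gram of x
  per_win <- window - q + 1L        # number of q-grams in one window
  n_win <- nchar(x) - window + 1L   # number of windows
  w <- profile_counts(gx[seq_len(per_win)], keys)  # profile of first window
  d <- numeric(n_win)
  d[1] <- cosine_dist(w, p)
  for (i in seq_len(n_win - 1L)) {
    w[idx[i]] <- w[idx[i]] - 1                      # q-gram leaving on the left
    w[idx[i + per_win]] <- w[idx[i + per_win]] + 1  # q-gram entering on the right
    d[i + 1L] <- cosine_dist(w, p)
  }
  d
}

# Reference version: rebuild the profile of every window from scratch
naive_cosine <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  keys <- unique(c(qgrams(x, q), qgrams(pattern, q)))
  p <- profile_counts(qgrams(pattern, q), keys)
  starts <- seq_len(nchar(x) - window + 1L)
  vapply(starts, function(i) {
    w <- profile_counts(qgrams(substring(x, i, i + window - 1L), q), keys)
    cosine_dist(w, p)
  }, numeric(1))
}

fast <- running_cosine_sketch("I want to be a fisherman", "fish")
slow <- naive_cosine("I want to be a fisherman", "fish")
all.equal(fast, slow)  # both approaches give the same distances
which.min(fast)        # start of the best-matching window
```

The sketch still evaluates the full dot product for every window, so it only illustrates the profile update; it makes no claim about how the package computes the distance itself.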

See Also

Other matching: amatch()

Examples

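# Strings to search in
texts = c("When I grow up, I want to be"
       , "one of the harvesters of the sea"
       , "I think before my days are gone"
       , "I want to be a fisherman")
# Strings to find
patterns = c("fish", "gone","to be")

# Best-matching window for every combination of text and pattern
afind(texts, patterns, method="running_cosine", q=3)

# Which texts contain a window within distance 1 of "grew"?
grabl(texts,"grew", maxDist=1)
# Extract the best match for "harvested" where the distance is at most 3
extract(texts, "harvested", maxDist=3)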

stringdist documentation built on Sept. 9, 2021, 5:08 p.m.