fuzzy_match: Link data.tables by fuzzy string matching


View source: R/fuzzy_match.R

Description

For each string in a, finds the closest matching string in b. The default method computes Jaro-Winkler string distances using the stringdist package. When a string has several equally close matches, only the first is reported.

Usage

fuzzy_match(a, b, method = "jw", cutoff = 0.5, ...)

fuzzy_check(a, b, method = "jw", ...)

Arguments

a

a source vector of strings

b

a target vector of strings in which to look for matches

method

string distance method passed to stringdistmatrix; the default, "jw", is Jaro-Winkler

cutoff

numeric indicating the maximum distance threshold for a match (fuzzy_match only). String distances equal to or below the cutoff are counted as matches.

...

further arguments passed to stringdistmatrix
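
Because method and ... are handed to stringdistmatrix, it can help to look at the distance matrix directly when choosing a method or a cutoff. The lines below are a small sketch that calls stringdist itself rather than this package; the example strings and the choice of p = 0.1 (the Winkler prefix weight in stringdist) are illustrative assumptions.

```r
library(stringdist)

# Illustrative inputs, not taken from the package examples.
a <- c("apple", "banana")
b <- c("aple", "bananna", "cherry")

# Rows correspond to `a`, columns to `b`; smaller values mean closer strings.
d_jaro <- stringdistmatrix(a, b, method = "jw")

# `p` is a stringdist argument that adds the Winkler prefix adjustment.
# It is the kind of extra argument that could be supplied through `...`.
d_jw <- stringdistmatrix(a, b, method = "jw", p = 0.1)

d_jaro
d_jw
```

Distances from the "jw" method fall between 0 and 1, which is the scale the cutoff is compared against.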

Value

For fuzzy_match, a vector of nearest string matches. For strings with multiple closest matches, only the first is returned.

For fuzzy_check, a data.frame containing the source strings, their closest matches, and the string distance for each match.
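
The return values described above can be approximated with a few lines built on stringdistmatrix. This is a minimal sketch, not the package source: the helper name nearest_match is hypothetical, it assumes neither input contains NA, and returning NA for strings whose best distance exceeds the cutoff is an assumption about how non-matches are reported.

```r
library(stringdist)

# Hypothetical helper for illustration only; not the fuzzy_match implementation.
nearest_match <- function(a, b, method = "jw", cutoff = 0.5, ...) {
  # Distance matrix with one row per element of `a`, one column per element of `b`.
  d <- stringdistmatrix(a, b, method = method, ...)
  # which.min() returns the first minimum, so ties resolve to the first match.
  idx <- apply(d, 1, which.min)
  best <- d[cbind(seq_along(a), idx)]
  # Keep matches at or below the cutoff (assumption: others become NA).
  matched <- ifelse(best <= cutoff, b[idx], NA_character_)
  # A fuzzy_check-style summary: source, closest string, and its distance.
  list(match = matched,
       check = data.frame(source = a, closest = b[idx], distance = best))
}

res <- nearest_match(c("aple", "banana"), c("banana", "apple", "cherry"))
res$match
res$check
```

Inspecting the distance column first and then picking a cutoff is one reasonable way to use the two functions together.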

See Also

stringdistmatrix

Examples

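library(data.table)

# Draw a pool of 30 fruit names (requires the stringr package for its data)
set.seed(575)
fruit <- sample(stringr::fruit, 30)

# Two tables sharing blocking variables, each with 20 fruit names from the pool
DTA <- data.table(block1 = sample(LETTERS[1:4], 20, TRUE),
                  block2 = sample(LETTERS[1:4], 20, TRUE),
                  fruit   = sample(fruit, 20))

DTB <- data.table(block1 = sample(LETTERS[1:4], 20, TRUE),
                  block2 = sample(LETTERS[1:4], 20, TRUE),
                  fruit   = sample(fruit, 20))

# Inspect closest matches and their distances, then return the matches alone
fuzzy_check(DTA$fruit, DTB$fruit)
fuzzy_match(DTA$fruit, DTB$fruit)

# Key DTB so that DTB[.BY, fruit] selects the rows in the current block
setkey(DTB, block1, block2)

# Match within blocks: each group of DTA is compared only to the same block of DTB
DTA[ , fuzzy_check(fruit, b = DTB[.BY, fruit]),
     by = .(block1, block2)]
DTA[ , .(fruit,
         B_fruit = fuzzy_match(fruit, b = DTB[.BY, fruit])),
     by = .(block1, block2)]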

coletl/coler documentation built on May 12, 2021, 9:44 p.m.