fuzzy_match: Link data.tables by fuzzy string matching


View source: R/fuzzy_match.R

Description

For each value in a column of a source data.table, finds the closest string match in a column of a target data.table. The default method computes Jaro-Winkler string distances using the stringdist package. When several target values tie for the closest match, only the first is reported.
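The matching step described above can be sketched directly with stringdist. This is a minimal illustration, not the package's actual implementation: the vectors `src` and `tgt` and the result name `closest` are hypothetical, and plain character vectors stand in for data.table columns.

```r
library(stringdist)

# Hypothetical inputs, not taken from the package
src <- c("aple", "bananna", "cherry")
tgt <- c("apple", "banana", "cherry", "cherry")

# Distance matrix: rows correspond to src, columns to tgt
d <- stringdistmatrix(src, tgt, method = "jw")

# which.min() returns the index of the first minimum in each row,
# so ties between equally close targets resolve to the first one
idx <- apply(d, 1, which.min)

closest <- data.frame(source   = src,
                      match    = tgt[idx],
                      distance = d[cbind(seq_along(src), idx)],
                      stringsAsFactors = FALSE)
closest
```

Taking the row-wise minimum of the full distance matrix is simple but compares every source value with every target value, which is one reason to restrict comparisons to blocks when the tables are large.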

Usage

fuzzy_match(a, b, acol, bcol, method = "jw", ...)

Arguments

a

a source data.table

b

a target data.table

acol

column name in a to use for matching

bcol

column name in b to use for matching

method

string distance method passed to stringdistmatrix; defaults to "jw"
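To give a feel for how the choice of method changes the distances, the sketch below calls stringdist directly on hypothetical strings. Note that in stringdist, method = "jw" with the default p = 0 yields the plain Jaro distance, and the Winkler prefix adjustment applies only when p > 0. Whether fuzzy_match forwards extra arguments such as p through `...` is an assumption that this page does not confirm.

```r
library(stringdist)

# Levenshtein distance: one insertion turns "aple" into "apple"
lv <- stringdist("aple", "apple", method = "lv")

# Jaro distance (stringdist's default penalty factor, p = 0)
jaro <- stringdist("martha", "marhta", method = "jw")

# Jaro-Winkler distance: p > 0 reduces the distance for a shared prefix
jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

c(lv = lv, jaro = jaro, jw = jw)
```

Edit-based methods such as "lv" return counts of edits, while "jw" returns a value between 0 and 1, so distances from different methods are not directly comparable.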

Value

a data.table containing any blocking columns, the source column, the closest match in the target column, and the string distance for that match.

See Also

stringdistmatrix

Examples

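library(data.table)

# Simulate two tables with two blocking columns and a fruit name to match on
set.seed(575)
DTA <- data.table(block1 = sample(LETTERS[1:4], 20, TRUE),
                  block2 = sample(LETTERS[1:4], 20, TRUE),
                  fruit  = sample(stringr::fruit[1:12], 20, TRUE))

DTB <- data.table(block1 = sample(LETTERS[1:4], 20, TRUE),
                  block2 = sample(LETTERS[1:4], 20, TRUE),
                  fruit  = sample(stringr::fruit[1:12], 20, TRUE))

# Match every fruit in DTA against every fruit in DTB
fuzzy_match(DTA, DTB, "fruit", "fruit")

# Key both tables on the blocking columns so DTB[.BY] selects the matching block
setkey(DTA, block1, block2)
setkey(DTB, block1, block2)

# Match within blocks: each group of DTA is compared only with the rows of DTB
# that share the same block1 and block2 values
DTA[ , fuzzy_match(.SD, b = DTB[.BY], "fruit", "fruit"),
       by = .(block1, block2)]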

coletl/easyr documentation built on June 10, 2020, 4:58 p.m.