fuzzy_match: fuzzy_match function


View source: R/fuzzy_match.R

Description

Joins two tables on columns whose values do not match exactly, using string distance to decide which rows match.

Usage

fuzzy_match(
  df.x,
  df.y,
  by.x,
  by.y,
  method = "jw",
  cutoff,
  join_type = "left",
  unique = F,
  match_vals = TRUE,
  sort = NULL,
  useBytes = FALSE,
  p = 0,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  bt = 0
)

Arguments

df.x

Left table to be joined

df.y

Right table to be joined

by.x

A character vector of variables to join the left table by

by.y

A character vector of variables to join the right table by

method

Method used to calculate string distance. See the help for stringdist::stringdist. Default: 'jw'

cutoff

Maximum string distance allowed for two values to count as a match; 0 requires exact matches

join_type

Type of join to perform. Accepts left, right, inner, full, semi, and anti. Default: 'left'

unique

If TRUE, only unique values are matched. Default: F

match_vals

If TRUE, adds a column showing the string distance. Default: TRUE

sort

Sorts the joined table by string distance. Accepts "desc", "asc" and NULL. Default: NULL

useBytes

If TRUE, the matching is done byte-by-byte rather than character-by-character. Default: FALSE

p

Penalty factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro distance is returned. Applies only to method='jw'. Default: 0

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'. Default: c(d = 1, i = 1, s = 1, t = 1)

q

Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'. Default: 1

bt

Winkler's boost threshold. Winkler's penalty factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0. Default: 0
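
The distance-related arguments appear to mirror those of stringdist::stringdist, so their effect can be previewed by calling that function directly. The following is a minimal sketch, assuming the stringdist package is installed. The example strings and the cutoff value are made up for illustration, and the "at most the cutoff" comparison is an assumption based on the description of cutoff, not a statement about how fuzzy_match is implemented.

```r
library(stringdist)

# Hypothetical pair of strings, chosen only for illustration
a <- "MARTHA"
b <- "MARHTA"

# Jaro distance (p = 0, the default)
d_jaro <- stringdist(a, b, method = "jw", p = 0)

# Jaro-Winkler distance: p > 0 reduces the distance for strings sharing a prefix
d_jw <- stringdist(a, b, method = "jw", p = 0.1)

# Assumed rule: a pair counts as a match when its distance is at most the cutoff
cutoff <- 0.1
is_match_jaro <- d_jaro <= cutoff
is_match_jw   <- d_jw <= cutoff

# q-gram distance with q = 1 compares single-character counts
d_qgram <- stringdist("abc", "abd", method = "qgram", q = 1)
```

Because each method produces distances on a different scale (for example, 'jw' is bounded by 1 while 'qgram' returns counts), a cutoff chosen for one method will generally not carry over to another.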

Details

Joins tables whose join columns do not match exactly. Run the clean function on the join columns before calling fuzzy_match.
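
To make the idea of joining on approximate matches concrete, here is a rough sketch of an inner-style fuzzy join built directly on stringdist::stringdistmatrix. It is not the implementation used by fuzzy_match; the toy tables, column names, and the "distance at most cutoff" rule are all assumptions made for illustration.

```r
library(stringdist)

# Hypothetical toy tables; the column names are illustrative only
left  <- data.frame(name = c("john smith", "jane doe"),
                    stringsAsFactors = FALSE)
right <- data.frame(full_name = c("jon smith", "jane do", "alex roe"),
                    stringsAsFactors = FALSE)

# Pairwise Jaro distances: rows follow left$name, columns follow right$full_name
d <- stringdistmatrix(left$name, right$full_name, method = "jw")

# Keep the pairs whose distance is at most the cutoff (assumed comparison)
cutoff <- 0.1
hits <- which(d <= cutoff, arr.ind = TRUE)

# Assemble the matched rows together with their string distance
matched <- data.frame(
  name      = left$name[hits[, "row"]],
  full_name = right$full_name[hits[, "col"]],
  dist      = d[hits],
  stringsAsFactors = FALSE
)
matched
```

The full distance matrix has one cell per pair of rows, so this approach is only practical for small tables.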

Value

Returns a data.frame containing the two input data.frames joined

See Also

unfactor, stringdist, join, arrange

Examples

## Not run: 
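# Clean the name columns in both tables before matching
congress <- clean(congress, name, selected = ",", prefixes = TRUE, suffixes = TRUE)
politwoops <- clean(politwoops, full_name, selected = ",", prefixes = TRUE, suffixes = TRUE)
# Inner join on name and full_name, keeping pairs within a string distance of 0.1
fuzzy_match(congress, politwoops, name, full_name, join_type = "inner", cutoff = .1)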

## End(Not run)

hkarp1/fuz.merge documentation built on Sept. 2, 2020, 12:05 a.m.