match_names: Fuzzy matcher

Description

match_names takes two data frames, merges them on the columns named in fixed, and computes measures of agreement for the columns named in partials.

Usage

match_names(df1, df2, fixed = NA, partials = NA,
                        edits = TRUE, regex = TRUE)

Arguments

df1,df2

data frames to be merged

fixed

character vector of columns to be merged on

partials

columns to fuzzy match on

edits

if TRUE, edit distances are computed

regex

if TRUE, partial regular expression matches are returned

Details

match_names is designed to help find duplicates within a data set or matches between similar data sets. Typically you choose fixed to be the columns whose values are most likely to match exactly (such as DOB), then run the function and sort the result by one of the measures. A small edit distance or proportion indicates a likely match.
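The workflow described above (merge on an exact key, then sort by a distance measure) can be sketched in base R. This is only an illustration under assumptions: the data, the column names dob, name and name_dist, and the use of utils::adist are hypothetical and are not taken from the package source.

```r
# Hypothetical illustration in base R; not match_names' own code.
left  <- data.frame(dob  = c("1990-01-01", "1985-06-15"),
                    name = c("Jon Smith", "Ann Lee"),
                    stringsAsFactors = FALSE)
right <- data.frame(dob  = c("1990-01-01", "1985-06-15", "1990-01-01"),
                    name = c("John Smith", "Anne Leigh", "Mary Jones"),
                    stringsAsFactors = FALSE)

# Merge on the exact-match column; the suffixes keep both name columns.
pairs <- merge(left, right, by = "dob", suffixes = c("_1", "_2"))

# Element-wise Levenshtein distance between the paired names.
pairs$name_dist <- mapply(function(a, b) as.integer(utils::adist(a, b)),
                          pairs$name_1, pairs$name_2,
                          USE.NAMES = FALSE)

# Smallest distances first, so the likeliest matches rise to the top.
pairs <- pairs[order(pairs$name_dist), ]
```

Sorting ascending puts the near-identical pair first, which is where manual review would usually start.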

Value

A data frame composed of df1 and df2 merged. Additional columns may include _count columns, which are edit distances; _prop columns, which are the ratio of the edit distance to the mean number of characters; and _regex columns, which indicate whether a subset of one name matches the other.
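As a rough guide to how the three kinds of measure relate, the sketch below computes them for paired strings in base R. The helper name agreement and its definitions are assumptions made for illustration: in particular, it uses a fixed substring test in place of whatever regular expression match_names builds internally, and the package may compute these values differently.

```r
# Hypothetical helper; the name and definitions are assumptions,
# not the package's implementation.
agreement <- function(a, b) {
  # Edit (Levenshtein) distance for each pair of strings.
  count <- mapply(function(x, y) as.integer(utils::adist(x, y)),
                  a, b, USE.NAMES = FALSE)
  # Distance relative to the mean length of the two strings.
  prop <- count / ((nchar(a) + nchar(b)) / 2)
  # TRUE when either string is contained in the other.
  regex <- mapply(function(x, y) {
    grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE)
  }, a, b, USE.NAMES = FALSE)
  data.frame(count = count, prop = prop, regex = regex)
}

res <- agreement(c("kitten", "triplane"),
                 c("sitting", "double triplane"))
```

Here "triplane" is contained in "double triplane", so the containment flag is TRUE even though the edit distance is large; this is the kind of case the _regex columns are meant to surface.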

Author(s)

Sven Halvorson (svenedmail@gmail.com)

Examples

# Two small data frames sharing an exact-match key (x) and a name-like column (y)
df1 = data.frame(x = c(1,1,2,2,3),
                  y = c("tricycle","bicycle", "triplane","double triplane", "triceratops"),
                  stringsAsFactors = FALSE)
df2 = data.frame(x = c(2,3,2,1,1),
                  y = c("tritip","biceratops", "triplane", "tripline", "tricycle" ),
                  stringsAsFactors = FALSE)
# Merge on x and compute agreement measures for y
df3 = match_names(df1, df2, fixed = "x", partials = "y")

svenhalvorson/SvenSFPS documentation built on May 21, 2019, 11:42 a.m.