In hkarp1/fuz.merge: This package provides functions to implement and aid in Fuzzy Matching

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE, message = FALSE)

Often when we want to join two tables together, we don't have unique ID to join by and have to use a column they have in common, often names. A lot of the time the data is entered manually so there might be typos in the names, or it might use a different format. For example, one might include prefixes or suffixes while the other might not. To do a normal join these columns would not match. The fuz.merge package allows us to join tables based on approximate matches and provides other functions to facilitate this process to ensure matches are accurate.

The Function

The primary function of this package is fuzzy_match

  fuzzy_match(table1, table2, by1, by2, method = "jw", cutoff = .1, join_type = "inner", sort = "desc")

The function implements fuzzy matching by calculating finding the closest match between tables where the distance between the two values is less than the specified cutoff.

Data Sets

The congress data frame has information on 541 U.S. Congressmembers and was retrieved from data.world
The politwoops data frame has 1852 observations of various public officials who are being tracked by Politwoops for deleting tweets and was retrieved from propublica.org

Example: Inner Join

library(fuz.merge)
congress <- dplyr::select(congress, name, gender)
politwoops <- dplyr::select(politwoops, full_name, user_name, account_type)
congress$name <- scramble(congress$name, 2)
politwoops$full_name <- scramble(politwoops$full_name, 1)

congress[1:10,"name"]
politwoops[1:10,"full_name"]

The scramble function randomly changed characters for the names so we can test the fuzzy_match. First we want to use the clean() function on each dataset so we can minimize the differences between the two columns we are going to join by.

congress <- clean(congress, name, selected = ",", suffixes = T, prefixes = T)
congress[1:20,1]

So now we have removed the prefixes, suffixes and commas from each column of names.

df <- fuzzy_match(congress, politwoops, name, full_name, cutoff = .1, 
                  join_type = "inner", sort = "desc")
df[1:15,]

Using the sort argument sorted the table in descending order of the match_vals column so the row at top is the least accurate. Looking at the by.x and by.y column, it's clear that this is the correct merge. The probability of having a mismatch further down is low since the first row had the biggest string distance between the matching columns.

hkarp1/fuz.merge documentation built on Sept. 2, 2020, 12:05 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

hkarp1/fuz.merge
This package provides functions to implement and aid in Fuzzy Matching

In hkarp1/fuz.merge: This package provides functions to implement and aid in Fuzzy Matching

The Function

Data Sets

Example: Inner Join

R Package Documentation

Browse R Packages

We want your feedback!

hkarp1/fuz.merge This package provides functions to implement and aid in Fuzzy Matching

In hkarp1/fuz.merge: This package provides functions to implement and aid in Fuzzy Matching

The Function

Data Sets

Example: Inner Join

R Package Documentation

Browse R Packages

We want your feedback!

hkarp1/fuz.merge
This package provides functions to implement and aid in Fuzzy Matching