In tweed1e/matchtools: Tools For Matching Firms From Different Datasets

```r
knitr::opts_chunk$set(fig.width = 7, fig.height = 7, message = FALSE, warning = FALSE)
```

``` {r echo = FALSE}
library(tibble)
library(dplyr)

# Example Business Register with five firms in Dawson Creek
br <- tibble(
  name = c("A.-B. SECURITY", "Armada Security Canada", "Halfway River Safety Limited",
           "RNN Sales & Réntals", "Tim Tom Construction & Concrete"),
  address = c("Unit 212, 833 103 Ave", "9605 14 St", "801 102 Ave, Ex dock",
              "P.O. Box 143, Main Stn", "1205 116th Ave, #499"),
  postal_code = c("V1G2G2", "V1G3Y1", "V1G2B4", "V1G4E9", "V1G4P5"),
  city = rep_len("Dawson Creek", 5),
  province = rep_len("59", 5)
) %>%
  tibble::rownames_to_column(var = "id")

# Second dataset: the same firms minus one, in shuffled order
other <- br %>%
  filter(!grepl(pattern = "^Tim", x = name)) %>%
  select(-id, -city, -province) %>%
  sample_frac(size = 1)
```

This is a description of how to match firms across datasets using the ```matchtools``` package. We'll start with a list of firms, including names, addresses, and postal codes. The goal is to match that list with another list of firms. This task is variously called matching, entity resolution, document retrieval, or record linkage.

This is an example of a Business Register (BR), with some modified names and addresses taken from the [Aboriginal Business Directory](http://www.ic.gc.ca/eic/site/ccc_bt-rec_ec.nsf/eng/h_00011.html):
``` {r, echo = TRUE}
br
```

And suppose we have another dataset that we want to match to, which, luckily, seems to have most of the same firms in it:

``` {r, echo = TRUE}
other
```

We'd like to match these two datasets. 

## Steps

1. Prepare and process data using ```standardize``` and ```fix_unit_names```
2. Generate a ```tbl``` that has blocked (postal code, postal code) pairs
3. Merge the firm datasets (```br``` and ```other```, here) onto the block ```tbl``` by postal codes to get candidate matches. (Open question: does it matter which dataset is merged on which side, given that postal codes can be of different vintages in the two datasets?)
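The steps above can be sketched in base R without the package. This is only an illustration of blocking on postal-code pairs: the toy data, the `block_pairs` table, and its column names are assumptions made for this example, and the code does not call `fuzzy_block()` or `generate_matches()`.

```r
# Toy firm lists (hypothetical, already standardized)
left <- data.frame(
  name = c("armada security", "halfway river safety"),
  postal_code = c("V1G3Y1", "V1G2B4"),
  stringsAsFactors = FALSE
)
right <- data.frame(
  name = c("halfway river safety ltd", "armada security canada"),
  postal_code = c("V1G2B4", "V1G3Y1"),
  stringsAsFactors = FALSE
)

# Hypothetical block table: pairs of postal codes that are close enough
# to be compared. Here each code is only paired with itself.
block_pairs <- data.frame(
  postal_code.x = c("V1G3Y1", "V1G2B4"),
  postal_code.y = c("V1G3Y1", "V1G2B4"),
  stringsAsFactors = FALSE
)

# Merge the left firms onto the .x side, then the right firms onto the .y side
candidates <- merge(block_pairs, left,
                    by.x = "postal_code.x", by.y = "postal_code")
candidates <- merge(candidates, right,
                    by.x = "postal_code.y", by.y = "postal_code",
                    suffixes = c(".x", ".y"))
candidates[, c("name.x", "name.y", "postal_code.x", "postal_code.y")]
```

Only firms whose postal codes appear together in a block pair end up as candidate matches, which keeps the number of name and address comparisons small. A fuzzy block would add pairs of different but nearby postal codes to `block_pairs`.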


### Step 1---prepare and process the firm names and addresses using ```standardize()``` and ```fix_unit_names()```
First, make sure you load the ```matchtools``` package, then apply ```standardize()``` to the firm name.
```r
library(matchtools)
br <- br %>%
  select(name, address, postal_code) %>%
  mutate(name = standardize(name, dictionary = company_dictionary))
br
```

The name standardization takes a dictionary. The package supplies one as `company_dictionary`, but you can supply your own; it must be a tbl with two columns, `word` and `standard`. The standardization corrects the encoding, removes punctuation, accents and extra whitespace, converts everything to lowercase, and then replaces each `word` in the dictionary with its `standard` form (e.g., limited becomes ltd).
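To make those cleaning steps concrete, here is a minimal base R sketch. `standardize_sketch()` and `my_dictionary` are hypothetical names invented for this example; this is not the package's implementation, and it leaves out the encoding and accent repair that `standardize()` is described as doing.

```r
# Hypothetical dictionary in the two-column shape described above
my_dictionary <- data.frame(
  word     = c("limited", "incorporated"),
  standard = c("ltd", "inc"),
  stringsAsFactors = FALSE
)

# Rough stand-in for the cleaning steps (assumption: whole-word replacement)
standardize_sketch <- function(x, dictionary) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)               # punctuation becomes a space
  for (i in seq_len(nrow(dictionary))) {
    pattern <- paste0("\\b", dictionary$word[i], "\\b")
    x <- gsub(pattern, dictionary$standard[i], x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)                      # collapse repeated whitespace
  trimws(x)
}

standardize_sketch(c("A.-B. SECURITY", "Halfway River Safety Limited"),
                   my_dictionary)
```

Replacing punctuation with a space rather than deleting it keeps tokens such as "A.-B." from fusing into a single word, which is one possible design choice among several.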

```r
br <- br %>%
  mutate(address = address %>%
           standardize(dictionary = address_dictionary) %>%
           fix_unit_names())
br
```

The address standardization is similar, but uses `address_dictionary`, which changes things like suite to ste and street to st. The cleaning steps are the same as for names: it corrects the encoding, removes punctuation, accents and extra whitespace, and converts everything to lowercase before applying the dictionary.

Then, the `fix_unit_names()` function converts addresses with apartments/units/suites into a common format, "###-### main st", and moves extraneous explanatory text to the end. For example, two identical addresses written in different ways, "Simpsons res., Unit 212, 742 Evergreen Terrace" and "742 Evergreen Ter unit 212, c/o Marge", are converted to strings that are more comparable:

```r
"Simpsons res., Unit 212, 742 Evergreen Terrace" %>%
  standardize(dictionary = address_dictionary) %>%
  fix_unit_names()
"742 Evergreen Ter unit 212, c/o Marge" %>%
  standardize(dictionary = address_dictionary) %>%
  fix_unit_names()
```
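For readers who want to see the "###-### main st" idea in isolation, here is a rough base R sketch. `unit_first()` is a hypothetical helper written for this example, not the package's `fix_unit_names()`. It assumes lowercase, already standardized input, recognizes only the labels unit, ste and apt, and does not attempt to move explanatory text to the end.

```r
# Move a "unit 212"-style label in front of the street number (sketch only)
unit_first <- function(address) {
  unit_pattern <- "\\b(unit|ste|apt) (\\d+)\\b"
  if (!grepl(unit_pattern, address, perl = TRUE)) {
    return(address)                        # no unit label: leave unchanged
  }
  # Extract the unit number, then drop the label from the address
  unit_number <- sub(paste0(".*", unit_pattern, ".*"), "\\2", address, perl = TRUE)
  rest <- sub(unit_pattern, "", address, perl = TRUE)
  rest <- trimws(gsub("\\s+", " ", rest))
  # Prefix the first remaining number (assumed to be the street number)
  sub("\\b(\\d+)\\b", paste0(unit_number, "-\\1"), rest, perl = TRUE)
}

unit_first("unit 212 742 evergreen ter")
unit_first("742 evergreen ter unit 212")
```

Both calls produce the same string, which is the property that makes the two spellings comparable in a later matching step.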

Of course, we need to standardize the other dataset in the same way:

```r
other <- other %>%
  select(name, address, postal_code) %>%
  mutate(name = standardize(name, dictionary = company_dictionary))
other <- other %>%
  mutate(address = address %>%
           standardize(dictionary = address_dictionary) %>%
           fix_unit_names())
```

### Step 2---generate fuzzy blocked pairs of postal codes using `fuzzy_block`

```r
library(postalcodes)
# Find the postal codes that matter for these datasets.
# TODO: add checks for whether the postal codes are in the dataset (won't always be true).
# Note: c() is needed here; unique(x, y) would treat y as the `incomparables` argument.
postal_input <- unique(c(br$postal_code, other$postal_code))
block <- fuzzy_block(postal_input = postal_input, postal_coords = postalcodes::postal_coords)
block %>% arrange(postalcode.x, d)
```

Here, `d` is the distance (in km) between the centroids of the two postal codes in each pair. We take this `block` tbl to the next step.
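The vignette does not say how the centroid distance is calculated. One common choice for a distance in km between two latitude/longitude points is the haversine (great-circle) formula, sketched below. `haversine_km()` and the coordinates are hypothetical, and using this formula is an assumption rather than a description of what `fuzzy_block()` does.

```r
# Great-circle distance in km between two points given in decimal degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing `a` slightly above 1
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Two made-up centroids one degree of latitude apart
haversine_km(55, -120, 56, -120)
```

A blocking rule could then keep only the postal-code pairs whose distance falls under some threshold, so that firms with slightly different postal codes are still compared.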

### Step 3---merge firm tbls onto the postal code block to generate candidate matches

```r
generate_matches(br, other, block = block) %>%
  select(name.x, name.y, name_cos, everything())
```

Done! Fuzzy blocked candidate matches!
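The column name `name_cos` suggests a cosine similarity between the two firm names, although the vignette does not say how it is computed. As an illustration of that general idea only, the base R sketch below scores two strings by the cosine of their character bigram counts. `char_ngrams()` and `cosine_sketch()` are hypothetical helpers, and the choice of bigrams is an assumption.

```r
# Split a string into overlapping character n-grams
char_ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

# Cosine similarity between the n-gram count vectors of two strings
cosine_sketch <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  grams <- union(ga, gb)
  va <- vapply(grams, function(g) as.numeric(sum(ga == g)), numeric(1))
  vb <- vapply(grams, function(g) as.numeric(sum(gb == g)), numeric(1))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

cosine_sketch("armada security", "armada security canada")
cosine_sketch("armada security", "halfway river safety ltd")
```

Scores near 1 indicate names that share most of their bigrams, so sorting candidate matches by such a score puts the most plausible pairs first.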

tweed1e/matchtools documentation built on May 29, 2019, 10:51 a.m.
