``` {r echo = FALSE}
knitr::opts_chunk$set(fig.width = 7, fig.height = 7, message = FALSE, warning = FALSE)

library(tibble)
library(dplyr)

br <- tibble(
  name = c("A.-B. SECURITY", "Armada Security Canada", "Halfway River Safety Limited",
           "RNN Sales & Réntals", "Tim Tom Construction & Concrete"),
  address = c("Unit 212, 833 103 Ave", "9605 14 St", "801 102 Ave, Ex dock",
              "P.O. Box 143, Main Stn", "1205 116th Ave, #499"),
  postal_code = c("V1G2G2", "V1G3Y1", "V1G2B4", "V1G4E9", "V1G4P5"),
  city = rep_len("Dawson Creek", 5),
  province = rep_len("59", 5)
) %>%
  tibble::rownames_to_column(var = "id")

other <- br %>%
  filter(!grepl(pattern = "^Tim", x = name)) %>%
  select(-id, -city, -province) %>%
  sample_frac(size = 1)
```
This document describes how to match records using the ```matchtools``` package. We'll start with a list of firms, including names, addresses, and postal codes. The goal is to match that list with another list of firms. This task is called matching, entity resolution, document retrieval, or record linkage.

Here is an example of a Business Register (BR), with some modified names and addresses taken from the [Aboriginal Business Directory](http://www.ic.gc.ca/eic/site/ccc_bt-rec_ec.nsf/eng/h_00011.html):

``` {r, echo = TRUE}
br
```
Suppose we have another dataset that we want to match to, which, luckily, seems to have most of the same firms in it:

``` {r, echo = TRUE}
other
```
We'd like to match these two datasets.

## Steps

1. Prepare and process data using ```standardize``` and ```fix_unit_names```
2. Generate a ```tbl``` that has blocked (postal code, postal code) pairs
3. Merge the firm datasets (```br``` and ```other```, here) onto the block ```tbl``` by postal codes to get candidate matches (open question: does it matter which side is merged first, given that postal codes can be of different vintages in the two datasets?)

### Step 1: prepare and process the firm names and addresses using ```standardize()``` and ```fix_unit_names()```

First, make sure you load the ```matchtools``` package, then apply ```standardize()``` to the firm name.

```r
library(matchtools)

br <- br %>%
  select(name, address, postal_code) %>%
  mutate(name = standardize(name, dictionary = company_dictionary))
br
```
The name standardization takes a dictionary (supplied in the package as ```company_dictionary```, but you can supply your own: it must be a ```tbl``` with two columns, ```word``` and ```standard```). The standardization corrects the encoding, removes punctuation, accents, and extra whitespace, converts everything to lowercase, and then replaces each word found in the dictionary with its standard form (e.g., ```limited``` becomes ```ltd```).
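The cleaning steps described above can be sketched in base R. This is only an illustration of the idea, not the ```matchtools``` implementation: ```standardize_sketch()``` and ```toy_dictionary``` are hypothetical names, the encoding and accent handling are left out, and the dictionary holds just two made-up entries in the ```word```/```standard``` layout.

```r
# Hypothetical helper: NOT the matchtools implementation.
# Encoding repair and accent removal are deliberately omitted.
standardize_sketch <- function(x, dictionary) {
  x <- tolower(x)
  # Replace punctuation with spaces, then collapse runs of whitespace.
  x <- gsub("[[:punct:]]+", " ", x)
  x <- trimws(gsub("\\s+", " ", x))
  # Swap each whole word for its standard form when the dictionary lists it.
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    hit <- match(words, dictionary$word)
    words[!is.na(hit)] <- dictionary$standard[hit[!is.na(hit)]]
    paste(words, collapse = " ")
  }, character(1))
}

# Made-up two-row dictionary with the same column names as the text describes.
toy_dictionary <- data.frame(
  word = c("limited", "incorporated"),
  standard = c("ltd", "inc"),
  stringsAsFactors = FALSE
)

standardize_sketch(c("Halfway River Safety Limited", "A.-B.  SECURITY"), toy_dictionary)
```

Matching on whole words rather than substrings keeps a dictionary entry such as ```limited``` from rewriting part of a longer word.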
```r
br <- br %>%
  mutate(address = address %>%
           standardize(dictionary = address_dictionary) %>%
           fix_unit_names())
br
```
The address standardization is similar, with a similar dictionary, ```address_dictionary``` (but this time it changes things like ```suite``` to ```ste``` and ```street``` to ```st```). It applies the same cleaning steps as the name standardization: it corrects the encoding, removes punctuation, accents, and extra whitespace, and converts everything to lowercase before applying the dictionary.
Then, the ```fix_unit_names()``` function converts addresses with apartments, units, or suites into a common format, "###-### main st", and moves extraneous explanatory text to the end. For example, two identical addresses written in different ways, "Simpsons res., Unit 212, 742 Evergreen Terrace" and "742 Evergreen Ter unit 212, c/o Marge", are converted to strings that are more comparable:
```r
"Simpsons res., Unit 212, 742 Evergreen Terrace" %>%
  standardize(dictionary = address_dictionary) %>%
  fix_unit_names()

"742 Evergreen Ter unit 212, c/o Marge" %>%
  standardize(dictionary = address_dictionary) %>%
  fix_unit_names()
```
Of course, we need to standardize the other dataset as well:
```r
other <- other %>%
  select(name, address, postal_code) %>%
  mutate(name = standardize(name, dictionary = company_dictionary))

other <- other %>%
  mutate(address = address %>%
           standardize(dictionary = address_dictionary) %>%
           fix_unit_names())
```
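To make the unit reformatting from this step more concrete, here is a rough regular-expression sketch of the "###-### main st" idea. It is not how ```fix_unit_names()``` is actually written: ```fix_unit_sketch()``` is a hypothetical helper, it only recognizes the keyword ```unit```, and it assumes its input has already been lowercased, stripped of punctuation, and run through the address dictionary.

```r
# Hypothetical helper: NOT the matchtools implementation.
# Assumes lowercase, punctuation-free input and only handles the word "unit".
fix_unit_sketch <- function(x) {
  # Case A: unit number comes before the street number, maybe after extra text.
  unit_first <- "^(.*?)\\s?\\bunit (\\d+) (\\d+) (.+)$"
  # Case B: unit number comes after the street name, maybe before extra text.
  unit_last <- "^(\\d+) (.+?) unit (\\d+)(.*)$"

  is_a <- grepl(unit_first, x, perl = TRUE)
  is_b <- !is_a & grepl(unit_last, x, perl = TRUE)

  # Rebuild as "<unit>-<street number> <street>", pushing extra text to the end.
  x[is_a] <- sub(unit_first, "\\2-\\3 \\4 \\1", x[is_a], perl = TRUE)
  x[is_b] <- sub(unit_last, "\\3-\\1 \\2\\4", x[is_b], perl = TRUE)
  trimws(x)
}

fix_unit_sketch(c(
  "simpsons res unit 212 742 evergreen ter",
  "742 evergreen ter unit 212 c o marge"
))
```

Both results start with ```212-742 evergreen ter```, so a string comparison sees the shared street address first and the differing extra text last.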
### Step 2: generate blocked postal code pairs using ```fuzzy_block()```
```r
library(postalcodes)

# Find the postal codes that matter for these datasets.
# (Later, add checks for whether the postal codes are in the dataset; that won't always be true.)
# c() combines both columns; unique(x, y) would treat y as `incomparables`.
postal_input <- unique(c(br$postal_code, other$postal_code))

block <- fuzzy_block(postal_input = postal_input, postal_coords = postalcodes::postal_coords)
block %>% arrange(postalcode.x, d)
```
```d``` is the distance (in km) between the centroids of the postal codes. We take this block ```tbl``` to the next step.
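For intuition about distance-based blocking, the sketch below pairs up postal codes whose centroids lie within a cutoff of each other. Everything in it is an assumption made for illustration: the placeholder codes and coordinates are invented, ```haversine_km()``` and ```block_sketch()``` are hypothetical helpers, and whether ```fuzzy_block()``` uses a great-circle distance or a fixed cutoff is not stated here.

```r
# Made-up centroids for three placeholder postal codes (not real data).
coords <- data.frame(
  postalcode = c("P1", "P2", "P3"),
  lat = c(55.76, 55.77, 56.25),
  lon = c(-120.24, -120.24, -120.85),
  stringsAsFactors = FALSE
)

# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: keep every (postal code, postal code) pair within max_km.
block_sketch <- function(coords, max_km) {
  # by = NULL gives a cross join; shared column names gain .x / .y suffixes.
  pairs <- merge(coords, coords, by = NULL)
  pairs$d <- haversine_km(pairs$lat.x, pairs$lon.x, pairs$lat.y, pairs$lon.y)
  pairs <- pairs[pairs$d <= max_km, c("postalcode.x", "postalcode.y", "d")]
  pairs[order(pairs$postalcode.x, pairs$d), ]
}

block_toy <- block_sketch(coords, max_km = 5)
block_toy
```

Each code pairs with itself at distance zero, so exact postal code matches survive alongside near neighbours. The cross join grows with the square of the number of codes, which is fine for a toy table but would need a smarter spatial lookup at scale.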
```r
generate_matches(br, other, block = block) %>%
  select(name.x, name.y, name_cos, everything())
```
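As a rough picture of how candidate matches could be built from a block table, the sketch below joins two toy datasets through the block pairs and scores each candidate with a cosine similarity over character bigrams. This is an assumption-laden illustration, not a description of ```generate_matches()```: the toy data, the ```bigram_cosine()``` helper, and the choice of bigrams are all hypothetical, and the real ```name_cos``` column may be computed differently.

```r
# Toy inputs; all names, codes, and block pairs are illustrative.
br_toy <- data.frame(
  name = c("armada security canada", "halfway river safety ltd"),
  postal_code = c("P1", "P3"),
  stringsAsFactors = FALSE
)
other_toy <- data.frame(
  name = c("halfway river safety", "armada security"),
  postal_code = c("P3", "P2"),
  stringsAsFactors = FALSE
)
block_toy <- data.frame(
  postalcode.x = c("P1", "P1", "P2", "P2", "P3"),
  postalcode.y = c("P1", "P2", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cosine similarity between character-bigram counts.
bigram_cosine <- function(a, b) {
  bigrams <- function(s) {
    chars <- strsplit(s, "")[[1]]
    if (length(chars) < 2) return(character(0))
    paste0(chars[-length(chars)], chars[-1])
  }
  vocab <- union(bigrams(a), bigrams(b))
  va <- as.numeric(table(factor(bigrams(a), levels = vocab)))
  vb <- as.numeric(table(factor(bigrams(b), levels = vocab)))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) 0 else sum(va * vb) / denom
}

# Attach the first dataset to the left side of each block pair,
# then the second dataset to the right side.
left <- merge(br_toy, block_toy, by.x = "postal_code", by.y = "postalcode.x")
candidates <- merge(left, other_toy,
                    by.x = "postalcode.y", by.y = "postal_code",
                    suffixes = c(".x", ".y"))

# Score each candidate pair by name similarity and show the best first.
candidates$name_cos <- mapply(bigram_cosine, candidates$name.x, candidates$name.y,
                              USE.NAMES = FALSE)
candidates[order(-candidates$name_cos), c("name.x", "name.y", "name_cos")]
```

In this toy run the two Armada records are paired even though their postal codes differ, because the block table lists ```P1``` and ```P2``` as neighbours; an exact join on postal code would have missed them.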
Done! Fuzzy blocked candidate matches!