In tweed1e/matchtools: Tools For Matching Firms From Different Datasets

knitr::opts_chunk$set(fig.width = 7, fig.height = 5.5, message = FALSE, warning = FALSE)

``` {r echo = FALSE} set.seed(403) library(tibble) library(dplyr) library(purrr) library(fakedata)

need some matches that have the same addresses (and postal code).

br <- fakedata::fake_br() tcod <- fakedata::fake_tcod(br = br)

br <- tibble(

name = c("A.-B. SECURITY",

"Armada Security Canada",

"Halfway River Safety Limited",

"RNN Sales & Réntals",

"Tim Tom Construction & Concrete",

"All Rhodes Pilot Service",

"Canadian Quality Control Inc.",

"Falcon Contracting Ltd.",

"1 Nation Distribution",

"Blanshard Group",

"Bulkley Valley Motel Ltd."),

address = c("Unit 212, 833 103 Ave",

"9605 14 St",

"801 102 Ave",

"P.O. Box 143, Main Stn",

"1205 116th Ave, #499",

"7485 Sunhill Rd",

"801 102 Ave",

"8555 Pacific St",

"101-4442 West Saanich Rd",

"P.O. Box 143, Main Stn"),

postal_code = c("V1G2G2", "V1G3Y1", "V1G2B4", "V1G4E9", "V1G4P5",

"V2N6E7", "V1G2B4", "V1G3Y1", "V1G3Y1", "V1G4E9", "V1G4E9"),

city = rep_len("Dawson Creek", 11),

province = rep_len("59", 11)

) %>% tibble::rownames_to_column(var = "id")

tcod <- expand.grid(1:dim(br)[1], 1:dim(br)[1]) %>%

tibble::as.tibble() %>%

rename(id.x = Var1, id.y = Var2) %>%

mutate_all(as.character) %>%

filter(id.x != id.y)

tcod <- tcod %>%

left_join(br, by = c("id.x" = "id")) %>%

left_join(br, by = c("id.y" = "id"))

tcod <- tcod %>% sample_frac(size = 0.9, replace = TRUE)

now mess with the names and addresses of the tcod a little bit.

<!-- ## TODO -->

<!-- 1. Organize sections/headers better. -->
<!-- 2. Write the Appendix / official documentation, and use the same terminology from that. -->
<!-- 3. And then. -->

## Problem outline

Here we have two datasets, ```br``` (one observation has one firm) and ```tcod``` (one observation has two firms), that we'd like to match. In each dataset, there will be firm names and firm addresses. Our goal is to have a database that links each firm in the ```tcod``` to a firm in the ```br```. 

### Steps:

1. Filter and process the datasets to get comparable observations, which should be (firm, address) pairs.
2. The ```tcod``` filter produces two datasets: ```t[n,a]``` (a set with names and addresses) and ```t[a]``` (a set with addresses but no names).
3. Fuzzy block on postal codes to create ```q[n,a] = t[n,a] x br```, and ```q[a] = t[a] x br```
4. Predict matches on ```q[n,a]``` using names and addresses (call this model ```m[n,a]```)
5. Predict matches on ```q[n,a]``` using only addresses (call this model ```m[a]```)
6. Evaluate both using F1 score (combination of precision and recall)
7. Use ```m[a]``` model to predict matches on ```q[a]```
8. Merge matches from ```q[n,a]``` and ```q[a]``` back onto TCOD

### Focus

For this document, I'll focus on Steps (3-6), since I'm using fake data.

<!-- One strategy is to compile all the firm names in the ```tcod``` into one list (ignoring their status as shippers and receivers). This will work for now, but may need to change once we try to use shipment commodity information to match (since we'll need to identify whether the firm is shipping/producing or receiving/using the commodity). -->

## The data

``` {r echo = TRUE}
br
tcod %>% select(name.x, name.y, everything(), -id.x,-id.y)

Compile firm list from `tcod`

Once we get a complete list of firms t[n,a] (which I'll call t_firms) from the tcod, we can match them to the br. Since I'm blocking on postal codes, I'll need to keep this along with the firm and address pairs.

``` {r echo = TRUE} t_firms <- bind_rows( tcod %>% select(name = name.x, address = address.x, postal_code = postal_code.x), tcod %>% select(name = name.y, address = address.y, postal_code = postal_code.y)) %>% distinct() t_firms

## Standardize both firm lists, ```t[n,a]``` and ```br```

This fixes encodings, standardizes common firm and address words (e.g., "street" to "st") and rearranges apartment unit numbers to a common format.

``` {r echo = TRUE}
library(matchtools)
process <- . %>%
       select(name, address, postal_code) %>%
       mutate(name = standardize(name, dictionary = company_dictionary),
              address = address %>% 
                          standardize(dictionary = address_dictionary) %>% 
                          fix_unit_names())

br <- br %>% process()
t_firms <- t_firms %>% process()

Generate matches, `q[n,a] = t[n,a] x br`

Here, we'll skip fuzzy_block() bc generate_matches() should do it automatically, although if you do this for the whole country at once, you may want to use par_fuzzy_block() (the parallel version), or split the firms on some geographic variable and block that way (likely faster). This creates the set of candidate matches to see if we can predict which observations match.

``` {r echo = TRUE} matches <- generate_matches(br, t_firms) matches %>% select(name.x, name.y, everything())

## Ugly hand-picking matches step

(Not shown---pore over the ```matches``` dataset by hand, picking correct matches and non-matches. Add an indicator variable called ```match``` that is equal to 1 if the observation is a match, and 0 if not. Here, because of the way I set up the data, I fake it.)

``` {r echo = FALSE}
matches <- matches %>% 
  mutate(match = (near(name_cos, 0) & near(add_cos, 0)) %>% 
                    as.numeric() %>% factor(levels = c("0","1"))) %>% 
  select(match, name.x, name.y, everything())

Predict matches (models `m[n,a]` and `m[a]`)

We have a dataset with names, addresses, and matches, q[n,a]. But because they're fake and exact matches, the models won't really work. We need to add noise to the fake data first.

``` {r echo = TRUE}

only problem: they'll be exact matches because they're fake;

`mistakes` just drops alphabet characters at random from the names and addresses,

which makes the matches not exact (like real data, usually)

mistakes <- function(text) { gsub(pattern = paste0("[", paste0(sample(letters, size = sample.int(n = 10, size = 1)), collapse = ''), "]"), replacement = '', # '' means drop those characters x = text) }

get 50 copies of the data so the regressions look more like reality

matches <- bind_rows(replicate(50, matches, simplify = FALSE))

mess with the data so the glm doesn't crash

then we recalculate the cosine distance between names and addresses

matches <- matches %>% ungroup() %>% mutate(name.x = name.x %>% map(mistakes) %>% unlist(), address.x = address.x %>% map(mistakes) %>% unlist(), name.y = name.y %>% map(mistakes) %>% unlist(), address.y = address.y %>% map(mistakes) %>% unlist(), name_cos = stringdist::stringdist(name.x, name.y, method = 'cosine', q = 2, p = 0.1), add_cos = stringdist::stringdist(address.x, address.y, method = 'cosine', q = 2, p = 0.1))

That's our data. We'll need to set up some ML libraries and things to run our models:

``` {r echo = TRUE}
library(caret) # library to handle ML training/testing
library(MLmetrics) # library for Precision/Recall

# F1 <-> Metric to train the model on
f1_score <- function(data, lev = NULL, model = NULL) {
  pred <- data$pred
  obs <- data$obs
  scores(pred, obs)
}

scores <- function(pred, obs) {
  prec <- Precision(y_pred = pred, y_true = obs, positive = '1')
  rec <- Recall(y_pred = pred, y_true = obs, positive = '1')
  # BETA: increases relative importance of recall
  f1_val <- FBeta_Score(y_pred = pred, y_true = obs, positive = '1', beta = 1)
  c(F1 = f1_val, precision = prec, recall = rec)
}

# Simple; may want to use repeatedcv method in production.
fit_control <- trainControl(method = 'boot', summaryFunction = f1_score)

# glm model; logit + F1 score to evaluate
glm_fit <- function(formula, data) {
  train(formula, 
        data = data, # this argument might be complicated.
        method = 'glm',
        metric = 'F1',
        trControl = fit_control, 
        family = binomial(link = "logit"))
}

Formulas for our two models, m[n,a] and m[a]: ``` {r echo = TRUE}

Formulas

name_addr <- match ~ name_cos + add_cos addr <- match ~ add_cos

Split into training and test sets
``` {r echo = TRUE}
index <- createDataPartition(matches$match, p = 0.9, list = FALSE, times = 1)
train <- matches[index, ]
test <- matches[-index, ]

Run model `m[n,a]` (names + addresses)

``` {r echo = TRUE} name_addr_fit <- glm_fit(formula = name_addr, data = train) summary(name_addr_fit)

### Run model ```m[a]``` (addresses only)

First, run the model that predicts the LHS variable ```match```.
``` {r echo = TRUE}
addr_fit <- glm_fit(formula = addr, data = train)
summary(addr_fit)

Add predictions to the dataset

Then add the predictions to the data, as well as the probabilities of matches for each model. (I'm less interested in the exact predictions from m[n,a] right now.) ``` {r echo = TRUE} matches$pred <- predict(addr_fit, newdata = matches) # m[a] only matches <- bind_cols(matches, predict(name_addr_fit, newdata = matches, type = 'prob')) matches <- bind_cols(matches, predict(addr_fit, newdata = matches, type = 'prob'))

## Probability of match, by model (```m[n,a]``` vs. ```m[a]```) on subset ```q[n,a]```
``` {r echo = TRUE}
# Probability plot things.
g1 <- ggplot(matches, aes(x = `1`, fill = match)) + 
        geom_density(alpha = 0.35) + xlim(0,1) + 
        labs(title = 'Name+address model', 
             x = 'Predicted match probability (name/address model)', 
             y = 'Density') + 
        scale_fill_discrete(name = 'True Match', labels = c('No', 'Yes'))

g2 <- ggplot(matches, aes(x = `11`, fill = match)) + 
        geom_density(alpha = 0.35) + xlim(0,1) + 
        labs(title = 'Address-only model', 
             x = 'Predicted match probability (address-only model)',
             y = 'Density') + 
             scale_fill_discrete(name = 'True Match', labels = c('No', 'Yes'))

p <- gridExtra::grid.arrange(g1, g2, ncol=1)
p

Precision / Recall Tradeoff

Since our t[n,a] dataset sometimes has different firms with the same address, the address-only model m[a] won't always be able to distinguish between two similar candidate matches. One way to alleviate this (depending on the application), is to change the cutoff that the logistic regression uses to predict a match from 0.5 to some other number. Since we've decided we care more about reducing false negatives (reducing Type I error) than reducing false positives (reducing Type II error), we may want to reduce the cutoff from 0.5 to some other number. (To imagine the effect, look at the bottom panel of the figure above, and move a vertical line from 0.5 down to 0.25 and 0.75.)

``` {r echo = TRUE}

Cutoff thing; need to add probabilities, get results, etc.

f <- function(tbl, cutoff) { tbl %>% mutate(fit = ifelse(11 > cutoff, 1, 0)) %>% .$fit }

cutoff_scores <- function(tbl, cutoff) { # print(cutoff) scores(tbl %>% f(cutoff), tbl$match) }

s <- seq(0.25, 0.75, length.out = 50)

cut_results <- bind_cols(cutoff = s, sapply(X = s, cutoff_scores, tbl = matches) %>% t() %>% as.tibble())

cutp <- cut_results %>% mutate(text = ifelse(cutoff==0.01 | cutoff==0.49, cutoff %>% round(2), '')) %>% ggplot(aes(x = precision, y = recall, label = text)) + geom_text(vjust = 0, hjust = 0, nudge_x = -0.05, nudge_y = -0.02) + geom_point() + xlim(0,1) + ylim(0,1) + labs(x = 'Precision', y = 'Recall') cutp ```

Here, we can increase recall at the expense of precision by changing the cutoff. The exact cutoff we choose depends on the application.

tweed1e/matchtools documentation built on May 29, 2019, 10:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

tweed1e/matchtools Tools For Matching Firms From Different Datasets

In tweed1e/matchtools: Tools For Matching Firms From Different Datasets

need some matches that have the same addresses (and postal code).

br <- tibble(

name = c("A.-B. SECURITY",

"Armada Security Canada",

"Halfway River Safety Limited",

"RNN Sales & Réntals",

"Tim Tom Construction & Concrete",

"All Rhodes Pilot Service",

"Canadian Quality Control Inc.",

"Falcon Contracting Ltd.",

"1 Nation Distribution",

"Blanshard Group",

"Bulkley Valley Motel Ltd."),

address = c("Unit 212, 833 103 Ave",

"9605 14 St",

"801 102 Ave",

"P.O. Box 143, Main Stn",

"1205 116th Ave, #499",

"7485 Sunhill Rd",

"801 102 Ave",

"8555 Pacific St",

"8555 Pacific St",

"101-4442 West Saanich Rd",

"P.O. Box 143, Main Stn"),

postal_code = c("V1G2G2", "V1G3Y1", "V1G2B4", "V1G4E9", "V1G4P5",

"V2N6E7", "V1G2B4", "V1G3Y1", "V1G3Y1", "V1G4E9", "V1G4E9"),

city = rep_len("Dawson Creek", 11),

province = rep_len("59", 11)

) %>% tibble::rownames_to_column(var = "id")

tcod <- expand.grid(1:dim(br)[1], 1:dim(br)[1]) %>%

tibble::as.tibble() %>%

rename(id.x = Var1, id.y = Var2) %>%

mutate_all(as.character) %>%

filter(id.x != id.y)

tcod <- tcod %>%

left_join(br, by = c("id.x" = "id")) %>%

left_join(br, by = c("id.y" = "id"))

tcod <- tcod %>% sample_frac(size = 0.9, replace = TRUE)

now mess with the names and addresses of the tcod a little bit.

Compile firm list from tcod

Generate matches, q[n,a] = t[n,a] x br

Predict matches (models m[n,a] and m[a])

only problem: they'll be exact matches because they're fake;

mistakes just drops alphabet characters at random from the names and addresses,

which makes the matches not exact (like real data, usually)

get 50 copies of the data so the regressions look more like reality

mess with the data so the glm doesn't crash

then we recalculate the cosine distance between names and addresses

Formulas

Run model m[n,a] (names + addresses)

Add predictions to the dataset

Precision / Recall Tradeoff

Cutoff thing; need to add probabilities, get results, etc.

R Package Documentation

Browse R Packages

We want your feedback!

tweed1e/matchtools
Tools For Matching Firms From Different Datasets

Compile firm list from `tcod`

Generate matches, `q[n,a] = t[n,a] x br`

Predict matches (models `m[n,a]` and `m[a]`)

`mistakes` just drops alphabet characters at random from the names and addresses,

Run model `m[n,a]` (names + addresses)