knitr::opts_chunk$set(fig.width = 7, fig.height = 5.5, message = FALSE, warning = FALSE)
``` {r echo = FALSE} set.seed(403) library(tibble) library(dplyr) library(purrr) library(fakedata)
br <- fakedata::fake_br() tcod <- fakedata::fake_tcod(br = br)
<!-- ## TODO --> <!-- 1. Organize sections/headers better. --> <!-- 2. Write the Appendix / official documentation, and use the same terminology from that. --> <!-- 3. And then. --> ## Problem outline Here we have two datasets, ```br``` (one observation has one firm) and ```tcod``` (one observation has two firms), that we'd like to match. In each dataset, there will be firm names and firm addresses. Our goal is to have a database that links each firm in the ```tcod``` to a firm in the ```br```. ### Steps: 1. Filter and process the datasets to get comparable observations, which should be (firm, address) pairs. 2. The ```tcod``` filter produces two datasets: ```t[n,a]``` (a set with names and addresses) and ```t[a]``` (a set with addresses but no names). 3. Fuzzy block on postal codes to create ```q[n,a] = t[n,a] x br```, and ```q[a] = t[a] x br``` 4. Predict matches on ```q[n,a]``` using names and addresses (call this model ```m[n,a]```) 5. Predict matches on ```q[n,a]``` using only addresses (call this model ```m[a]```) 6. Evaluate both using F1 score (combination of precision and recall) 7. Use ```m[a]``` model to predict matches on ```q[a]``` 8. Merge matches from ```q[n,a]``` and ```q[a]``` back onto TCOD ### Focus For this document, I'll focus on Steps (3-6), since I'm using fake data. <!-- One strategy is to compile all the firm names in the ```tcod``` into one list (ignoring their status as shippers and receivers). This will work for now, but may need to change once we try to use shipment commodity information to match (since we'll need to identify whether the firm is shipping/producing or receiving/using the commodity). --> ## The data ``` {r echo = TRUE} br tcod %>% select(name.x, name.y, everything(), -id.x,-id.y)
tcod
Once we get a complete list of firms t[n,a]
(which I'll call t_firms
) from the tcod
, we can match them to the br
. Since I'm blocking on postal codes, I'll need to keep this along with the firm and address pairs.
``` {r echo = TRUE} t_firms <- bind_rows( tcod %>% select(name = name.x, address = address.x, postal_code = postal_code.x), tcod %>% select(name = name.y, address = address.y, postal_code = postal_code.y)) %>% distinct() t_firms
## Standardize both firm lists, ```t[n,a]``` and ```br``` This fixes encodings, standardizes common firm and address words (e.g., "street" to "st") and rearranges apartment unit numbers to a common format. ``` {r echo = TRUE} library(matchtools) process <- . %>% select(name, address, postal_code) %>% mutate(name = standardize(name, dictionary = company_dictionary), address = address %>% standardize(dictionary = address_dictionary) %>% fix_unit_names()) br <- br %>% process() t_firms <- t_firms %>% process()
q[n,a] = t[n,a] x br
Here, we'll skip fuzzy_block()
bc generate_matches()
should do it automatically, although if you do this for the whole country at once, you may want to use par_fuzzy_block()
(the parallel version), or split the firms on some geographic variable and block that way (likely faster). This creates the set of candidate matches to see if we can predict which observations match.
``` {r echo = TRUE} matches <- generate_matches(br, t_firms) matches %>% select(name.x, name.y, everything())
## Ugly hand-picking matches step (Not shown---pore over the ```matches``` dataset by hand, picking correct matches and non-matches. Add an indicator variable called ```match``` that is equal to 1 if the observation is a match, and 0 if not. Here, because of the way I set up the data, I fake it.) ``` {r echo = FALSE} matches <- matches %>% mutate(match = (near(name_cos, 0) & near(add_cos, 0)) %>% as.numeric() %>% factor(levels = c("0","1"))) %>% select(match, name.x, name.y, everything())
m[n,a]
and m[a]
)We have a dataset with names, addresses, and matches, q[n,a]
. But because they're fake and exact matches, the models won't really work. We need to add noise to the fake data first.
``` {r echo = TRUE}
mistakes
just drops alphabet characters at random from the names and addresses,mistakes <- function(text) { gsub(pattern = paste0("[", paste0(sample(letters, size = sample.int(n = 10, size = 1)), collapse = ''), "]"), replacement = '', # '' means drop those characters x = text) }
matches <- bind_rows(replicate(50, matches, simplify = FALSE))
matches <- matches %>% ungroup() %>% mutate(name.x = name.x %>% map(mistakes) %>% unlist(), address.x = address.x %>% map(mistakes) %>% unlist(), name.y = name.y %>% map(mistakes) %>% unlist(), address.y = address.y %>% map(mistakes) %>% unlist(), name_cos = stringdist::stringdist(name.x, name.y, method = 'cosine', q = 2, p = 0.1), add_cos = stringdist::stringdist(address.x, address.y, method = 'cosine', q = 2, p = 0.1))
That's our data. We'll need to set up some ML libraries and things to run our models: ``` {r echo = TRUE} library(caret) # library to handle ML training/testing library(MLmetrics) # library for Precision/Recall # F1 <-> Metric to train the model on f1_score <- function(data, lev = NULL, model = NULL) { pred <- data$pred obs <- data$obs scores(pred, obs) } scores <- function(pred, obs) { prec <- Precision(y_pred = pred, y_true = obs, positive = '1') rec <- Recall(y_pred = pred, y_true = obs, positive = '1') # BETA: increases relative importance of recall f1_val <- FBeta_Score(y_pred = pred, y_true = obs, positive = '1', beta = 1) c(F1 = f1_val, precision = prec, recall = rec) } # Simple; may want to use repeatedcv method in production. fit_control <- trainControl(method = 'boot', summaryFunction = f1_score) # glm model; logit + F1 score to evaluate glm_fit <- function(formula, data) { train(formula, data = data, # this argument might be complicated. method = 'glm', metric = 'F1', trControl = fit_control, family = binomial(link = "logit")) }
Formulas for our two models, m[n,a]
and m[a]
:
``` {r echo = TRUE}
name_addr <- match ~ name_cos + add_cos addr <- match ~ add_cos
Split into training and test sets ``` {r echo = TRUE} index <- createDataPartition(matches$match, p = 0.9, list = FALSE, times = 1) train <- matches[index, ] test <- matches[-index, ]
m[n,a]
(names + addresses)``` {r echo = TRUE} name_addr_fit <- glm_fit(formula = name_addr, data = train) summary(name_addr_fit)
### Run model ```m[a]``` (addresses only) First, run the model that predicts the LHS variable ```match```. ``` {r echo = TRUE} addr_fit <- glm_fit(formula = addr, data = train) summary(addr_fit)
Then add the predictions to the data, as well as the probabilities of matches for each model. (I'm less interested in the exact predictions from m[n,a]
right now.)
``` {r echo = TRUE}
matches$pred <- predict(addr_fit, newdata = matches) # m[a] only
matches <- bind_cols(matches, predict(name_addr_fit, newdata = matches, type = 'prob'))
matches <- bind_cols(matches, predict(addr_fit, newdata = matches, type = 'prob'))
## Probability of match, by model (```m[n,a]``` vs. ```m[a]```) on subset ```q[n,a]``` ``` {r echo = TRUE} # Probability plot things. g1 <- ggplot(matches, aes(x = `1`, fill = match)) + geom_density(alpha = 0.35) + xlim(0,1) + labs(title = 'Name+address model', x = 'Predicted match probability (name/address model)', y = 'Density') + scale_fill_discrete(name = 'True Match', labels = c('No', 'Yes')) g2 <- ggplot(matches, aes(x = `11`, fill = match)) + geom_density(alpha = 0.35) + xlim(0,1) + labs(title = 'Address-only model', x = 'Predicted match probability (address-only model)', y = 'Density') + scale_fill_discrete(name = 'True Match', labels = c('No', 'Yes')) p <- gridExtra::grid.arrange(g1, g2, ncol=1) p
Since our t[n,a]
dataset sometimes has different firms with the same address, the address-only model m[a]
won't always be able to distinguish between two similar candidate matches. One way to alleviate this (depending on the application), is to change the cutoff that the logistic regression uses to predict a match from 0.5 to some other number. Since we've decided we care more about reducing false negatives (reducing Type I error) than reducing false positives (reducing Type II error), we may want to reduce the cutoff from 0.5 to some other number. (To imagine the effect, look at the bottom panel of the figure above, and move a vertical line from 0.5 down to 0.25 and 0.75.)
``` {r echo = TRUE}
f <- function(tbl, cutoff) {
tbl %>% mutate(fit = ifelse(11
> cutoff, 1, 0)) %>% .$fit
}
cutoff_scores <- function(tbl, cutoff) { # print(cutoff) scores(tbl %>% f(cutoff), tbl$match) }
s <- seq(0.25, 0.75, length.out = 50)
cut_results <- bind_cols(cutoff = s, sapply(X = s, cutoff_scores, tbl = matches) %>% t() %>% as.tibble())
cutp <- cut_results %>% mutate(text = ifelse(cutoff==0.01 | cutoff==0.49, cutoff %>% round(2), '')) %>% ggplot(aes(x = precision, y = recall, label = text)) + geom_text(vjust = 0, hjust = 0, nudge_x = -0.05, nudge_y = -0.02) + geom_point() + xlim(0,1) + ylim(0,1) + labs(x = 'Precision', y = 'Recall') cutp ```
Here, we can increase recall at the expense of precision by changing the cutoff. The exact cutoff we choose depends on the application.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.