knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(Rmatch)
library(scales); library(kableExtra); library(tidyverse); library(stringi)
show_table <- function(.tab, .align = "l") {
  .tab %>%
    kableExtra::kbl(escape = FALSE, align = .align) %>%
    kableExtra::kable_paper() %>%
    kableExtra::kable_styling(
      font_size = 14,
      html_font = "Times New Roman",
      bootstrap_options = "condensed"
    )
}
For the basic usage we need two dataframes.
A Source Dataframe: The dataframe you want to match
A Target Dataframe: The dataframe you want to match your source dataframe against
As an example we will use two built-in example tables (i.e. [table_source]{.ul} and [table_target]{.ul}), which contain firm names and other identifying information such as country, city, and address.
But before we start, let's have a look at those tables and check whether our dataframes are in the right shape. We can use the function check_data() for this.
Source Dataframe
show_table(head(table_source))
Target Dataframe
show_table(head(table_target))
Check Data

There are a couple of important requirements before you are able to match using this package.
Source and target dataframe need a column called id, and this id column MUST BE UNIQUE
Source and target dataframe need to have additional overlapping column names (those are the candidate columns for matching)
The columns you want to match on should be unique in some combination. E.g. if you want to match on a name and a city, at least the combination of those two variables must be unique.
For our example dataframes all these requirements are fulfilled.
check_data(table_source, table_target)
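If your own data does not come with a unique id column, you can simply create one before matching. The snippet below is only a sketch; my_source is a hypothetical dataframe and the id format is an arbitrary choice.

# hypothetical example: add a unique character id to a dataframe that lacks one
my_source <- my_source %>%
  dplyr::mutate(id = as.character(dplyr::row_number()))
# the id column must not contain duplicates
stopifnot(!any(duplicated(my_source$id)))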
But before we start matching, it is always advisable to perform some string standardization on the variables you want to match on (i.e. ensure consistent upper/lower case, remove punctuation, ...). You can use the built-in function standardize_data() for that purpose.
standardize_data() takes 3 arguments
.data: The dataframe you want to standardize
.cols: The columns you want to standardize
.fun: A function to standardize strings. You can leave it NULL; then the built-in function standardize_str() is used.
match_cols <- c("name", "iso3", "city", "address")
tab_source <- standardize_data(table_source, match_cols)
tab_target <- standardize_data(table_target, match_cols)
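Assuming standardize_str() is exported by the package, you can also call it directly on a few example strings to get a feeling for what the default standardization does (the exact output depends on its implementation):

# illustration only: inspect the default string standardization
standardize_str(c("  Acme Corp. ", "ACME   CORP", "Müller & Söhne GmbH."))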
As mentioned you could also use a custom function.
.tmp <- standardize_data(table_source, match_cols, .fun = tolower)
show_table(head(.tmp))
After standardizing I'd recommend that you check your data again.
check_data(tab_source, tab_target)
Now that we have ensured that our data is reasonably standardized and in the right shape, we can start matching. We will use the function match_data().
match_data() takes several (partly optional) arguments:
.source: Source dataframe
.target: Target Dataframe
.cols: The columns you want to match on
.join: If not NULL, data will be left-joined on those columns first, and matches (both in source and target) will be removed from the fuzzy matching.
.must_match: Columns that MUST match. This argument is rather important for two reasons.
It prevents the function from performing operations on unnecessary columns (your data is exactly matched upfront on those columns, so we don't need to compute similarities for them)
It eases the memory requirements for really large matching tables (more on that later)
.max_match: Maximum number of matches you want to retrieve
.min_sim: Minimum similarity for your chosen method
.method: Metric used for matching (see stringdist-metrics {stringdist})
.chunk: You can chunk your source dataframe in case you run into memory issues (see the sketch after this list)
.progress: You can show a progress bar (might be useful if your data is really big)
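Purely as an illustration (not run here), the optional arguments could be combined as follows; the parameter values are arbitrary and should be adapted to your data.

# illustrative sketch only -- values are arbitrary
tab_match_alt <- match_data(
  .source = tab_source,
  .target = tab_target,
  .cols = match_cols,
  .join = c("name", "iso3"),  # exact left join on these columns first
  .must_match = "iso3",       # only compare rows with identical iso3
  .max_match = 10,            # keep at most 10 candidates per source row
  .min_sim = 0.5,             # drop candidates below this similarity
  .method = "jw",             # Jaro-Winkler (see stringdist-metrics)
  .chunk = 1000,              # chunk the source data (see ?match_data for the exact semantics)
  .progress = TRUE
)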
For our data we choose the following parameters
must <- "iso3" tab_match <- match_data( .source = tab_source, .target = tab_target, .cols = match_cols, .must_match = "iso3", .progress = TRUE, .max_match = 100 )
Let's have a look at the resulting dataframe.
show_table(head(tab_match))
For each of your matching columns (i.e. name, city, iso3, address) we get a similarity score defined by the chosen method. The scores for the individual columns always have the prefix 'sim_'. The resulting matches are in an n:m format, meaning one row of your source dataframe can be matched to multiple rows of your target dataframe and vice versa.
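You can see this n:m structure directly in the number of rows: the raw match table is usually much larger than the source table, because every source row can bring back up to .max_match candidates.

# the raw match table typically has many more rows than the source data
nrow(tab_source)
nrow(tab_match)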
So in the next step we will deduplicate the data. But before we can do that, we need a strategy to select the best matches. We will use the built-in function scores_data() for this purpose.
tab_score <- scores_data(
  .matches = tab_match,
  .source = tab_source,
  .target = tab_target,
  .must_match = must,
  .w_custom = c(name = .5, city = .1, address = .4)
)
show_table(head(tab_score))
Now we deduplicate the data using the scores we just retrieved. For comparison, we also include a version that only performs an exact join on name and iso3.
tab_simple <- dedup_data(tab_score, tab_source, tab_target, "mean_simple")
tab_weight <- dedup_data(tab_score, tab_source, tab_target, "mean_weight")
tab_custom <- dedup_data(tab_score, tab_source, tab_target, "mean_custom")
tab_merged <- left_join(
  tab_source, tab_target,
  by = c("name", "iso3"),
  suffix = c("_s", "_t")
) %>%
  mutate(score = 1)
Let's verify that the data is now in a 1:1 format.
verify <- function(.tab) {
  tab_ <- dplyr::filter(.tab, !is.na(id_s), !is.na(id_t))
  cat(
    paste0("Duplicated Source IDs: ", any(duplicated(tab_[["id_s"]]))),
    paste0("Duplicated Target IDs: ", any(duplicated(tab_[["id_t"]]))),
    sep = "\n"
  )
}
verify(tab_simple)
verify(tab_weight)
verify(tab_custom)
verify(tab_merged)
Finally, let's evaluate how well the matching worked. You probably don't have an already matched sample yourself (I guess that's the whole purpose of this library), but I included one as the built-in dataframe table_matches.
Let's write a quick function to evaluate the matches.
.matches  <- tab_simple
.training <- table_matches
.name     <- "Simple"
.min      <- 0

# helper: compare a deduplicated match table against the hand-matched sample
get_accuracy <- function(.matches, .training, .name, .min = 0) {
  tab_ <- .matches %>%
    # select(id_s, id_t, score) %>%
    mutate(
      id_t  = if_else(score < .min, NA_character_, id_t),
      score = if_else(score < .min, NA_real_, score),
    ) %>%
    inner_join(select(.training, id_s, id = id_t), by = c("id_s")) %>%
    mutate(match = case_when(
      id_t == id & !is.na(id_t) ~ "c",
      id_t != id & !is.na(id_t) ~ "i",
      TRUE ~ "n"
    )) %>%
    summarise(
      Name = .name,
      `N (Match)` = sum(match %in% c("c", "i")),
      `Correct (%)` = sum(match == "c") / `N (Match)`,
      `Incorrect (%)` = sum(match == "i") / `N (Match)`,
      `N (Total)` = n(),
      `Non-Match (%)` = sum(match == "n") / `N (Total)`,
    )
}
tab_accuracy <- bind_rows(
  get_accuracy(tab_simple, table_matches, "Simple"),
  get_accuracy(tab_weight, table_matches, "Weight"),
  get_accuracy(tab_custom, table_matches, "Custom"),
  get_accuracy(tab_merged, table_matches, "Merged"),
) %>%
  arrange(desc(`N (Match)`)) %>%
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), 0, as.numeric(.))),
    across(matches("\\(%\\)"), ~ percent(., .01)),
    across(where(is.numeric), ~ comma(., 1)),
  )
show_table(tab_accuracy, .align = "c")
With the Unweighted Squared Mean we get a matching accuracy of 89.69%, which is not bad I guess. But 5.45% of your data is incorrectly matched, and 5.36% of the data is not matched. Depending on your use case the unmatched data is less critical, but incorrectly matched data might harm your analysis.
Let's quickly check how this develops if we set the threshold higher.
get_accuracy_threshold <- function(.matches, .training, .name) {
  map_dfr(
    .x = set_names(seq(0, .95, .05), seq(0, .95, .05)),
    .f = ~ get_accuracy(.matches, .training, .name, .x),
    .id = "min"
  ) %>%
    mutate(min = as.numeric(min))
}
tab_threshold <- bind_rows(
  get_accuracy_threshold(tab_simple, table_matches, "Simple"),
  get_accuracy_threshold(tab_weight, table_matches, "Weight"),
  get_accuracy_threshold(tab_custom, table_matches, "Custom"),
  get_accuracy_threshold(tab_merged, table_matches, "Merged"),
) %>%
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), 0, as.numeric(.)))
  )
tab_threshold %>%
  pivot_longer(matches("%")) %>%
  ggplot(aes(min, value, color = Name)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  geom_vline(xintercept = .5, linetype = "dashed") +
  geom_vline(xintercept = .8, linetype = "dashed") +
  facet_wrap(~ name, scales = "free_y") +
  theme_minimal() +
  theme(legend.position = "top")
We see that the higher we set the threshold, the better the matching gets (at the cost of leaving more observations unmatched).
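In practice you would therefore pick a scoring method and a cutoff and keep only matches at or above that score. A minimal sketch (the 0.8 cutoff is arbitrary):

# keep only matches with a score of at least 0.8 (arbitrary cutoff)
tab_final <- tab_simple %>%
  filter(!is.na(score), score >= 0.8)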
So far we only used the default similarity method. Let's compare how the different string distance metrics from stringdist perform.

methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")

match_method <- map(
  .x = set_names(methods, methods),
  .f = ~ match_data(
    .source = tab_source,
    .target = tab_target,
    .cols = match_cols,
    .join = c("name", "iso3"),
    .must_match = "iso3",
    .progress = FALSE,
    .max_match = 25,
    .method = .x
  )
)

score_method <- map(
  .x = match_method,
  .f = ~ scores_data(
    .matches = .x,
    .source = tab_source,
    .target = tab_target,
    .must_match = must,
    .w_custom = c(name = .5, city = .1, address = .4)
  )
)

simple_method <- map(
  .x = score_method,
  .f = ~ dedup_data(.x, tab_source, tab_target, "mean_simple")
) %>%
  `names<-`(paste0("Simple-", names(.)))

weight_method <- map(
  .x = score_method,
  .f = ~ dedup_data(.x, tab_source, tab_target, "mean_weight")
) %>%
  `names<-`(paste0("Weight-", names(.)))

custom_method <- map(
  .x = score_method,
  .f = ~ dedup_data(.x, tab_source, tab_target, "mean_custom")
) %>%
  `names<-`(paste0("Custom-", names(.)))

accuracy_method <- imap_dfr(
  .x = c(simple_method, weight_method, custom_method),
  .f = ~ get_accuracy(.x, .training, .y)
) %>%
  arrange(desc(`N (Match)` * `Correct (%)`)) %>%
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), 0, as.numeric(.))),
    across(matches("\\(%\\)"), ~ percent(., .01)),
    across(where(is.numeric), ~ comma(., 1)),
  )
show_table(accuracy_method, .align = "c")
Especially for company matching, there can be a lot of variation in how the legal form of a company is written (e.g. a public limited company may appear as plc, p.l.c., public limited company, ...). In the next step we try to standardize the legal forms with the function extract_legal_form().
tab_source_lf <- extract_legal_form(tab_source, "name") tab_target_lf <- extract_legal_form(tab_target, "name")
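Let's have a quick look at the result; extract_legal_form() should have added the standardized name and legal form columns we match on below.

show_table(head(tab_source_lf))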
match_cols_lf <- c("name_std", "iso3", "legal_form", "city", "address")
must_lf <- c("iso3", "legal_form")
tab_match_lf <- match_data(
  .source = tab_source_lf,
  .target = tab_target_lf,
  .cols = match_cols_lf,
  .must_match = must_lf
)
tab_score_lf <- scores_data(
  .matches = tab_match_lf,
  .source = tab_source_lf,
  .target = tab_target_lf,
  .must_match = must_lf,
  .w_custom = c(name_std = .5, city = .1, address = .4)
)
tab_simple_lf <- dedup_data(tab_score_lf, tab_source_lf, tab_target_lf, "mean_simple")
tab_weight_lf <- dedup_data(tab_score_lf, tab_source_lf, tab_target_lf, "mean_weight")
tab_custom_lf <- dedup_data(tab_score_lf, tab_source_lf, tab_target_lf, "mean_custom")
tab_merged_lf <- left_join(
  tab_source_lf, tab_target_lf,
  by = c("name_std", "iso3"),
  suffix = c("_s", "_t")
) %>%
  mutate(score = 1)
tab_accuracy_lf <- bind_rows(
  get_accuracy(tab_simple_lf, table_matches, "Simple"),
  get_accuracy(tab_weight_lf, table_matches, "Weight"),
  get_accuracy(tab_custom_lf, table_matches, "Custom"),
  get_accuracy(tab_merged_lf, table_matches, "Merged"),
) %>%
  arrange(desc(`N (Match)`)) %>%
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), 0, as.numeric(.))),
    across(matches("\\(%\\)"), ~ percent(., .01)),
    across(where(is.numeric), ~ comma(., 1)),
  )
show_table(tab_accuracy_lf, .align = "c")
tab_threshold_lf <- bind_rows(
  get_accuracy_threshold(tab_simple_lf, table_matches, "Simple"),
  get_accuracy_threshold(tab_weight_lf, table_matches, "Weight"),
  get_accuracy_threshold(tab_custom_lf, table_matches, "Custom"),
  get_accuracy_threshold(tab_merged_lf, table_matches, "Merged"),
) %>%
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), 0, as.numeric(.)))
  )
tab_threshold_lf %>%
  pivot_longer(matches("%")) %>%
  ggplot(aes(min, value, color = Name)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  geom_vline(xintercept = .5, linetype = "dashed") +
  geom_vline(xintercept = .8, linetype = "dashed") +
  facet_wrap(~ name, scales = "free_y") +
  theme_minimal() +
  theme(legend.position = "top")
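Once you have settled on a scoring method and a threshold, you can keep the accepted links and join them back to the original tables. A minimal sketch (the cutoff and the suffixes are arbitrary choices):

# keep the accepted links (cutoff of 0.8 is arbitrary)
tab_links <- tab_custom_lf %>%
  filter(!is.na(id_t), score >= 0.8) %>%
  select(id_s, id_t, score)
# attach the linked target rows back to the original source data
tab_linked <- table_source %>%
  left_join(tab_links, by = c("id" = "id_s")) %>%
  left_join(table_target, by = c("id_t" = "id"), suffix = c("_source", "_target"))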