Introduction to fedmatch
In fedmatch: Fast, Flexible, and User-Friendly Record Linkage Methods

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background

With an ever-expanding set of financial data providers, each covering many of the same firms and entities, matching data sets that do not share common identifiers has never been more important.fedmatch is a set of tools for performing record linkage between two data sets. It allows for a variety of different matching techniques, letting the user build a matching algorithm for their specific application. Although fedmatch was designed with economic data in mind (i.e. loans or companies), it is very flexible, so it can be used for any matching problem. With fedmatch, a researcher or analyst can quickly go from having 0 matches between two datasets to having many. With more time and care, they can use more advanced techniques to pull out even more matches.

Fedmatch has many features, including:

Tools to clean strings (see below)
Matching on names (see below, and the fuzzy matching vignette)
A new method of string comparison, which we dub "Weighted Jaccard" (see the fuzzy matching vignette)
Matching using probabilistic methods and logit models (see the multivar match vignette)
Matching using many different methods in sequence, pulling matches out as you go along (see the tier match vignette)

This vignette will explain the basics of fedmatch, including the merge_plus function and the clean_strings function. From there, other vignettes go further into the details of the different matching types.

library(fedmatch)
library(data.table)

Name cleaning

Before diving into the matching code, we'll first go over the clean_strings function that can help standardize company names across data sets.

A basic example of clean strings looks like this:

raw_names <- c("Hamlin, Hamlin, McGill", "Schweibert & Cokely ", "Wexler McGill, LLC",
               "Davis and Main, Inc.")
clean_names <- clean_strings(raw_names)
clean_names

Without any additional arguments, clean_strings does the following:

Make everything lowercase
Replace the special characters &, @, %, $ with their word equivalents
Remove all other special characters (e.g. commas, periods)
Convert tabs to spaces
Remove extra spaces

As described in the manual for clean_strings, one can specify further arguments to remove words or try different replacements. Fedmatch comes with a set of words that are commonly used for this, but you can use whatever you'd like. (You can also use word_frequency to look for common words in your data.)

fedmatch::corporate_words[1:5]
scrubbed_names <- clean_strings(raw_names, common_words = fedmatch::corporate_words)
scrubbed_names

Through string cleaning, we can make it so that even if two different databases record names differently (e.g. "Hamlin Hamlin McGill INC" vs "Hamlin Hamlin McGill Incorporated"), we will still count these records as a match.

Basics: `merge_plus`

The workhorse of fedmatch is merge_plus. merge_plus is an extremely flexible function that can perform several different types of matches: exact, fuzzy, and multivar.

Exact matching

Here are the example datasets that come with fedmatch:

fedmatch::corp_data1
fedmatch::corp_data2

The most basic way to use merge_plus is by simply making it equivalent to base::merge.

basic_merge <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "exact", 
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")

This code will run merge on the "Company" and "Name" variables, and return cases where the two have an exact match. The only differences between this and base::merge are

merge_plus requires data1 and data2 each to have a "unique key" that can be used to identify an observation.
merge_plus returns a list.

Let's take a look at each of the elements of the list returned by merge_plus. These will always be the same, no matter which match_type you select in merge_plus.

The first item is the matches themselves. This is a data.table with one row for each matching observation, along with all variables present in each data set.

print(basic_merge$matches)

The next item is matches_filter. In this example, it's empty, because we didn't supply the argument filter. If we did supply filter (which can either be a function that filters, or a numeric cutoff for a matchscore (more on this later)), we would see a subsample of the matches dataset.

print(basic_merge$matches_filter)

Next in the list is data1_nomatch and data2_nomatch, which return the rows that were not matched from the datasets.

print(basic_merge$data1_nomatch)
print(basic_merge$data2_nomatch)

Finally, there is match_evaluation, which is a data.table that summarizes how well the match worked. It shows the number of matches in each dataset broken down by tier (more on tiers later), along with the percent matched.

print(basic_merge$match_evaluation)

Fuzzy matching

We can also use merge_plus to perform "fuzzy" matches. A fuzzy match uses a string distance algorithm to compute the distance between one string and a set of other strings, then picks the closest string that's over a certain threshold. fedmatch uses stringdist::amatch to execute these matches, and you can read more about string distances in the stringdist package documentation.

Here is an example of how this is implemented in merge_plus.

fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", fuzzy_settings = build_fuzzy_settings(nthread = 1),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)

We can see that we picked up an additional match here: "Bershire Hataway" and "Bershire Hathaway." These are off by 1 character, so the exact match didn't pick them up, but the fuzzy match did. We can also tweak the fuzzy match settings with the argument fuzzy_settings. This is a list that will be passed to stringdist::amatch.

fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          fuzzy_settings = build_fuzzy_settings(maxDist = .5),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)

So, cranking up the maxDist (maximum distance between strings, a threshold for determining matches) gave us a bunch more matches. Note that we return multiple matches per the same unique key sometimes, for example Ford Motor got matched to General Motors and Ford Motor Company. There are many tweaks that one can make via fuzzy_settings, and these change the match behavior significantly. It is worth exploring various options to see which make the most sense for your specific application. See the fuzzy matching vignette for more details, including a new method of string comparison that we call a "Weighted Jaccard" comparison.

Multivar matching

The final setting for match_type in merge_plus is a "multivariable match", or "multivar" for short. This match is complex, and may take some playing around with the code to fully understand how it works. We'll just go over the basic usage here. The idea behind the multivariable match is to use several variables from each dataset to execute a match, rather than just using the name of an entity.

One way is to take the set of variables (say, company name, state, and earnings), compare them with some numeric metric, and then perform a linear combination of those metrics to arrive at a final score. Then, you can compare each observation in one dataset to each other observation in the other dataset to pick the match with the highest score.

The other way is similar, but instead of a linear combination of scores, you can use a logit model. In this method, you create a hand-verified match set between your two datasets, then use a logit model to estimate how much each variable contributes to determining a match.

Here's an example of the first method, the linear combination:

# for simplicity's sake, rename columns in corp_data2
data.table::setnames(corp_data2, c("Name", "country"), c("Company", "Country"))
multivar_linear_result <- merge_plus(corp_data1, corp_data2, 
                                     match_type = "multivar",
                                     by = c("Country", "Company"), 
                                     unique_key_1 = "unique_key_1",
                                     suffixes = c("_1", "_2"),
                                     unique_key_2 = "unique_key_2",

                                     multivar_settings = build_multivar_settings(compare_type = c("indicator", "stringdist"),
                                                               wgts = c(.5, .5),
                                                              top = 1))
multivar_linear_result$matches

To specify this type of match, we put in the match_type as "multivar," and then we specified how we wanted the match to run by passing the list multivar_settings. Each element of this list is a separate argument to go into multivar_match. The compare_type argument tells the multivar how to compare each variable in the by argument. Because "Country" is a binary variable, we specify "indicator", and because "Company" is a string variable, we specify "stringdist." You can see a full list of options for comparison in the multivar_match documentation, or equivalently, the merge_plus score_settings documentation. (merge_plus has an option to compute matchscores post-hoc, as a method of evaluation.)

Next, here's an example the second method, using a logit model. First, we'll set up a fake training table. Normally, one would construct a human-verified match set. Here, I just create a table where the first half are matches, the second half are a mix of matches and not, and then the two comparison variables are biased to be more of a match in the first half of the sample. This is just a way to ensure that our logit model gives us positive coefficients, so that our example makes a little more sense.

set.seed(111)
training_table <- data.table::data.table(match = c(rep(1, 5e4), sample(c(0,1 ), 5e4, replace = TRUE)),
                                Company_compare = seq(1, 0.00001, -.00001),
                                Country_compare = c(rep(1, 5e4), sample(c(1, 0), 5e4, replace = TRUE)))
# training_table
logit_model <- glm(match ~ Company_compare + Country_compare, family = "binomial",
                   data = training_table)
summary(logit_model)

Then, we plug our logit model into the multivar_settings. The code will then use our trained logit model on the variables we specified. Note that the name of the columns in the training set must match the name of the variables in the match datasets, with "_compare" at the end.

result <- merge_plus(corp_data1, corp_data2, by = c("Country", "Company"), unique_key_1 = "unique_key_1",
                        unique_key_2 = "unique_key_2", 
                     match_type = "multivar",
                     multivar_settings = list(logit = logit_model, compare_type = c("indicator", "stringdist"),
                        wgts = NULL),
                        suffixes = c("_1", "_2"))
result$matches

Note the last few columns in the data.table: we see the comparison metrics, just like in the linear combination version of multivar_match. But, note that instead of computing a 50/50 linear combination like before, we are now computing a matchscore as the fitted probability of a match based on our logit model. In this toy example, the coefficients are a little strange because of the random data we fed in. But, we see the behavior we'd expect: a higher company name comparison and a country match gives us a higher matchscore.

Summary and next steps

We've covered the several different types of matching with fedmatch: exact matching, fuzzy matching, and multivar matching. Each match is useful in its own right, and they become even more useful when combined. That's where the next step comes in: tier matching with the function tier_match. See the vignette for tier match for more details.