fastLink: Fast Probabilistic Record Linkage

Authors:

For a detailed description of the method see:

Applications of the method:

Technical reports:

Data:

Installation Instructions

fastLink is available on CRAN and can be installed using:

install.packages("fastLink")

You can also install the most recent development version of fastLink using the devtools package. First you have to install devtools using the following code. Note that you only have to do this once:

if(!require(devtools)) install.packages("devtools")

Then, load devtools and use the function install_github() to install fastLink:

library(devtools)
install_github("kosukeimai/fastLink",dependencies=TRUE)

Simple usage example

The linkage algorithm can be run either with the fastLink() wrapper, which runs the algorithm from start to finish, or step by step. We will outline the workflow from start to finish using both approaches. In both cases, we have two data frames called dfA and dfB that we want to merge, and they share seven commonly named fields: firstname, middlename, lastname, housenum, streetname, city, and birthyear.

Running the algorithm using the fastLink() wrapper

The fastLink wrapper runs the entire algorithm from start to finish, as seen below:

## Load the package and data
library(fastLink)
data(samplematch)

matches.out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

Other arguments that can be provided include:

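The output from fastLink() when estimate.only = FALSE is a list of length 4. Two of its entries are used below: matches, which holds the indices of the matched observations in dfA (inds.a) and dfB (inds.b), and EM, which holds the output of the EM algorithm.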

When estimate.only = TRUE, fastLink() outputs the EM object.

The datasets can then be subsetted down to the matches as follows:

dfA.match <- dfA[matches.out$matches$inds.a,]
dfB.match <- dfB[matches.out$matches$inds.b,]

or using the getMatches() function:

matched_dfs <- getMatches(
  dfA = dfA, dfB = dfB, 
  fl.out = matches.out, threshold.match = 0.85
)
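The index-based subsetting shown above works because inds.a and inds.b are parallel vectors: row i of dfA.match and row i of dfB.match form one matched pair. The following is a minimal sketch of that alignment on toy data. The data frames toyA and toyB and the list toy_matches are hypothetical and only mimic the shape of matches.out$matches; no fastLink function is called.

```r
## Toy data standing in for dfA and dfB (hypothetical values)
toyA <- data.frame(firstname = c("ann", "bob", "cara"),
                   birthyear = c(1980, 1975, 1990),
                   stringsAsFactors = FALSE)
toyB <- data.frame(firstname = c("carla", "bob", "anne"),
                   birthyear = c(1990, 1975, 1980),
                   stringsAsFactors = FALSE)

## Hypothetical match indices with the same shape as matches.out$matches
toy_matches <- list(inds.a = c(1, 2), inds.b = c(3, 2))

## Subset each data set to the matched rows, in pair order
toyA.match <- toyA[toy_matches$inds.a, ]
toyB.match <- toyB[toy_matches$inds.b, ]

## Place each matched pair side by side for inspection
pairs <- data.frame(
  a = toyA.match$firstname,
  b = toyB.match$firstname,
  birthyear.a = toyA.match$birthyear,
  birthyear.b = toyB.match$birthyear,
  stringsAsFactors = FALSE
)
print(pairs)
```

Building a side-by-side table like this is a convenient way to spot-check matched pairs by eye before using them downstream.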

We can also examine the EM object:

matches.out$EM

which is a list of parameter estimates for different fields. These fields are:

Lastly, we can summarize the accuracy of the match using the summary() function:

summary(matches.out)

where each column gives the match count, match rate, false discovery rate (FDR) and false negative rate (FNR) under different cutoffs for matches based on the posterior probability of a match. Other arguments include:

Preprocessing Matches via Blocking

In order to reduce the number of pairwise comparisons that need to be conducted, researchers will often block similar observations from dataset A and dataset B together so that comparisons are only made between these maximally similar groups. Here, we implement a form of this clustering that uses word embedding, a common preprocessing method for textual data, to form maximally similar groups.

In fastLink, the function blockData() can block two data sets using a single variable or combinations of variables using several different blocking techniques. The basic functionality is similar to that of fastLink(), where the analyst inputs two data sets and a vector of variable names that they want to block on. A simple example follows, where we are blocking the two sample data sets by gender:

fl_out <- fastLink(dfA, dfB,
                   varnames = c("firstname", "middlename", "lastname",
                                "housenum", "streetname", "city", "birthyear"))

gender_match <- sample(c("M", "F"), nrow(fl_out$matches), replace = TRUE)

gender_a <- rep(NA, nrow(dfA))
gender_a[fl_out$matches$inds.a] <- gender_match

gender_b <- rep(NA, nrow(dfB))
gender_b[fl_out$matches$inds.b] <- gender_match

gender_a[is.na(gender_a)] <- sample(c("M", "F"), sum(is.na(gender_a)), replace = TRUE)
gender_b[is.na(gender_b)] <- sample(c("M", "F"), sum(is.na(gender_b)), replace = TRUE)

dfA$gender <- gender_a
dfB$gender <- gender_b
blockgender_out <- blockData(dfA, dfB, varnames = "gender")
names(blockgender_out)

In its simplest usage, blockData() takes two data sets and a single variable name for the varnames argument, and it returns the indices of the member observations for each block. The data sets can then be subsetted as follows, and the match can be run within each block separately:

## Subset dfA into blocks
dfA_block1 <- dfA[blockgender_out$block.1$dfA.inds,]
dfA_block2 <- dfA[blockgender_out$block.2$dfA.inds,]

## Subset dfB into blocks
dfB_block1 <- dfB[blockgender_out$block.1$dfB.inds,]
dfB_block2 <- dfB[blockgender_out$block.2$dfB.inds,]

## Run fastLink on each
fl_out_block1 <- fastLink(
  dfA_block1, dfB_block1,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)
fl_out_block2 <- fastLink(
  dfA_block2, dfB_block2,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)
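With more than a handful of blocks, writing out one subset per block becomes tedious. The sketch below shows one way to loop over blocks instead. It assumes only the structure used above (one list element per block, each with dfA.inds and dfB.inds); toyA, toyB, and toy_blocks are hypothetical stand-ins built by hand rather than produced by blockData().

```r
## Toy data standing in for dfA and dfB (hypothetical values)
toyA <- data.frame(id = 1:4, gender = c("F", "M", "F", "M"),
                   stringsAsFactors = FALSE)
toyB <- data.frame(id = 1:3, gender = c("M", "F", "F"),
                   stringsAsFactors = FALSE)

## Hypothetical list shaped like the blockData() output used above:
## one element per block, each holding row indices into dfA and dfB
toy_blocks <- list(
  block.1 = list(dfA.inds = which(toyA$gender == "F"),
                 dfB.inds = which(toyB$gender == "F")),
  block.2 = list(dfA.inds = which(toyA$gender == "M"),
                 dfB.inds = which(toyB$gender == "M"))
)

## Subset both data sets once per block
block_data <- lapply(toy_blocks, function(b) {
  list(dfA = toyA[b$dfA.inds, ], dfB = toyB[b$dfB.inds, ])
})

## Each element could then be passed to fastLink() in the same way as
## dfA_block1 and dfB_block1 above. Here we only report the block sizes.
block_sizes <- sapply(block_data, function(x) {
  c(nA = nrow(x$dfA), nB = nrow(x$dfB))
})
print(block_sizes)
```

Keeping the per-block subsets in a named list also makes it straightforward to collect the per-block results in a list of the same shape.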

blockData() also implements blocking methods other than exact blocking. Analysts commonly use window blocking for numeric variables, where a given observation in dataset A is compared to all observations in dataset B whose value of the blocking variable is within $\pm K$ of its value in dataset A. The value of $K$ is the size of the window: for instance, if we wanted to compare observations where birth year is within $\pm 1$ year, the window size is 1. Below, we block dfA and dfB on gender and birth year, using exact blocking on gender and window blocking with a window size of 1 on birth year:

## Exact block on gender, window block (+/- 1 year) on birth year
blockdata_out <- blockData(dfA, dfB, varnames = c("gender", "birthyear"),
                           window.block = "birthyear", window.size = 1)
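To make the window rule concrete, the following base R sketch lists the pairs that satisfy $|a_i - b_j| \le K$ for a few made-up birth years. It illustrates the definition given above only; it is not a description of how blockData() is implemented internally, and all values and variable names are hypothetical.

```r
## Hypothetical birth years for four records in A and three in B
birthyear_a <- c(1980, 1985, 1990, 1975)
birthyear_b <- c(1981, 1990, 1970)

K <- 1  # window size

## TRUE where |birthyear_a[i] - birthyear_b[j]| <= K
in_window <- abs(outer(birthyear_a, birthyear_b, "-")) <= K

## Row/column indices of the pairs that fall inside the window
candidate_pairs <- which(in_window, arr.ind = TRUE)
colnames(candidate_pairs) <- c("ind.a", "ind.b")
print(candidate_pairs)
```

In this toy example only two of the twelve possible pairs fall inside the window (record 1 in A with record 1 in B, and record 3 in A with record 2 in B), which is the reduction in comparisons that blocking is meant to achieve.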

blockData() also allows users to block variables using k-means clustering, so that similar values of string and numeric variables are blocked together. When applying k-means blocking to string variables such as name, the algorithm orders observations so that alphabetically close names are grouped together in a block. In the following example, we block dfA and dfB on gender and first name, again using exact blocking on gender and k-means blocking on first name while specifying 2 clusters for the k-means algorithm:

## Exact block on gender, k-means block on first name with 2 clusters
blockdata_out <- blockData(dfA, dfB, varnames = c("gender", "firstname"),
                           kmeans.block = "firstname", nclusters = 2)

Using Auxiliary Information to Inform fastLink

The fastLink algorithm also includes several ways to incorporate auxiliary information on migration behavior to inform the matching of data sets over time. Auxiliary information is incorporated into the estimation as priors on two parameters of the model:

The function calcMoversPriors() can be used to calculate estimates for the corresponding prior distributions using the IRS Statistics of Income Migration Data.

Below, we show an example where we incorporate the auxiliary moving information for California into our estimates. First, we use calcMoversPriors() to estimate optimal parameter values for the priors:

priors.out <- calcMoversPriors(geo.a = "CA", geo.b = "CA", year.start = 2014, year.end = 2015)
names(priors.out)

where the lambda.prior entry is the estimate of the match rate, while pi.prior is the estimate of the in-state movers rate.

The calcMoversPriors() function accepts the following arguments:

Incorporating Auxiliary Information with fastLink() Wrapper

We can re-run the full match above while incorporating auxiliary information as follows:

## Reasonable prior estimates for this dataset
priors.out <- list(lambda.prior = 50/(nrow(dfA) * nrow(dfB)), pi.prior = 0.02)

matches.out.aux <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  priors.obj = priors.out, 
  w.lambda = .5, w.pi = .5, 
  address.field = "streetname"
)

where priors.obj holds the prior parameters. It can be calculated by calcMoversPriors() or provided by the user as a list with two entries named lambda.prior and pi.prior. w.lambda and w.pi are user-specified weights between 0 and 1 that set the balance between the maximum likelihood estimate and the prior, where a weight of 0 places no weight on the prior. address.field is the name of the address-related field in varnames ("streetname" in the call above) to which the prior on pi applies.
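The hand-specified lambda.prior in the example above is an expected number of matches divided by the number of possible record pairs. The short sketch below works through that arithmetic; the data set sizes and the expected match count are hypothetical values chosen for illustration, and toy_priors is simply a list with the two entry names described above.

```r
## Hypothetical data set sizes
n_a <- 500
n_b <- 350

## Assumed prior belief: about 50 true matches across the two data sets
expected_matches <- 50

## Match rate = expected matches / number of possible pairs
n_pairs <- n_a * n_b
lambda_prior <- expected_matches / n_pairs

## List with the two named entries described above
toy_priors <- list(lambda.prior = lambda_prior, pi.prior = 0.02)
print(toy_priors)
```

Because the denominator is the product of the two data set sizes, the match rate is typically a very small number even when many records have a match.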

Aggregating Multiple Matches Together

Often, we run several different matches for a single data set, for instance when blocking by gender or by some other criterion to reduce the number of pairwise comparisons. Here, we walk through how to aggregate those multiple matches into a single summary. We run fastLink() on the subsets of data defined by blocking on gender in the previous section:

## Run fastLink on each
link.1 <- fastLink(
  dfA_block1, dfB_block1,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)
link.2 <- fastLink(
  dfA_block2, dfB_block2,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)

To aggregate the two matches into a single summary, we use the aggregateEM() function as follows:

agg.out <- aggregateEM(em.list = list(link.1, link.2))

aggregateEM() accepts two arguments:

We can then summarize the aggregated output as done previously:

summary(agg.out)

If we assume that the first fastLink run was for a within-geography match and the second was an across-geography match, the call to aggregateEM() would be:

agg.out <- aggregateEM(em.list = list(link.1, link.2), within.geo = c(TRUE, FALSE))
summary(agg.out)

Random Sampling with fastLink

The probabilistic modeling framework of fastLink is especially flexible in that it allows us to run the matching algorithm on a random smaller subset of data to be matched, and then apply those estimates to the full sample of data. This may be desired, for example, when using blocking along with a prior. We may want to block in order to reduce the number of pairwise comparisons, but may also be uncomfortable making the assumption that the same prior applies to all blocks uniformly. Random sampling allows us to run the EM algorithm with priors on a random sample from the full dataset, and the estimates can then be applied to each block separately to get matches for the entire dataset.

This functionality is incorporated into the fastLink() wrapper, which we show in the following example:

## Take 30% random samples of dfA and dfB
dfA.s <- dfA[sample(1:nrow(dfA), nrow(dfA) * .3),]
dfB.s <- dfB[sample(1:nrow(dfB), nrow(dfB) * .3),]

## Run the algorithm on the random samples
rs.out <- fastLink(
  dfA = dfA.s, dfB = dfB.s, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  estimate.only = TRUE
)
class(rs.out)

## Apply to the whole dataset
fs.out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  em.obj = rs.out
)
summary(fs.out)

In the first run of fastLink(), we specify estimate.only = TRUE, which runs the algorithm only through the EM estimation step and returns the EM object. In the second run of fastLink(), we provide the EM object from the first stage as an argument to em.obj. Then, using the parameter values calculated in the previous EM stage, we estimate posterior probabilities of belonging to the matched set for all matching patterns in the full dataset that were not present in the random sample.

Finding Duplicates within a Dataset via fastLink

The following lines of code show how to find duplicates within a dataset via fastLink. As before, we use fastLink() (the wrapper function) to do the merge. fastLink() will automatically detect that the two datasets are identical, and will use the probabilistic match algorithm to indicate duplicated entries in the dedupe.ids column of the returned data frame.

## Add duplicates
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE),])

## Run fastLink
fl_out_dedupe <- fastLink(
  dfA = dfA, dfB = dfA,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)

## Run getMatches
dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

## Look at the IDs of the duplicates
names(table(dfA_dedupe$dedupe.ids)[table(dfA_dedupe$dedupe.ids) > 1])

## Show duplicated observation
dfA_dedupe[dfA_dedupe$dedupe.ids == 501,]
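The ID shown above depends on which rows were sampled as duplicates, so it will differ from run to run. The sketch below shows a way to pull out every duplicated group without hard-coding an ID, and to keep one record per group. It runs on a hypothetical toy data frame with a dedupe.ids column; records sharing an ID are treated as duplicates of one another, as in the example above.

```r
## Hypothetical data with a dedupe.ids column
toy <- data.frame(
  firstname  = c("ann", "bob", "ann", "cara", "bob", "dan"),
  dedupe.ids = c(1, 2, 1, 3, 2, 4),
  stringsAsFactors = FALSE
)

## Count how often each ID occurs and keep the IDs seen more than once
id_counts <- table(toy$dedupe.ids)
dup_ids <- as.numeric(names(id_counts)[id_counts > 1])

## All rows that belong to a duplicated group
dup_rows <- toy[toy$dedupe.ids %in% dup_ids, ]

## One representative row per ID (drops the repeats)
deduped <- toy[!duplicated(toy$dedupe.ids), ]

print(dup_rows)
print(deduped)
```

Keeping the first row of each group is only one possible rule; which record to retain is a substantive choice that depends on the application.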


kosukeimai/fastLink documentation built on Nov. 17, 2023, 8:11 p.m.