In this vignette, we demonstrate how to use dblinkR to perform Bayesian entity resolution. dblinkR implements a generative Bayesian model that jointly performs blocking and entity resolution of multiple databases as outlined in the following paper:
Marchant, N. G., Steorts, R. C., Kaplan, A., Rubinstein, B. I. P., & Elazar, D. N. (2019). d-blink: Distributed End-to-End Bayesian Entity Resolution. arXiv preprint arXiv:1909.06039.
We will be using a dataset called RLdata500, which was formerly available in the RecordLinkage R package (now deprecated). For convenience, we have included the RLdata500 dataset in dblinkR.
An outline of the vignette is as follows: we load the required packages and connect to Spark, prepare the RLdata500 records, specify the model parameters, run inference, examine the MCMC diagnostics, and finally evaluate the results against the ground truth.
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
We begin by loading the required packages. Note that the dblinkR package must be loaded after sparklyr.
# For running dblink
library(sparklyr)
library(dblinkR)

# For generating trace plots
library(stringr)
library(dplyr)
library(tidyr)
library(ggplot2)

# For generating credible interval plots
library(tidybayes)
In this vignette, we will run dblink on a local instance of Spark with 2 cores. This is sufficient for testing purposes; however, for larger data sets, we recommend connecting to a Spark deployment. The sparklyr documentation contains information about connecting to Spark deployments.
sc <- spark_connect(master = "local[2]", version = "2.4.3")
spark_context(sc) %>% invoke("setLogLevel", "WARN")
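For reference, connecting to a standalone Spark cluster with sparklyr follows the same pattern. The sketch below is illustrative only; the master URL and executor memory are placeholders that depend on your deployment.

# Illustrative only: connect to a standalone Spark cluster
# (the master URL and executor memory below are placeholders)
config <- spark_config()
config$spark.executor.memory <- "8G"
sc <- spark_connect(master = "spark://spark-master.example.com:7077",
                    version = "2.4.3", config = config)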
dblink requires a location on disk (including HDFS) to save diagnostics, posterior samples, and the state of the Markov chain. We refer to this location as the project path, and set it to the current working directory below. In addition, we must specify a location on disk to save Spark checkpoints.
projectPath <- paste0(getwd(), "/")  # working directory
spark_set_checkpoint_dir(sc, paste0(projectPath, "checkpoints/"))
Now that we have loaded the required packages and connected to Spark, we can proceed with the rest of the tutorial!
RLdata500 is a synthetic dataset containing 500 personal records with 10 percent duplication. Thus, there are 450 unique individuals represented in the data and 50 duplicate records. Each record contains the individual's full name and date of birth; however, these attributes may be distorted. We add an additional column of record identifiers, which is used to specify the linkage structure later on.
records <- RLdata500
records['rec_id'] <- seq_len(nrow(records))
records <- lapply(records, as.character) %>%
  as.data.frame(stringsAsFactors = FALSE)

# inspect RLdata500
head(RLdata500)
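The duplication level quoted above can be checked against the ground-truth identity vector identity.RLdata500 (used again in the evaluation section below), which assigns the same identifier to records that refer to the same individual.

# number of unique individuals (450) and duplicate records (50) in the ground truth
length(unique(identity.RLdata500))
sum(duplicated(identity.RLdata500))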
Here, we specify the parameters for the dblink model.
We treat the name-related attributes as string-type with a Levenshtein similarity function, and the date-related attributes as categorical (with a constant similarity function). We also place a Beta prior on the distortion probability for each attribute.
distortionPrior <- BetaRV(1, 50)
levSim <- LevenshteinSimFn(threshold = 7, maxSimilarity = 10)
attributeSpecs <- list(
  fname_c1 = Attribute(levSim, distortionPrior),
  lname_c1 = Attribute(levSim, distortionPrior),
  by = CategoricalAttribute(distortionPrior),
  bm = CategoricalAttribute(distortionPrior),
  bd = CategoricalAttribute(distortionPrior)
)
The Beta prior on the distortion probabilities favours low distortion, as shown below.
tibble(prob = seq(from = 0, to = 1, by = 0.01)) %>%
  mutate(density = dbeta(prob, distortionPrior@shape1, distortionPrior@shape2)) %>%
  ggplot(aes(x = prob, y = density)) +
  geom_line() +
  labs(title = "Prior distribution on the distortion probabilities",
       x = "Distortion probability", y = "Density")
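As a quick sanity check, the mean of this Beta(1, 50) prior is shape1 / (shape1 + shape2) = 1/51, i.e. roughly 2% of attribute values are expected to be distorted a priori.

# prior mean of the Beta(1, 50) distortion prior
distortionPrior@shape1 / (distortionPrior@shape1 + distortionPrior@shape2)
#> [1] 0.01960784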
We can also specify the size of the latent entity population. A larger population size favours under-linkage, while a smaller population size favours over-linkage. By default, the population size is set to the number of records in the data; this default is selected by setting the population size to NULL:
populationSize <- NULL
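If prior knowledge about the number of latent entities were available, a fixed population size could be supplied instead of NULL; the value below is purely illustrative.

# illustrative only: fix the latent population size explicitly
# populationSize <- 450L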
Finally, we specify the partitioner, which is used to block the entities in order to improve scalability. Since the data is small in this case, we can switch partitioning off by setting the partitioner to NULL. For larger data sets, we recommend using the KDTreePartitioner (please refer to the documentation).
partitioner <- NULL
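For illustration only, a k-d tree partitioner specification might look roughly like the commented sketch below; the constructor arguments are assumptions on our part, so consult the KDTreePartitioner documentation for the exact interface.

# Hypothetical sketch (argument names are assumptions; see ?KDTreePartitioner):
# partition the entities into 2^2 = 4 blocks along two attributes
# partitioner <- KDTreePartitioner(numLevels = 2, c("fname_c1", "lname_c1"))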
In the previous section, we specified the model parameters. Now, we are ready to perform inference. The results will be saved in the project path.
state <- initializeState(sc, records, attributeSpecs,
                         recIdColname = 'rec_id',
                         partitioner = partitioner,
                         populationSize = populationSize,
                         randomSeed = 1,
                         maxClusterSize = 10L)
result <- runInference(state, projectPath,
                       sampleSize = 1000,      # number of retained (thinned) samples
                       burninInterval = 5000,  # initial iterations discarded as burn-in
                       thinningInterval = 10)  # keep every 10th iteration
To load results from a previous run, we can use the following function.
#result <- loadResult(sc, projectPath)
It's important to review convergence and mixing diagnostics when performing inference using Markov chain Monte Carlo (MCMC). Various diagnostics are saved in the diagnostics.csv file located in the project path. We can use the function below to load these into a local tibble.
diagnostics <- loadDiagnostics(sc, projectPath)
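A quick look at the columns shows what is available: besides the iteration number, the tibble contains the summary statistics used in the trace plots below.

# inspect the available diagnostic columns
glimpse(diagnostics)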
Below we provide trace plots of various summary statistics.
ggplot(diagnostics, aes(x = iteration, y = numObservedEntities)) +
  geom_line() +
  labs(title = 'Trace plot: # entity clusters', x = 'Iteration', y = '# clusters')

diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  mutate(attribute = str_match(attribute, "^aggDist(.+)")[,2],
         numDistortions = numDistortions / nrow(records)) %>%
  ggplot(aes(x = iteration, y = numDistortions)) +
  geom_line(aes(colour = attribute)) +
  labs(title = 'Trace plot: attribute distortion', x = 'Iteration',
       y = '% distorted', colour = 'Attribute')

ggplot(diagnostics, aes(x = iteration, y = logLikelihood)) +
  geom_line() +
  labs(title = 'Trace plot: log-likelihood (unnormalized)', x = 'Iteration',
       y = 'log-likelihood')
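If the coda package is installed, effective sample sizes offer a rough additional check on mixing; the sketch below simply applies coda::effectiveSize to the summary statistics plotted above.

# approximate effective sample sizes (requires the coda package)
coda::effectiveSize(diagnostics$numObservedEntities)
coda::effectiveSize(diagnostics$logLikelihood)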
Additional diagnostic statistics can be computed from the samples of the linkage structure, which we refer to as the linkage chain.
linkageChain <- loadLinkageChain(sc, projectPath)

clustSizeDist <- clusterSizeDistribution(linkageChain)
ggplot(clustSizeDist, aes(x = iteration, y = frequency)) +
  geom_line(aes(colour = clusterSize)) +
  labs(title = 'Trace plot: cluster size distribution', x = 'Iteration',
       y = 'Frequency', colour = 'Cluster size')

partSizes <- partitionSizes(linkageChain)
ggplot(partSizes, aes(x = iteration, y = size)) +
  geom_line(aes(colour = partitionId)) +
  labs(title = 'Trace plot: partition sizes', x = 'Iteration',
       y = 'Size', colour = 'Partition')
In this section, we evaluate the predictions of our model and compare against the ground truth (provided with the synthetic data).
We compute pairwise precision and recall using a point estimate of the linkage structure. We also inspect the posterior distribution over two important summary statistics: the number of unique entities represented in the data, and the distortion level of each attribute.
We compute a point estimate of the posterior linkage structure/clustering using the shared most probable maximal matching sets method (Steorts et al., 2016). This method creates a globally consistent estimate that obeys transitivity constraints.
Note that we must use the collect function to retrieve the data from Spark into a local tibble.
predClusters <- sharedMostProbableClusters(linkageChain) %>% collect()
Next, we assess the quality of the predicted linkage structure by computing pairwise precision and recall using functions from the clevr package.
library(clevr)

predMatches <- clusters_to_pairs(predClusters)
trueMatches <- membership_to_pairs(identity.RLdata500, elem_ids = records$rec_id)

numRecords <- nrow(records)
measures <- eval_report_pairs(trueMatches, predMatches, numRecords * (numRecords - 1) / 2)
print(measures)
We can use the saved posterior samples to infer unknown model parameters. First, we consider the number of unique entities present in the data. This summary statistic is stored in diagnostics$numObservedEntities. Below we compute a point estimate based on the median, along with a 95% highest density interval.
hdiNumEntities <- median_hdi(diagnostics$numObservedEntities)
cat("The predicted number of unique entities in RLdata500 is ",
    hdiNumEntities$y, ".\n", sep = "")
cat("A 95% HDI is ",
    sprintf("[%s,%s]", hdiNumEntities$ymin, hdiNumEntities$ymax), ".\n", sep = "")
We can also visualize the estimated posterior distribution, comparing it to the ground truth (red), the posterior median (blue) and the 95% highest density interval (green).
ggplot(diagnostics) +
  geom_histogram(aes(x = numObservedEntities), binwidth = 1) +
  geom_vline(aes(xintercept = hdiNumEntities$y), color = 'blue') +
  geom_vline(aes(xintercept = 450), color = 'red') +
  geom_vline(aes(xintercept = hdiNumEntities$ymin), color = 'green') +
  geom_vline(aes(xintercept = hdiNumEntities$ymax), color = 'green') +
  labs(x = "# of entities", y = "Frequency",
       title = "Posterior number of observed entities")
Second, we consider the distortion level observed in each attribute, averaged across all of the records. This information is also stored in diagnostics. Below we plot a point estimate of the distortion level for each attribute based on the median, along with a 95% highest density interval.
diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  transmute(attribute = str_match(attribute, "^aggDist(.+)")[,2],
            percDistorted = numDistortions / nrow(records) * 100) %>%
  group_by(attribute) %>%
  ggplot(aes(y = attribute, x = percDistorted)) +
  stat_pointinterval(point_interval = median_hdi, .width = 0.95) +
  labs(y = "Attribute", x = "Distortion level (%)",
       title = "Posterior distortion level by attribute")
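The same point estimates and intervals can also be extracted numerically rather than plotted, e.g. by applying tidybayes' median_hdi to the grouped data; this sketch reuses the aggDist columns from above.

# per-attribute posterior median and 95% HDI of the distortion level
diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  transmute(attribute = str_match(attribute, "^aggDist(.+)")[,2],
            percDistorted = numDistortions / nrow(records) * 100) %>%
  group_by(attribute) %>%
  median_hdi(percDistorted, .width = 0.95)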