In this vignette, we demonstrate how to use dblinkR to perform Bayesian entity resolution. dblinkR implements a generative Bayesian model that jointly performs blocking and entity resolution of multiple databases as outlined in the following paper:
Marchant, N. G., Steorts, R. C., Kaplan, A., Rubinstein, B. I. P., & Elazar, D. N. (2019). d-blink: Distributed End-to-End Bayesian Entity Resolution. arXiv preprint arXiv:1909.06039.
We will be using a dataset called RLdata500, which was formerly available in the RecordLinkage R package (now deprecated). For convenience, we have included the RLdata500 dataset in dblinkR.
An outline of the vignette is as follows: we load the required packages and connect to Spark, prepare the RLdata500 records, specify the model parameters, run inference, examine the MCMC diagnostics, and finally evaluate the results against the ground truth.
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
We begin by loading the required packages. Note that the dblinkR package must be loaded after sparklyr.
# For running dblink
library(sparklyr)
library(dblinkR)

# For generating trace plots
library(stringr)
library(dplyr)
library(tidyr)
library(ggplot2)

# For generating credible interval plots
library(tidybayes)
In this vignette, we will run dblink on a local instance of Spark with 2 cores. This is sufficient for testing purposes; however, for larger data sets, we recommend connecting to a Spark deployment. The sparklyr documentation contains information about connecting to Spark deployments.
sc <- spark_connect(master = "local[2]", version = "2.4.3")
spark_context(sc) %>% invoke("setLogLevel", "WARN")
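For reference, connecting to a standalone Spark cluster with sparklyr follows the same pattern. The sketch below is illustrative only; the master URL and executor memory are placeholders that depend on your deployment.

# Illustrative only: connect to a standalone Spark cluster
# (the master URL and executor memory below are placeholders)
config <- spark_config()
config$spark.executor.memory <- "8G"
sc <- spark_connect(master = "spark://spark-master.example.com:7077",
                    version = "2.4.3", config = config)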
dblink requires a location on disk (including HDFS) to save diagnostics, posterior samples, and the state of the Markov chain. We refer to this location as the project path, and set it to the current working directory below. In addition, we must specify a location on disk to save Spark checkpoints.
projectPath <- paste0(getwd(), "/")  # working directory
spark_set_checkpoint_dir(sc, paste0(projectPath, "checkpoints/"))
Now that we have loaded the required packages and connected to Spark, we can proceed with the rest of the tutorial!
RLdata500 is a synthetic dataset containing 500 personal records with 10 percent duplication. Thus, there are 450 unique individuals represented in the data and 50 duplicate records. Each record contains the individual's full name and date of birth; however, these attributes may be distorted. We add an additional column of record identifiers, which is used to specify the linkage structure later on.
records <- RLdata500
records['rec_id'] <- seq_len(nrow(records))
records <- lapply(records, as.character) %>%
  as.data.frame(stringsAsFactors = FALSE)

# inspect RLdata500
head(RLdata500)
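The duplication level quoted above can be checked against the ground-truth identity vector identity.RLdata500 (used again in the evaluation section below), which assigns the same identifier to records that refer to the same individual.

# number of unique individuals (450) and duplicate records (50) in the ground truth
length(unique(identity.RLdata500))
sum(duplicated(identity.RLdata500))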
Here, we specify the parameters for the dblink model.
We treat the name-related attributes as string-type with a Levenshtein similarity function, and the date-related attributes as categorical (with a constant similarity function). We also place a Beta prior on the distortion probability for each attribute.
distortionPrior <- BetaRV(1, 50)
levSim <- LevenshteinSimFn(threshold = 7, maxSimilarity = 10)
attributeSpecs <- list(
  fname_c1 = Attribute(levSim, distortionPrior),
  lname_c1 = Attribute(levSim, distortionPrior),
  by = CategoricalAttribute(distortionPrior),
  bm = CategoricalAttribute(distortionPrior),
  bd = CategoricalAttribute(distortionPrior)
)
The Beta prior on the distortion probabilities favours low distortion, as shown below.
tibble(prob = seq(from = 0, to = 1, by = 0.01)) %>%
  mutate(density = dbeta(prob, distortionPrior@shape1, distortionPrior@shape2)) %>%
  ggplot(aes(x = prob, y = density)) +
  geom_line() +
  labs(title = "Prior distribution on the distortion probabilities",
       x = "Distortion probability", y = "Density")
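As a quick sanity check, the mean of this Beta(1, 50) prior is shape1 / (shape1 + shape2) = 1/51, i.e. roughly 2% of attribute values are expected to be distorted a priori.

# prior mean of the Beta(1, 50) distortion prior
distortionPrior@shape1 / (distortionPrior@shape1 + distortionPrior@shape2)
#> [1] 0.01960784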
We can also specify the size of the latent entity population. A larger population size favours under-linkage, while a smaller population size favours over-linkage. By default, the population size is set to the number of records in the data; this default is selected by setting the population size to NULL:
populationSize <- NULL
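If prior knowledge about the number of latent entities were available, a fixed population size could be supplied instead of NULL; the value below is purely illustrative.

# illustrative only: fix the latent population size explicitly
# populationSize <- 450L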
Finally, we specify the partitioner, which is used to block the entities in order to improve scalability. Since the data is small in this case, we can switch partitioning off by setting the partitioner to NULL. For larger data sets, we recommend using the KDTreePartitioner (please refer to the documentation).
partitioner <- NULL
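For illustration only, a k-d tree partitioner specification might look roughly like the commented sketch below; the constructor arguments are assumptions on our part, so consult the KDTreePartitioner documentation for the exact interface.

# Hypothetical sketch (argument names are assumptions; see ?KDTreePartitioner):
# partition the entities into 2^2 = 4 blocks along two attributes
# partitioner <- KDTreePartitioner(numLevels = 2, c("fname_c1", "lname_c1"))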
In the previous section, we specified the model parameters. Now, we are ready to perform inference. The results will be saved in the project path.
state <- initializeState(sc, records, attributeSpecs,
                         recIdColname = 'rec_id',
                         partitioner = partitioner,
                         populationSize = populationSize,
                         randomSeed = 1,
                         maxClusterSize = 10L)
result <- runInference(state, projectPath,
                       sampleSize = 1000,      # number of retained (thinned) samples
                       burninInterval = 5000,  # initial iterations discarded as burn-in
                       thinningInterval = 10)  # keep every 10th iteration
To load results from a previous run, we can use the following function.
#result <- loadResult(sc, projectPath)
It's important to review convergence and mixing diagnostics when performing inference using Markov chain Monte Carlo (MCMC). Various diagnostics are saved in the diagnostics.csv file located in the project path. We can use the function below to load these into a local tibble.
diagnostics <- loadDiagnostics(sc, projectPath)
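A quick look at the columns shows what is available: besides the iteration number, the tibble contains the summary statistics used in the trace plots below.

# inspect the available diagnostic columns
glimpse(diagnostics)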
Below we provide trace plots of various summary statistics.
ggplot(diagnostics, aes(x = iteration, y = numObservedEntities)) +
  geom_line() +
  labs(title = 'Trace plot: # entity clusters', x = 'Iteration', y = '# clusters')

diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  mutate(attribute = str_match(attribute, "^aggDist(.+)")[,2],
         numDistortions = numDistortions / nrow(records)) %>%
  ggplot(aes(x = iteration, y = numDistortions)) +
  geom_line(aes(colour = attribute)) +
  labs(title = 'Trace plot: attribute distortion', x = 'Iteration',
       y = '% distorted', colour = 'Attribute')

ggplot(diagnostics, aes(x = iteration, y = logLikelihood)) +
  geom_line() +
  labs(title = 'Trace plot: log-likelihood (unnormalized)', x = 'Iteration',
       y = 'log-likelihood')
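If the coda package is installed, effective sample sizes offer a rough additional check on mixing; the sketch below simply applies coda::effectiveSize to the summary statistics plotted above.

# approximate effective sample sizes (requires the coda package)
coda::effectiveSize(diagnostics$numObservedEntities)
coda::effectiveSize(diagnostics$logLikelihood)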
Additional diagnostic statistics can be computed from the samples of the linkage structure, which we refer to as the linkage chain.
linkageChain <- loadLinkageChain(sc, projectPath)

clustSizeDist <- clusterSizeDistribution(linkageChain)
ggplot(clustSizeDist, aes(x = iteration, y = frequency)) +
  geom_line(aes(colour = clusterSize)) +
  labs(title = 'Trace plot: cluster size distribution', x = 'Iteration',
       y = 'Frequency', colour = 'Cluster size')

partSizes <- partitionSizes(linkageChain)
ggplot(partSizes, aes(x = iteration, y = size)) +
  geom_line(aes(colour = partitionId)) +
  labs(title = 'Trace plot: partition sizes', x = 'Iteration',
       y = 'Size', colour = 'Partition')
In this section, we evaluate the predictions of our model and compare against the ground truth (provided with the synthetic data).
We compute pairwise precision and recall using a point estimate of the linkage structure. We also inspect the posterior distribution over two important summary statistics: the number of unique entities represented in the data, and the distortion level of each attribute.
We compute a point estimate of the posterior linkage structure/clustering using the shared most probable maximal matching sets method (Steorts et al., 2016). This method creates a globally consistent estimate that obeys transitivity constraints.
Note that we must use the collect function to retrieve the data from Spark into a local tibble.
predClusters <- sharedMostProbableClusters(linkageChain) %>% collect()
Next, we assess the quality of the predicted linkage structure by computing pairwise precision and recall using functions from the clevr package.
library(clevr)

predMatches <- clusters_to_pairs(predClusters)
trueMatches <- membership_to_pairs(identity.RLdata500, elem_ids = records$rec_id)

numRecords <- nrow(records)
measures <- eval_report_pairs(trueMatches, predMatches, numRecords * (numRecords - 1) / 2)
print(measures)
We can use the saved posterior samples to infer unknown model parameters. First, we consider the number of unique entities present in the data. This summary statistic is stored in diagnostics$numObservedEntities. Below we compute a point estimate based on the median, along with a 95% highest density interval.
hdiNumEntities <- median_hdi(diagnostics$numObservedEntities)
cat("The predicted number of unique entities in RLdata500 is ",
    hdiNumEntities$y, ".\n", sep = "")
cat("A 95% HDI is ",
    sprintf("[%s,%s]", hdiNumEntities$ymin, hdiNumEntities$ymax), ".\n", sep = "")
We can also visualize the estimated posterior distribution, comparing it to the ground truth (red), the posterior median (blue) and the 95% highest density interval (green).
ggplot(diagnostics) +
  geom_histogram(aes(x = numObservedEntities), binwidth = 1) +
  geom_vline(aes(xintercept = hdiNumEntities$y), color = 'blue') +
  geom_vline(aes(xintercept = 450), color = 'red') +
  geom_vline(aes(xintercept = hdiNumEntities$ymin), color = 'green') +
  geom_vline(aes(xintercept = hdiNumEntities$ymax), color = 'green') +
  labs(x = "# of entities", y = "Frequency",
       title = "Posterior number of observed entities")
Second, we consider the distortion level observed in each attribute, averaged across all of the records. This information is also stored in diagnostics. Below we plot a point estimate of the distortion level for each attribute based on the median, along with a 95% highest density interval.
diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  transmute(attribute = str_match(attribute, "^aggDist(.+)")[,2],
            percDistorted = numDistortions / nrow(records) * 100) %>%
  group_by(attribute) %>%
  ggplot(aes(y = attribute, x = percDistorted)) +
  stat_pointinterval(point_interval = median_hdi, .width = 0.95) +
  labs(y = "Attribute", x = "Distortion level (%)",
       title = "Posterior distortion level by attribute")
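The same point estimates and intervals can also be extracted numerically rather than plotted, e.g. by applying tidybayes' median_hdi to the grouped data; this sketch reuses the aggDist columns from above.

# per-attribute posterior median and 95% HDI of the distortion level
diagnostics %>%
  select(iteration, starts_with("aggDist")) %>%
  gather(attribute, numDistortions, starts_with("aggDist")) %>%
  transmute(attribute = str_match(attribute, "^aggDist(.+)")[,2],
            percDistorted = numDistortions / nrow(records) * 100) %>%
  group_by(attribute) %>%
  median_hdi(percDistorted, .width = 0.95)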