knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Load the required packages.
library(blocking)
library(data.table)
Read the example data from the tutorial on the reclin package presented at the URos 2021 Conference. The data sets are from the ESSnet on Data Integration, as stated in the repository:
These totally fictional data sets are supposed to have captured details of persons up to the date 31 December 2011. Any years of birth captured as 2012 are therefore in error. Note that in the fictional Census data set, dates of birth between 27 March 2011 and 31 December 2011 are not necessarily in error.
census: A fictional data set to represent some observations from a decennial Census.
cis: Fictional observations from the Customer Information System, which is combined administrative data from the tax and benefit systems.
In the census data set, all records contain a person_id. For some of the records in cis, the person_id is also available. This information can be used to evaluate the linkage (assuming these records are representative of all records in cis).
data(census)
data(cis)
The census object has `r nrow(census)` rows and `r ncol(census)` columns; the cis object has `r nrow(cis)` rows and `r ncol(cis)` columns.
Census data
head(census)
CIS data
head(cis)
We randomly select `r as.integer(floor(nrow(census) / 2))` records from census and `r as.integer(floor(nrow(cis) / 2))` records from cis.
set.seed(2024)
census <- census[sample(nrow(census), floor(nrow(census) / 2)), ]
cis <- cis[sample(nrow(cis), floor(nrow(cis) / 2)), ]
We need to create a new column, txt, in each data set that concatenates the variables from pername1 to enumpc.
census[, txt := paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
cis[, txt := paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
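As a minimal sketch of what this concatenation produces, the toy table below (made-up values, not taken from the census or cis data) builds a txt column the same way. Note that paste0() turns a missing value into the literal string "NA", so missing fields still contribute characters to the text that is compared.

```r
library(data.table)

# Toy records with made-up values; column names mirror a subset of the real ones.
toy <- data.table(
  pername1 = c("ANNA", "JOHN"),
  pername2 = c("SMITH", NA),
  sex      = c("F", "M"),
  dob_year = c(1980L, 1975L)
)

# := adds the column by reference, so no reassignment is needed.
toy[, txt := paste0(pername1, pername2, sex, dob_year)]
toy$txt
```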
blocking package
The goal of this exercise is to link units from the CIS dataset to the CENSUS dataset.
result1 <- blocking(x = census$txt, y = cis$txt, verbose = 1)
Distribution of distances for each pair.
hist(result1$result$dist, main = "Distribution of distances between pairs", xlab = "Distances")
Example pairs.
head(result1$result, n = 10)
Let's take a look at the first pair. There is clearly a typo in pername1, but all the other variables are the same, so it appears to be a match.
cbind(t(census[1, c(1:7, 9:10)]), t(cis[12088, 1:9]))
For some records, we have information about the correct linkage. We can use this information to evaluate our approach.
matches <- merge(x = census[, .(x = 1:.N, person_id)], y = cis[, .(y = 1:.N, person_id)], by = "person_id")
matches[, block := 1:.N]
head(matches)
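The same construction on two toy tables (hypothetical IDs) illustrates how the row positions x and y are paired through the shared person_id. Records whose ID appears in only one table drop out of the inner join.

```r
library(data.table)

# Hypothetical IDs; each table stores its records in a different order.
a <- data.table(person_id = c("P1", "P2", "P3"))
b <- data.table(person_id = c("P3", "P9", "P1"))

# Record the row position in each table, then inner-join on the shared ID.
pairs <- merge(
  x = a[, .(x = 1:.N, person_id)],
  y = b[, .(y = 1:.N, person_id)],
  by = "person_id"
)

# Each known true pair gets its own block label.
pairs[, block := 1:.N]
pairs
```

Only P1 and P3 occur in both tables, so the result has two rows, each linking a position in a to a position in b.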
So in our example we have `r nrow(matches)` pairs.
result2 <- blocking(x = census$txt, y = cis$txt, verbose = 1, true_blocks = matches[, .(x, y, block)])
Let's see how our approach handled this problem.
result2
It seems that the default parameters of the NND method result in an FNR of `r sprintf("%.2f", result2$metrics["fnr"] * 100)`%. We can see if increasing the epsilon parameter, as suggested in the Nearest Neighbor Descent vignette, will help.
ann_control_pars <- controls_ann()
ann_control_pars$nnd$epsilon <- 0.2
result3 <- blocking(x = census$txt, y = cis$txt, verbose = 1, true_blocks = matches[, .(x, y, block)], control_ann = ann_control_pars)
Changing the epsilon search parameter from 0.1 to 0.2 decreased the FNR to `r sprintf("%.2f", result3$metrics["fnr"] * 100)`%.
result3
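For reference, the FNR in its standard definition is FN / (FN + TP): the share of known true pairs that the blocking step failed to bring together. The sketch below computes it on made-up pair sets. The helper name fnr_from_pairs is hypothetical, and this is an illustration of the definition, not the package's internal evaluation code.

```r
# Known true pairs and pairs proposed by a blocking step (made-up values).
true_pairs  <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
found_pairs <- data.frame(x = c(1, 2, 3, 5), y = c(10, 20, 31, 50))

# Hypothetical helper: share of true pairs missing from the found pairs.
fnr_from_pairs <- function(truth, found) {
  key_truth <- paste(truth$x, truth$y, sep = "-")
  key_found <- paste(found$x, found$y, sep = "-")
  tp <- sum(key_truth %in% key_found)  # true pairs that were recovered
  fn <- length(key_truth) - tp         # true pairs that were missed
  fn / (fn + tp)
}

fnr_from_pairs(true_pairs, found_pairs)
```

Two of the four true pairs are recovered here, so the function returns 0.5.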
Next, compare the NND and HNSW algorithms for this example.
result4 <- blocking(x = census$txt, y = cis$txt, verbose = 1, true_blocks = matches[, .(x, y, block)], ann = "hnsw")
It seems that the HNSW algorithm performed with an FNR of `r sprintf("%.2f", result4$metrics["fnr"] * 100)`%.
result4
Finally, we can compare the results of the two ANN algorithms. The overlap between neighbours is given by:
c("no tuning" = mean(result2$result[order(y)]$x == result4$result[order(y)]$x) * 100,
  "with tuning" = mean(result3$result[order(y)]$x == result4$result[order(y)]$x) * 100)
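This comparison relies on ordering both result tables by y, which aligns them record by record; the mean of the elementwise equality is then the percentage of y records assigned the same x by both algorithms. A toy version with made-up assignments, assuming each y appears exactly once in each result:

```r
library(data.table)

# Made-up nearest-neighbour assignments from two hypothetical runs.
run_a <- data.table(y = c(3, 1, 2, 4), x = c(7, 5, 6, 8))
run_b <- data.table(y = c(1, 2, 3, 4), x = c(5, 9, 7, 8))

# Sort both by y so that position i refers to the same y record in each run.
overlap <- mean(run_a[order(y)]$x == run_b[order(y)]$x) * 100
overlap
```

The two runs agree on three of the four y records, giving an overlap of 75.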