Blocking records for deduplication
In blocking: Various Blocking Methods for Entity Resolution

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)

Setup

Read required packages.

library(blocking)
library(data.table)

Read the RLdata500 data (taken from the RecordLinkage package).

data(RLdata500)
head(RLdata500)

This dataset contains r nrow(RLdata500) rows with r NROW(unique(RLdata500$ent_id)) entities.

Blocking for deduplication

Now we create a new column that concatenates the information in each row.

RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)

In the next step we use the newly created column in the blocking function. If we specify verbose, we get information about the progress.

df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)

Results are as follows:

based on rnndescent we have created r NROW(unique(df_blocks$result$block)) blocks,
it was based on r NROW(unique(df_blocks$colnames)) columns (2 character shingles),
we have 45 blocks of 2 elements, 33 blocks of 3 elements, ..., 1 block of 17 elements.

df_blocks

Structure of the object is as follows:

result -- a data.table with identifiers and block IDs,
method -- the method used,
deduplication -- whether deduplication was applied,
representation -- whether shingles or vectors were used,
metrics -- standard metrics and based on the igraph::compare methods for comparing graphs (here NULL),
confusion -- confusion matrix (here NULL),
colnames -- column names used for the comparison,
graph -- an igraph object mainly for visualisation.

str(df_blocks,1)

Plot connections.

plot(df_blocks$graph, vertex.size=1, vertex.label = NA)

The resulting data.table has four columns:

x -- reference dataset (i.e. RLdata500) -- this may not contain all units of RLdata500,
y - query (each row of RLdata500) -- this may not contain all units of RLdata500,
block -- the block ID,
dist -- distance between objects.

head(df_blocks$result)

Create long data.table with information on blocks and units from original dataset.

df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)

We add block information to the final dataset.

RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block]
head(RLdata500)

We can check in how many blocks the same entities (ent_id) are observed. In our example, all the same entities are in the same blocks.

RLdata500[, .(uniq_blocks = uniqueN(block_id)), .(ent_id)][, .N, uniq_blocks]

We can visualise the distances between units stored in the df_blocks$result data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1).

hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")

Finally, we can visualise the result based on the information whether block contains matches or not.

df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]

plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), 
     main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))