knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, message = FALSE, comment = "#>" )
Read required packages.
library(blocking) library(data.table)
Read the RLdata500
data (taken from the RecordLinkage package).
data(RLdata500) head(RLdata500)
This dataset contains r nrow(RLdata500)
rows with r NROW(unique(RLdata500$ent_id))
entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))] head(RLdata500)
In the next step we use the newly created column in the blocking
function. If we specify verbose, we get information about the progress.
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
Results are as follows:
rnndescent
we have created r NROW(unique(df_blocks$result$block))
blocks,r NROW(unique(df_blocks$colnames))
columns (2 character shingles),df_blocks
Structure of the object is as follows:
result
-- a data.table
with identifiers and block IDs,method
-- the method used,deduplication
-- whether deduplication was applied,representation
-- whether shingles or vectors were used,metrics
-- standard metrics and based on the igraph::compare
methods for comparing graphs (here NULL),confusion
-- confusion matrix (here NULL),colnames
-- column names used for the comparison,graph
-- an igraph
object mainly for visualisation.str(df_blocks,1)
Plot connections.
plot(df_blocks$graph, vertex.size=1, vertex.label = NA)
The resulting data.table
has four columns:
x
-- reference dataset (i.e. RLdata500
) -- this may not contain all units of RLdata500
,y
- query (each row of RLdata500
) -- this may not contain all units of RLdata500
,block
-- the block ID,dist
-- distance between objects.head(df_blocks$result)
Create long data.table
with information on blocks and units from original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist")) df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)]) head(df_block_melted_rec_block)
We add block information to the final dataset.
RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block] head(RLdata500)
We can check in how many blocks the same entities (ent_id
) are observed. In our example, all the same entities are in the same blocks.
RLdata500[, .(uniq_blocks = uniqueN(block_id)), .(ent_id)][, .N, uniq_blocks]
We can visualise the distances between units stored in the df_blocks$result
data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd", main = "Distances calculated between units")
Finally, we can visualise the result based on the information whether block contains matches or not.
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id]) df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id] plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), main = "Distribution of distances between\nclusters type (match=red, non-match=blue)") lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.