This R package is designed to block records for data deduplication and
record linkage (also known as entity resolution) using approximate
nearest neighbor (ANN) algorithms and graphs (via the igraph package).
It supports R packages that bind to specific ANN algorithms, including
mlpack (see mlpack::lsh and mlpack::knn).
The package can be used with the reclin2 package via the
blocking::pair_ann function.
Install the stable version from CRAN:
install.packages("blocking")
You can also install the development version from GitHub:
# install.packages("pak") # uncomment if needed
pak::pkg_install("ncn-foreigners/blocking")
Load packages for the examples:
library(blocking)
library(reclin2)
#> Loading required package: data.table
Generate simple data with two groups (df_example) and reference data
(df_base):
df_example <- data.frame(txt = c(
  "jankowalski",
  "kowalskijan",
  "kowalskimjan",
  "kowaljan",
  "montypython",
  "pythonmonty",
  "cyrkmontypython",
  "monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))
df_example
#> txt
#> 1 jankowalski
#> 2 kowalskijan
#> 3 kowalskimjan
#> 4 kowaljan
#> 5 montypython
#> 6 pythonmonty
#> 7 cyrkmontypython
#> 8 monty
df_base
#> txt
#> 1 montypython
#> 2 kowalskijan
#> 3 other
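Deduplication using the blocking function. The printed output reports
the method used (nnd refers to the NN descent algorithm), the number of
blocks created (here 2), the number of columns used for blocking, built
with the text2vec package (here 28), and the reduction ratio
(here 0.5714):
blocking_result <- blocking(x = df_example$txt)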
blocking_result
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2
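Two of the numbers in this summary can be checked by hand in base R. The sketch below rests on two assumptions that this README does not state: that the columns are overlapping 2-character shingles of the input strings, and that the reduction ratio is one minus the share of within-block pairs among all possible pairs. Both assumptions reproduce the printed values for this toy data. The helper shingles() is hypothetical and is not part of the package.

```r
# Same strings as df_example$txt above
txt <- c("jankowalski", "kowalskijan", "kowalskimjan", "kowaljan",
         "montypython", "pythonmonty", "cyrkmontypython", "monty")

# Hypothetical helper: all overlapping n-character shingles of one string
shingles <- function(s, n = 2) {
  starts <- seq_len(max(0, nchar(s) - n + 1))
  substring(s, starts, starts + n - 1)
}

# Number of distinct shingles across all strings (assumed to be the columns)
n_columns <- length(unique(unlist(lapply(txt, shingles))))
n_columns
#> [1] 28

# Two blocks of four records each, as in the printed summary
block_sizes <- c(4, 4)
pairs_within <- sum(choose(block_sizes, 2))  # 6 + 6 = 12 comparisons
pairs_total <- choose(length(txt), 2)        # 28 comparisons without blocking

# Assumed definition: share of comparisons avoided by blocking
reduction_ratio <- 1 - pairs_within / pairs_total
round(reduction_ratio, 4)
#> [1] 0.5714
```

Under these assumptions, blocking leaves 12 of the 28 possible comparisons, which is the reduction of about 57% shown in the summary.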
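The table with blocking results contains the row numbers of the paired
records (x and y), the block number (block), and the distance between
the two records (dist):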
blocking_result$result
#> x y block dist
#> <int> <int> <num> <num>
#> 1: 1 2 1 0.10000002
#> 2: 2 3 1 0.14188367
#> 3: 2 4 1 0.28286284
#> 4: 5 6 2 0.08333331
#> 5: 5 7 2 0.13397455
#> 6: 5 8 2 0.27831215
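The pairs in this table can be read as edges of a graph whose connected components are the blocks. The sketch below uses igraph and assumes that blocks correspond to connected components; this fits the package's stated use of graphs, but the README does not spell it out. The pairs data frame is retyped by hand from the table above.

```r
library(igraph)

# Pairs (x, y) copied by hand from blocking_result$result above
pairs <- data.frame(
  x = c(1, 2, 2, 5, 5, 5),
  y = c(2, 3, 4, 6, 7, 8)
)

# Treat each pair as an undirected edge between two record numbers
g <- graph_from_data_frame(pairs, directed = FALSE)

# Connected components of the pair graph (assumed to correspond to blocks)
comp <- components(g)
comp$no
#> [1] 2

# Record numbers grouped by component
blocks <- split(as.integer(V(g)$name), comp$membership)
blocks
```

Records 1 to 4 end up in one component and records 5 to 8 in the other, which agrees with the block column of the table.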
Deduplication using the pair_ann function, which plugs into a reclin2
pipeline:
pair_ann(x = df_example, on = "txt") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#> Total number of pairs: 8 pairs
#>
#> Key: <.y>
#> .y .x txt.x txt.y
#> <int> <int> <char> <char>
#> 1: 2 1 jankowalski kowalskijan
#> 2: 3 1 jankowalski kowalskimjan
#> 3: 3 2 kowalskijan kowalskimjan
#> 4: 4 1 jankowalski kowaljan
#> 5: 4 2 kowalskijan kowaljan
#> 6: 6 5 montypython pythonmonty
#> 7: 7 5 montypython cyrkmontypython
#> 8: 8 5 montypython monty
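To see how individual pairs score before a threshold is applied, the same pipeline can be run one step at a time. This sketch uses only the calls from the pipeline above, with the same arguments. The intermediate object names (pairs, compared, scored, selected, linked) are illustrative, and no output is shown because results may vary between package versions.

```r
library(blocking)
library(reclin2)

df_example <- data.frame(txt = c(
  "jankowalski", "kowalskijan", "kowalskimjan", "kowaljan",
  "montypython", "pythonmonty", "cyrkmontypython", "monty"
))

# Step 1: candidate pairs from ANN-based blocking
pairs <- pair_ann(x = df_example, on = "txt")

# Step 2: compare the txt values of each candidate pair (Jaro-Winkler)
compared <- compare_pairs(pairs, on = "txt",
                          comparators = list(cmp_jarowinkler()))

# Step 3: turn the comparison into a single score column
scored <- score_simple(compared, "score", on = "txt")

# Inspect the scores before settling on a threshold
print(scored)

# Step 4: flag pairs whose score passes the threshold, then link them
selected <- select_threshold(scored, "threshold", score = "score",
                             threshold = 0.55)
linked <- link(selected, selection = "threshold")
linked
```

Printing scored makes it easier to judge whether 0.55 is a sensible cut-off for the data at hand.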
Linking records using the same function, where df_base is the
“register” (the reference data) and df_example holds the records to be
linked to it:
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#> Total number of pairs: 8 pairs
#>
#> Key: <.y>
#> .y .x txt.x txt.y
#> <int> <int> <char> <char>
#> 1: 1 2 kowalskijan jankowalski
#> 2: 2 2 kowalskijan kowalskijan
#> 3: 3 2 kowalskijan kowalskimjan
#> 4: 4 2 kowalskijan kowaljan
#> 5: 5 1 montypython montypython
#> 6: 6 1 montypython pythonmonty
#> 7: 7 1 montypython cyrkmontypython
#> 8: 8 1 montypython monty
See the section Data Integration (Statistical Matching and Record Linkage)
in the Official Statistics Task View.
Packages that allow blocking include reclin2 (pair_blocking and
pair_minsim functions) and fastLink (blockData function).
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.