pair_ann: Integration with the reclin2 package

View source: R/reclin2_pair_ann.R

pair_ann    R Documentation

Integration with the reclin2 package

Description

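Generates candidate record pairs for use with the reclin2 package, using the ANN (approximate nearest neighbour) search provided by the blocking function. The function is based on pair_minsim and reuses some of its source code.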

Usage

pair_ann(
  x,
  y = NULL,
  on,
  deduplication = TRUE,
  keep_block = TRUE,
  add_xy = TRUE,
  ...
)

Arguments

x

reference data (a data.frame or a data.table).

y

query data (a data.frame or a data.table; default NULL).

on

a character string giving the column name, or a character vector of column names, to use for the ANN search.

deduplication

whether deduplication should be performed (default TRUE).

keep_block

whether to keep the block variable in the returned set of pairs (default TRUE).

add_xy

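whether to add x and y to the returned pairs (default TRUE), so that later steps such as compare_pairs can be called without passing the data again.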

...

arguments passed to the blocking function.

Value

Returns a data.table with two columns, .x and .y, which hold row numbers from the data.frames x and y respectively. The returned data.table also has the class pairs, which allows it to be passed to the compare_pairs function.
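
As a minimal sketch of how the .x and .y row numbers map back to records, the base R snippet below uses a small hand-built table of pairs. The data frames, their contents, and the candidate_pairs object are hypothetical and made up for illustration; they are not output of pair_ann, and the real return value is a data.table of class pairs rather than a plain data.frame.

```r
# Hypothetical stand-ins for the reference data (x) and the query data (y).
x <- data.frame(id  = c("a1", "a2", "a3"),
                txt = c("annsmith", "bobjones", "carollee"),
                stringsAsFactors = FALSE)
y <- data.frame(id  = c("b1", "b2"),
                txt = c("carollee", "ansmith"),
                stringsAsFactors = FALSE)

# Hand-built pairs table for illustration only (NOT produced by pair_ann):
# each row reads "row .x of x is a candidate match for row .y of y".
candidate_pairs <- data.frame(.x = c(3L, 1L), .y = c(1L, 2L))

# Look up the paired records by row number and place them side by side.
paired <- data.frame(
  x_id  = x$id[candidate_pairs$.x],
  x_txt = x$txt[candidate_pairs$.x],
  y_id  = y$id[candidate_pairs$.y],
  y_txt = y$txt[candidate_pairs$.y],
  stringsAsFactors = FALSE
)
print(paired)
```

Because the pairs hold only row numbers, reordering or subsetting x or y after pairing would make the indices point at the wrong records.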

Author(s)

Maciej Beręsewicz

Examples


# example using two datasets from reclin2


if (requireNamespace("reclin2", quietly = TRUE)) {

  library(reclin2)
  data("linkexample1", "linkexample2", package = "reclin2")

  # build one lower-case text key per record, with whitespace removed
  linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
  linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

  # pair records from linkexample2 with linkexample1 based on the txt column,
  # then compare, score, and link the candidate pairs with reclin2
  pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
    compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
    score_simple("score", on = "txt") |>
    select_threshold("threshold", score = "score", threshold = 0.75) |>
    link(selection = "threshold")
}


blocking documentation built on June 18, 2025, 9:16 a.m.