View source: R/reclin2_pair_ann.R
pair_ann | R Documentation |
Generates candidate record pairs using approximate nearest neighbour (ANN) search, for integration with the reclin2 package. The function is based on reclin2's pair_minsim and reuses some of its source code.
pair_ann(
x,
y = NULL,
on,
deduplication = TRUE,
keep_block = TRUE,
add_xy = TRUE,
...
)
x |
reference data (a data.frame or a data.table), |
y |
query data (a data.frame or a data.table, default NULL), |
on |
a character vector with the name(s) of the column(s) used for the ANN search, |
deduplication |
whether deduplication should be performed (default TRUE), |
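keep_block |
whether to keep the block variable in the returned set of pairs (default TRUE), |
add_xy |
whether to add x and y as attributes of the returned pairs, so that later steps such as compare_pairs can find them (default TRUE), |
... |
additional arguments passed to the blocking function. |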
Returns a data.table with two columns, .x and .y, which hold row numbers from the data.frames x and y, respectively. The returned data.table also has class pairs, which allows it to be passed to the compare_pairs function.
Maciej Beręsewicz
# example using two datasets from reclin2
if (requireNamespace("reclin2", quietly = TRUE)) {
  library(reclin2)
  data("linkexample1", "linkexample2", package = "reclin2")

  # build a single lower-case key with all whitespace removed
  linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
  linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

  # pairing records from linkexample2 to linkexample1 based on txt column
  pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
    compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
    score_simple("score", on = "txt") |>
    select_threshold("threshold", score = "score", threshold = 0.75) |>
    link(selection = "threshold")
}
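
The example above pipes the result straight into compare_pairs, so the returned object itself is never shown. The sketch below stores it and checks the properties stated in the Value section: the pairs class and the .x and .y columns. It assumes pair_ann is provided by the blocking package. The helper make_txt and the variable candidate_pairs are illustrative names, not part of either package.

```r
# Sketch only: assumes pair_ann() is exported by the 'blocking' package.
# `make_txt` and `candidate_pairs` are hypothetical names used for illustration.
if (requireNamespace("reclin2", quietly = TRUE) &&
    requireNamespace("blocking", quietly = TRUE)) {
  library(blocking)
  library(reclin2)
  data("linkexample1", "linkexample2", package = "reclin2")

  # build the same lower-case, whitespace-free key as in the example above
  make_txt <- function(d) {
    key <- with(d, tolower(paste0(firstname, lastname, address, sex, postcode)))
    gsub("\\s+", "", key)
  }
  linkexample1$txt <- make_txt(linkexample1)
  linkexample2$txt <- make_txt(linkexample2)

  # keep the candidate pairs instead of piping them onwards
  candidate_pairs <- pair_ann(x = linkexample1, y = linkexample2,
                              on = "txt", deduplication = FALSE)

  # the result should carry the 'pairs' class and the row-index columns
  print(inherits(candidate_pairs, "pairs"))
  print(all(c(".x", ".y") %in% names(candidate_pairs)))
  print(names(candidate_pairs))
  print(nrow(candidate_pairs))
}
```

Keeping the intermediate object makes it possible to see how many candidate pairs the ANN search produced before any comparison or thresholding is applied.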