cluster_pair_minsim: Generate pairs with a minimal similarity using multiple...
In reclin2: Record Linkage Toolkit

cluster_pair_minsim

R Documentation

Generate pairs with a minimal similarity using multiple processes

Description

Generates all pairs of records from x and y that have a similarity score of at least minsim, distributing the work over the worker processes of a cluster.

Usage

cluster_pair_minsim(
  cluster,
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  name = "default"
)

Arguments

`cluster`	a cluster object as created by `makeCluster` from `parallel` or `makeCluster` from `snow`.
`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`on`	the variables on which the records of `x` and `y` are compared. The similarity score of a pair is the sum of the comparison results over these variables.
`minsim`	minimal similarity score.
`on_blocking`	variables for which the pairs have to match.
`comparators`	named list of functions with which the variables are compared. Each function should accept two vectors and return either a vector or a `data.table` with multiple columns.
`default_comparator`	function used to compare variables for which no comparison function is defined in `comparators`.
`keep_simsum`	add a variable `simsum` to the result with the similarity score of the pair.
`deduplication`	generate pairs from `x` only and ignore `y`. This is useful for deduplication of `x`.
`name`	the name of the resulting object to create locally on the different R processes.
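
As an illustration of the contract described for `comparators` (a function that accepts two vectors and returns a vector), the following base R sketch defines a comparator factory. The name `cmp_prefix` and the sample values are hypothetical and not part of reclin2; how missing values should be handled is left open here.

```r
# Hypothetical comparator factory (not part of reclin2): returns a function
# that scores two vectors as 1 when their first n characters agree, else 0.
cmp_prefix <- function(n = 4) {
  function(a, b) {
    as.numeric(substr(a, 1, n) == substr(b, 1, n))
  }
}

# A named list in the shape expected by the `comparators` argument:
# the name is the variable the function applies to.
comparators <- list(postcode = cmp_prefix(4))

# Calling the comparator directly on two made-up vectors
res <- comparators$postcode(c("1234AB", "6789XY"), c("1234CD", "6700XY"))
print(res)
```

Variables in `on` without an entry in such a list would fall back to `default_comparator`.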

Details

Generating (all) pairs of records from two data sets is usually the first step when linking them. However, this often results in too many pairs. pair_minsim only keeps pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators over all variables in on.
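
The scoring rule can be sketched in base R. This is a naive illustration that enumerates every pair first and uses plain equality as the comparator; the package itself may compute the score differently and more efficiently. The data frames and the column name `score` are made up for the example.

```r
# Made-up toy data, not the package's example data
x <- data.frame(postcode = c("6789XY", "1234AB"),
                address  = c("12 Main St", "5 Side Rd"),
                stringsAsFactors = FALSE)
y <- data.frame(postcode = c("6789XY", "1234AB", "9999ZZ"),
                address  = c("12 Main St", "7 Other Ln", "1 Far Ave"),
                stringsAsFactors = FALSE)

on <- c("postcode", "address")
minsim <- 1

# All candidate pairs, as row indices into x and y
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# One column of TRUE/FALSE per variable in `on`, one row per pair
agree <- sapply(on, function(v) x[[v]][pairs$.x] == y[[v]][pairs$.y])

# Similarity score: sum of the comparison results over the variables
pairs$score <- rowSums(agree)

# Keep only the pairs that reach the minimal similarity
kept <- pairs[pairs$score >= minsim, ]
print(kept)
```

With `minsim = 1` and two variables, a pair survives when at least one of the variables agrees.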

x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. On the nodes, pair_minsim is then called. The pairs are stored on the nodes in the global object reclin_env, in a variable with the name given by name. The pairs can then be processed further using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
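
The split, distribute and collect pattern described above can be sketched with the `parallel` package alone. This is not the package's implementation: the helper `find_pairs`, the toy data and the single-variable equality check are assumptions made for illustration, and the results are returned to the main process directly instead of being stored on the nodes.

```r
library(parallel)

# Made-up toy data
x <- data.frame(id = 1:4, postcode = c("A1", "A1", "B2", "C3"),
                stringsAsFactors = FALSE)
y <- data.frame(id = 1:3, postcode = c("A1", "B2", "B2"),
                stringsAsFactors = FALSE)

cl <- makeCluster(2)

# Split the rows of x into one part per worker
chunks <- splitIndices(nrow(x), length(cl))
parts <- lapply(chunks, function(i) x[i, , drop = FALSE])

# Hypothetical worker function: pairs one part of x with all of y
# and keeps the pairs whose postcodes are equal
find_pairs <- function(xpart, ydat) {
  grid <- expand.grid(xi = seq_len(nrow(xpart)), yi = seq_len(nrow(ydat)))
  keep <- xpart$postcode[grid$xi] == ydat$postcode[grid$yi]
  data.frame(.x = xpart$id[grid$xi[keep]], .y = ydat$id[grid$yi[keep]])
}

# Each worker receives one part of x and a full copy of y
per_node <- clusterApply(cl, parts, find_pairs, ydat = y)
stopCluster(cl)

# Collecting: stack the per-node results into one table
pairs <- do.call(rbind, per_node)
print(pairs)
```

Copying y to every node costs memory on each worker, but it means no pair between a part of x and any record of y is missed.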

Value

An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.

Examples

library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")

# Start a cluster with two worker processes
cl <- makeCluster(2)

# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2, 
   on = c("postcode", "address"), minsim = 1)

# Shut down the worker processes
stopCluster(cl)

reclin2 documentation built on May 29, 2024, 4:21 a.m.