View source: R/cluster_pair_minsim.R
cluster_pair_minsim | R Documentation |
Generates all combinations of records from x and y where the
blocking variables are equal.
cluster_pair_minsim(
cluster,
x,
y,
on,
minsim = 0,
on_blocking = character(0),
comparators = list(default_comparator),
default_comparator = cmp_identical(),
keep_simsum = TRUE,
deduplication = FALSE,
name = "default"
)
cluster | a cluster object as created by |
x | first |
y | second |
on | the variables defining the blocks or strata for which all pairs of |
minsim | minimal similarity score. |
on_blocking | variables for which the pairs have to match. |
comparators | named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a |
default_comparator | variables for which no comparison function is defined using |
keep_simsum | add a variable |
deduplication | generate pairs from only |
name | the name of the resulting object to create locally on the different R processes. |
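The comparators argument takes functions that accept two vectors and return a vector of scores. As a minimal sketch of that shape, the following defines a hypothetical comparator factory, cmp_same_prefix, which is not part of the package and is only meant to illustrate the expected interface:

```r
# Hypothetical comparator factory (an assumption for illustration, not a
# package function): returns a function that compares two vectors and
# scores 1 when the first n characters agree and 0 otherwise.
cmp_same_prefix <- function(n = 2) {
  function(x, y) {
    as.numeric(substr(x, 1, n) == substr(y, 1, n))
  }
}

# A named list in the form the comparators argument describes:
# the name is the variable the function applies to.
comparators <- list(postcode = cmp_same_prefix(2))

# Calling the comparator directly on two toy vectors.
prefix_scores <- comparators$postcode(c("1234", "5678"), c("1299", "9978"))
print(prefix_scores)
```

Variables without an entry in such a list would fall back to default_comparator, as described in the argument table.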
Generating (all) pairs of the records of two data sets is usually the first
step when linking the two data sets. However, this often results in too
many pairs. pair_minsim will only keep pairs with a similarity score equal
to or larger than minsim. The similarity score is calculated by summing the
results of the comparators for all variables of on.
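The scoring rule described above can be sketched in base R. This is an illustration of the idea only, not the package's implementation: the toy data, the column names .x, .y and simsum, and the use of a plain equality comparison are all assumptions made for the example.

```r
# Toy data; the values are made up for illustration.
x <- data.frame(id = 1:3,
                postcode = c("1234", "1234", "5678"),
                address  = c("Main St 1", "High St 2", "Park Ln 3"),
                stringsAsFactors = FALSE)
y <- data.frame(id = 1:2,
                postcode = c("1234", "9999"),
                address  = c("Main St 1", "Park Ln 3"),
                stringsAsFactors = FALSE)

on <- c("postcode", "address")
minsim <- 1

# All combinations of row indices from x and y.
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# One column of 0/1 scores per variable in `on` (1 when the values are equal).
scores <- vapply(
  on,
  function(v) as.numeric(x[[v]][pairs$.x] == y[[v]][pairs$.y]),
  numeric(nrow(pairs))
)

# The similarity score of a pair is the sum over all compared variables.
pairs$simsum <- rowSums(scores)

# Keep only the pairs whose score reaches the threshold.
kept <- pairs[pairs$simsum >= minsim, ]
print(kept)
```

With minsim = 1 and two variables, a pair survives when at least one of postcode and address agrees; raising minsim to 2 would require both to agree.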
x is split into length(cluster) parts which are distributed over the worker
nodes. y is copied to each of the nodes. On the nodes pair_minsim is then
called. The pairs are stored in the global object reclin_env on the nodes in
the variable name. The pairs can then be further processed using functions
such as compare_pairs and tabulate_patterns. The function cluster_collect
collects the pairs from each of the nodes.
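The split-and-copy pattern described above can be sketched with the parallel package alone. This is a simplified illustration under stated assumptions, not the package's code: the toy data and the equality match on postcode are made up, and the sketch returns the pairs to the main process directly instead of keeping them in an environment on the nodes.

```r
library(parallel)

# Toy data; the values are made up for illustration.
x <- data.frame(id = 1:6, postcode = c("A", "B", "A", "C", "B", "A"),
                stringsAsFactors = FALSE)
y <- data.frame(id = 1:2, postcode = c("A", "B"), stringsAsFactors = FALSE)

cl <- makeCluster(2)

# Split the rows of x into length(cl) chunks, one per worker node.
chunks <- splitIndices(nrow(x), length(cl))
x_parts <- lapply(chunks, function(i) x[i, , drop = FALSE])

# Each worker receives one part of x and a full copy of y, and builds the
# pairs for its own part (here: a simple equality match on postcode).
node_pairs <- clusterApply(cl, x_parts, function(xp, y) {
  p <- expand.grid(.x = seq_len(nrow(xp)), .y = seq_len(nrow(y)))
  p <- p[xp$postcode[p$.x] == y$postcode[p$.y], ]
  data.frame(x_id = xp$id[p$.x], y_id = y$id[p$.y])
}, y = y)

stopCluster(cl)

# Combine the per-node results into one table on the main process.
all_pairs <- do.call(rbind, node_pairs)
print(all_pairs)
```

Because only x is split, every record of x is still compared against all of y, so the combined result matches what a single process would produce.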
An object of type cluster_pairs, which is a list containing the
cluster and the name of the pairs object on the cluster nodes. For the pairs
objects created on the nodes see the documentation of pair.
cluster_pair and cluster_pair_blocking are other methods to generate pairs.
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
# Start a cluster with two worker processes
cl <- makeCluster(2)
# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
# Shut down the worker processes
stopCluster(cl)