View source: R/cluster_pair_blocking.R
cluster_pair_blocking | R Documentation |
Generates all combinations of records from x and y where the blocking variables are equal.
cluster_pair_blocking(
cluster,
x,
y,
on,
deduplication = FALSE,
name = "default"
)
cluster | a cluster object as created by makeCluster |
x | first data set |
y | second data set |
on | the variables defining the blocks or strata for which all pairs of x and y will be generated |
deduplication | generate pairs from only x |
name | the name of the resulting object to create locally on the different R processes |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too many pairs. Therefore, blocking is usually applied: only pairs of records that agree on the blocking variables are generated.
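To give a feel for how much blocking can reduce the number of pairs, here is a minimal base-R sketch on toy data. The data frames and column names are hypothetical, and merge is used only as a stand-in for the idea of blocking, not as the package's implementation.

```r
# Toy data; the column names are hypothetical
x <- data.frame(id = 1:4, postcode = c("A", "A", "B", "C"))
y <- data.frame(id = 1:3, postcode = c("A", "B", "B"))

# Without blocking: every record of x is paired with every record of y
n_all <- nrow(x) * nrow(y)

# With blocking on postcode: only pairs with equal postcode are kept
blocked <- merge(x, y, by = "postcode", suffixes = c(".x", ".y"))
n_blocked <- nrow(blocked)
```

Here n_all counts the full cross product, while n_blocked counts only the pairs that share a postcode.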
x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. On each node, pair_blocking is then called. The pairs are stored on the nodes in the global object reclin_env, under the name given by the name argument. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
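The split-and-distribute scheme described above can be sketched with the parallel package alone. This is only an illustration under stated assumptions: the toy data and column names are hypothetical, merge stands in for pair_blocking, and the pairs are returned to the main process rather than kept on the nodes as the real function does.

```r
library(parallel)

# Toy data; the column names are hypothetical
x <- data.frame(id = 1:6, postcode = c("A", "A", "B", "B", "C", "C"))
y <- data.frame(id = 1:4, postcode = c("A", "B", "B", "D"))

cl <- makeCluster(2)

# Split the row indices of x into length(cl) roughly equal parts
idx <- seq_len(nrow(x))
parts <- split(idx, cut(idx, length(cl), labels = FALSE))
chunks <- lapply(parts, function(i) x[i, , drop = FALSE])

# Each worker receives one chunk of x and a full copy of y, and
# builds the pairs that agree on the blocking variable
pairs_per_node <- clusterApply(cl, chunks, function(x_part, y, on) {
  merge(x_part, y, by = on, suffixes = c(".x", ".y"))
}, y = y, on = "postcode")

stopCluster(cl)

# Number of pairs built on each worker
n_pairs <- sapply(pairs_per_node, nrow)
```

Because every worker holds all of y, the union of the per-worker pairs equals the blocked pairs of the full x and y, so the split of x does not change the result.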
An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.
cluster_pair and cluster_pair_minsim are other methods to generate pairs.
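library(parallel)
# Load the two example data sets
data("linkexample1", "linkexample2")
# Start a cluster with two worker processes
cl <- makeCluster(2)
# Generate, on the workers, all pairs that agree on postcode
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
# Shut down the workers when done
stopCluster(cl)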