cluster_pair: Generate all possible pairs using multiple processes

View source: R/cluster_pair.R

cluster_pair R Documentation

Description

Generates all combinations of records from x and y.

Usage

cluster_pair(cluster, x, y, deduplication = FALSE, name = "default")

Arguments

cluster

a cluster object as created by makeCluster from parallel or from the snow package.

x

first data.frame

y

second data.frame. Ignored when deduplication = TRUE.

deduplication

generate pairs from x only and ignore y. This is useful for deduplication of x.

name

the name of the resulting object to create locally on the different R processes.
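
As a rough illustration of what deduplication = TRUE implies for the number of pairs, the sketch below uses only base R. It assumes that each unordered pair of records in x is kept once and that a record is not paired with itself; this is an assumption about the result, not a description of the package internals. The toy data frame and the column names .x and .y are illustrative.

```r
# Toy data set with four records (hypothetical data)
x <- data.frame(id = 1:4)
n <- nrow(x)

# All unordered pairs of row indices, each pair listed once and no self-pairs
idx <- t(combn(n, 2))
dedup_pairs <- data.frame(.x = idx[, 1], .y = idx[, 2])

# Number of pairs under this assumption: n * (n - 1) / 2
n_dedup <- nrow(dedup_pairs)
```

Under this assumption, deduplicating n records needs about half as many pairs as linking x with a copy of itself.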

Details

Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets.

x is split into length(cluster) parts, which are distributed over the worker nodes; y is copied to each of the nodes. pair is then called on each node. The pairs are stored on the nodes in the global object reclin_env, in the variable given by name. The pairs can then be processed further using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
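
The splitting scheme described above can be sketched with the parallel package alone. This is a minimal illustration, not the code used by cluster_pair: the toy data frames and the use of splitIndices, clusterApply and expand.grid are assumptions chosen for the example, and the pairs are returned to the main process here rather than kept on the nodes.

```r
library(parallel)

# Toy data sets (hypothetical data)
x <- data.frame(id = 1:5)
y <- data.frame(id = 1:3)

cl <- makeCluster(2)

# Split the row indices of x into length(cl) roughly equal parts
idx <- splitIndices(nrow(x), length(cl))

# Each worker receives one part of x and the full size of y,
# and builds all combinations of its x rows with every y row
parts <- clusterApply(cl, idx, function(i, ny) {
  expand.grid(.x = i, .y = seq_len(ny))
}, ny = nrow(y))

stopCluster(cl)

# Combining the per-worker results gives every (x, y) combination once
all_pairs <- do.call(rbind, parts)
n_pairs <- nrow(all_pairs)
```

Because only x is split, each pair is generated on exactly one worker, and the total number of pairs equals nrow(x) * nrow(y).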

Value

An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.

See Also

cluster_pair_blocking and cluster_pair_minsim are other methods to generate pairs.

Examples

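library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
# Start a local cluster with two worker processes
cl <- makeCluster(2)

# Generate all pairs; they are stored on the workers under the name "default"
pairs <- cluster_pair(cl, linkexample1, linkexample2)
# Shut down the worker processes when done
stopCluster(cl)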


reclin2 documentation built on May 29, 2024, 4:21 a.m.