View source: R/select_n_to_m.R
select_greedy.cluster_pairs | R Documentation |
Select matching pairs enforcing one-to-one linkage
## S3 method for class 'cluster_pairs'
select_greedy(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
...
)
## S3 method for class 'cluster_pairs'
select_n_to_m(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
...
)
select_greedy(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
...
)
## S3 method for class 'pairs'
select_greedy(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
x = attr(pairs, "x"),
y = attr(pairs, "y"),
inplace = FALSE,
include_ties = FALSE,
n = 1L,
m = 1L,
...
)
select_n_to_m(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
...
)
## S3 method for class 'pairs'
select_n_to_m(
pairs,
variable,
score,
threshold,
preselect = NULL,
id_x = NULL,
id_y = NULL,
x = attr(pairs, "x"),
y = attr(pairs, "y"),
inplace = FALSE,
...
)
pairs |
a |
variable |
the name of the new variable to create in pairs. This will be a
logical variable with a value of |
score |
name of the score/weight variable of the pairs. When not given
and |
threshold |
the threshold to apply. Pairs with a score above the threshold are selected. |
preselect |
a logical variable with the same length as |
id_x |
an integer vector with the same length as the number of rows in
|
id_y |
an integer vector with the same length as the number of rows in
|
... |
Used to pass additional arguments to methods |
x |
|
y |
|
inplace |
logical indicating whether |
include_ties |
logical; when several pairs for a given record have an equal weight, should all of these pairs be included? |
n |
an integer. Each element of x can be linked to at most n elements of y. |
m |
an integer. Each element of y can be linked to at most m elements of x. |
Both methods force one-to-one matching. select_greedy uses a greedy algorithm that selects the first pair with the highest weight. select_n_to_m tries to optimise the total weight of all of the selected pairs. In general this will result in a better selection. However, select_n_to_m uses much more memory and is much slower and, therefore, can only be used when the number of possible pairs is not too large.

Note that when include_ties = TRUE the same record can still be selected more than once. In that case the pairs will have an equal weight.
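
The difference between the two strategies can be seen on a toy example. The sketch below is an illustration in base R only and is not the implementation used by the package: the data frame toy_pairs and the helper functions toy_greedy and toy_optimal are hypothetical names introduced here, and the "optimal" selection is found by brute-force enumeration, which is only feasible for a handful of pairs.

```r
# Toy candidate pairs; the column names x, y and score are illustrative only.
toy_pairs <- data.frame(
  x = c(1L, 1L, 2L, 2L),
  y = c(1L, 2L, 1L, 2L),
  score = c(0.9, 0.8, 0.8, 0.1)
)
threshold <- 0.5

# Greedy one-to-one selection (hypothetical helper): walk through the pairs in
# order of decreasing score and keep a pair only when neither of its records
# has been used by an earlier, higher-scoring pair.
toy_greedy <- function(pairs, threshold) {
  selected <- logical(nrow(pairs))
  used_x <- integer(0)
  used_y <- integer(0)
  for (i in order(pairs$score, decreasing = TRUE)) {
    if (pairs$score[i] <= threshold) break
    if (!(pairs$x[i] %in% used_x) && !(pairs$y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, pairs$x[i])
      used_y <- c(used_y, pairs$y[i])
    }
  }
  selected
}

# Brute-force "optimal" one-to-one selection (hypothetical helper): try every
# subset of the pairs above the threshold and keep the valid subset with the
# highest total score.
toy_optimal <- function(pairs, threshold) {
  candidates <- which(pairs$score > threshold)
  best <- logical(nrow(pairs))
  best_total <- 0
  for (mask in seq_len(2^length(candidates) - 1)) {
    pick <- candidates[bitwAnd(mask, 2^(seq_along(candidates) - 1)) > 0]
    # A subset is valid when no x record and no y record occurs twice.
    valid <- !anyDuplicated(pairs$x[pick]) && !anyDuplicated(pairs$y[pick])
    total <- sum(pairs$score[pick])
    if (valid && total > best_total) {
      best <- seq_len(nrow(pairs)) %in% pick
      best_total <- total
    }
  }
  best
}

greedy <- toy_greedy(toy_pairs, threshold)
optimal <- toy_optimal(toy_pairs, threshold)

sum(toy_pairs$score[greedy])   # total score of the greedy selection
sum(toy_pairs$score[optimal])  # total score of the brute-force selection
```

In this toy data the greedy pass takes the single best pair (x = 1, y = 1) first, which blocks both remaining candidates, whereas the exhaustive search keeps the two pairs with score 0.8 and ends with a higher total. The exhaustive search also examines a number of subsets that doubles with every additional candidate pair, which gives a feel for why an optimising selection is more expensive than a greedy one.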
Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.
data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
pairs <- predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with an mpost > 0.5 and force one-to-one linkage
pairs <- select_n_to_m(pairs, "ntom", "mpost", 0.5)
pairs <- select_greedy(pairs, "greedy", "mpost", 0.5)
table(pairs$ntom, pairs$greedy)
# The same example as above using a cluster
library(parallel)
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with an mpost > 0.5 and force one-to-one linkage.
# select_n_to_m and select_greedy only work on pairs that are local;
# therefore we first collect the pairs
select_threshold(pairs, "selected", "mpost", 0.5)
local_pairs <- cluster_collect(pairs, "selected")
local_pairs <- select_n_to_m(local_pairs, "ntom", "mpost", 0.5)
local_pairs <- select_greedy(local_pairs, "greedy", "mpost", 0.5)
table(local_pairs$ntom, local_pairs$greedy)
stopCluster(cl)