select_n_to_m: Select matching pairs enforcing one-to-one linkage

View source: R/select_n_to_m.R

Description

Select matching pairs enforcing one-to-one linkage

Usage

select_greedy(
  pairs,
  threshold = NULL,
  weight,
  var = "select",
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_n_to_m(
  pairs,
  threshold = NULL,
  weight = NULL,
  var = "select",
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

Arguments

pairs

a pairs object, such as one generated by pair_blocking

threshold

the threshold to apply. Pairs with a score above the threshold are selected.

weight

name of the score/weight variable of the pairs. When not given and attr(pairs, "score") is defined, that is used.

var

the name of the new variable to create in pairs. This will be a logical variable with a value of TRUE for the selected pairs.

preselect

a logical vector with one element per row of pairs, or the name of such a variable in pairs. Pairs are only selected when preselect is TRUE. This is combined with threshold: a pair has to satisfy both conditions to be selected.

id_x

an integer vector with the same length as the number of rows in pairs, or the name of a column in x. This vector should identify unique objects in x. When not specified it is assumed that each element in x is unique.

id_y

an integer vector with the same length as the number of rows in pairs, or the name of a column in y. This vector should identify unique objects in y. When not specified it is assumed that each element in y is unique.

...

passed on to other methods.

n

the number of records from x that can at most be linked to a record in y.

m

the number of records from y that can at most be linked to a record in x.
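As described for threshold and preselect above, a pair is only a candidate for selection when its score is above the threshold and preselect is TRUE. The following is a minimal base R sketch of that combination on made-up values. The vectors weight, preselect and eligible are hypothetical names used for illustration only and are not part of the package.

```r
# Hypothetical scores and preselection flags for four candidate pairs.
weight    <- c(6.5, 4.0, 7.2, 1.0)
preselect <- c(TRUE, TRUE, FALSE, TRUE)
threshold <- 5

# A pair remains a candidate only if it passes both conditions.
eligible <- weight > threshold & preselect
eligible
# [1]  TRUE FALSE FALSE FALSE
```

The third pair has the highest score but is dropped because preselect is FALSE; the second and fourth fail the threshold.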

Details

Both methods force one-to-one matching. select_greedy uses a greedy algorithm: it repeatedly selects the pair with the highest weight among the pairs that can still be selected. select_n_to_m tries to optimise the total weight of all of the selected pairs, which in general results in a better selection. However, select_n_to_m uses much more memory and is much slower, so it can only be used when the number of possible pairs is not too large.
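The difference between the two strategies can be illustrated on a toy example in base R. This is a sketch only and is not the implementation used by the package: greedy_select and optimal_select are hypothetical helper functions, the data are made up, and the optimum is found by brute force over all subsets, which is only feasible for a handful of pairs.

```r
# Toy candidate pairs: two records in x, two in y, with a similarity weight.
pairs <- data.frame(
  x = c(1, 1, 2, 2),
  y = c(1, 2, 1, 2),
  weight = c(10, 9, 9, 1)
)

# Greedy one-to-one selection (hypothetical helper): walk the pairs from
# highest to lowest weight and keep a pair only if neither of its records
# has been used by an earlier pair.
greedy_select <- function(pairs) {
  selected <- logical(nrow(pairs))
  used_x <- c()
  used_y <- c()
  for (i in order(pairs$weight, decreasing = TRUE)) {
    if (!(pairs$x[i] %in% used_x) && !(pairs$y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, pairs$x[i])
      used_y <- c(used_y, pairs$y[i])
    }
  }
  selected
}

# Brute-force optimum (hypothetical helper): enumerate every non-empty
# subset of pairs and keep the one-to-one subset with the largest total
# weight.
optimal_select <- function(pairs) {
  n <- nrow(pairs)
  best <- logical(n)
  best_total <- 0
  for (k in seq_len(2^n - 1)) {
    # Decode k into a logical mask: bit j of k says whether pair j is in.
    sel <- (k %/% 2^(seq_len(n) - 1)) %% 2 == 1
    one_to_one <- anyDuplicated(pairs$x[sel]) == 0 &&
      anyDuplicated(pairs$y[sel]) == 0
    total <- sum(pairs$weight[sel])
    if (one_to_one && total > best_total) {
      best <- sel
      best_total <- total
    }
  }
  best
}

greedy  <- greedy_select(pairs)
optimal <- optimal_select(pairs)
sum(pairs$weight[greedy])   # 11: pairs (1,1) and (2,2)
sum(pairs$weight[optimal])  # 18: pairs (1,2) and (2,1)
```

The greedy pass takes the single best pair first, which forces it to accept the weakest pair afterwards. Optimising the total weight gives up that best pair in exchange for two good ones.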

Value

Returns the pairs with the variable given by var added. This is a logical variable indicating which pairs are selected as matches.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
pairs <- score_simsum(pairs)

# Select pairs with a simsum > 0 and force one-to-one linkage
pairs <- select_n_to_m(pairs, 0, var = "ntom")
pairs <- select_greedy(pairs, 0, var = "greedy")
table(pairs[c("ntom", "greedy")])

reclin documentation built on Nov. 23, 2021, 9:09 a.m.