select_unique: Deselect pairs that are linked to multiple records

View source: R/select_unique.R

select_unique.cluster_pairs    R Documentation

Description

Deselect pairs that are linked to multiple records

Usage

## S3 method for class 'cluster_pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

pairs

a pairs object, such as generated by pair_blocking

variable

the name of the new variable to create in pairs. This will be a logical variable with a value of TRUE for the selected pairs.

preselect

a logical vector with as many elements as pairs has rows, or the name of such a variable in pairs. Pairs are only selected when preselect is TRUE.

n

do not select pairs with a y-record that is linked to more than n records.

m

do not select pairs with an x-record that is linked to more than m records.

id_x

an integer vector with the same length as the number of rows in pairs, or the name of a column in x. This vector should identify unique objects in x. When not specified, it is assumed that each element in x is unique.

id_y

an integer vector with the same length as the number of rows in pairs, or the name of a column in y. This vector should identify unique objects in y. When not specified, it is assumed that each element in y is unique.

...

Used to pass additional arguments to methods

x

data.table with one half of the pairs.

y

data.table with the other half of the pairs.

inplace

logical indicating whether pairs should be modified in place. When pairs is large this can be more efficient.

Details

This function can be used to remove pairs for which there is ambiguity. For example, when a record from x is linked to multiple records from y and we know that there are no duplicate records in y (records that belong to the same object), then at least one of the links must be incorrect, but we cannot decide which. In that case we may decide to accept none of these links. Running select_unique with m = 1 will deselect all pairs involving that record from x.
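
The rule described above can be illustrated with a small base R sketch. It is not the reclin2 implementation: deselect_ambiguous is a hypothetical helper, the toy data are made up, and it assumes that links are counted among the preselected pairs only and that m limits the links per x-record while n limits the links per y-record.

```r
# Toy pairs: each row is a candidate link between a record in x and one in y.
pairs <- data.frame(
  x = c(1, 1, 2, 3, 4),
  y = c(1, 2, 3, 4, 4),
  select = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Hypothetical helper (not part of reclin2): keep a preselected pair only
# when its x-record has at most m links and its y-record at most n links.
deselect_ambiguous <- function(x, y, preselect, n = 1, m = 1) {
  # Assumption: only preselected pairs count as links.
  links_per_x <- ave(as.integer(preselect), x, FUN = sum)
  links_per_y <- ave(as.integer(preselect), y, FUN = sum)
  preselect & links_per_x <= m & links_per_y <= n
}

pairs$select_unique <- deselect_ambiguous(pairs$x, pairs$y, pairs$select)
print(pairs)
```

In this toy data, record 1 from x is linked to two records from y, so both of its pairs are deselected. The pairs of records 2 and 3 stay selected, and the last pair was never preselected.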

To select one of the candidate pairs at random instead, use select_greedy: it selects the pair with the highest weight and, when weights are equal, the first one. Adding a small random component to the weights ensures a random selection among tied pairs.
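
A minimal base R sketch of the tie-breaking idea follows. greedy_select is a hypothetical stand-in written for illustration and is not the select_greedy implementation from reclin2. The scores are made up, and the noise scale of 1e-6 is an assumption chosen to be far smaller than any real difference between scores.

```r
set.seed(1)

# Toy pairs: record 1 from x ties between two records from y.
pairs <- data.frame(x = c(1, 1, 2), y = c(1, 2, 3), score = c(5, 5, 4))

# Hypothetical greedy selection: visit pairs from highest to lowest score and
# select a pair when neither of its records has been used already.
greedy_select <- function(x, y, score) {
  selected <- logical(length(score))
  used_x <- c()
  used_y <- c()
  for (i in order(score, decreasing = TRUE)) {
    if (!(x[i] %in% used_x) && !(y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, x[i])
      used_y <- c(used_y, y[i])
    }
  }
  selected
}

# Without noise, the first of the tied pairs wins.
first <- greedy_select(pairs$x, pairs$y, pairs$score)

# With a tiny random component, either tied pair can win.
noisy_score <- pairs$score + runif(nrow(pairs), min = 0, max = 1e-6)
random <- greedy_select(pairs$x, pairs$y, noisy_score)
```

Keeping the noise well below the smallest meaningful score difference means it only affects the order of pairs with equal scores, not the ranking of pairs that genuinely differ.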

Value

Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.

Examples

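library(reclin2)
data("linkexample1", "linkexample2")
# Generate candidate pairs that agree on postcode
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
# Compare the pairs on the linkage variables
compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
  default_comparator = jaro_winkler(0.9), inplace = TRUE)
# Compute a simple score, giving lastname a larger weight
score_simple(pairs, "score",
  on = c("lastname", "firstname", "address", "sex"),
  w1 = list(lastname = 2), inplace = TRUE)
# Preselect pairs with a score above the threshold
select_threshold(pairs, variable = "select",
  score = "score", threshold = 4.0, inplace = TRUE)
# Deselect preselected pairs whose records are linked more than once
select_unique(pairs, variable = "select_unique", preselect = "select")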


reclin2 documentation built on May 29, 2024, 4:21 a.m.