candidates: Create candidate links from two datasets.

View source: R/candidates.R

candidates R Documentation

Description

candidates merges two datasets based on a distance criterion. The resulting dataset can be used to predict links.

Usage

candidates(
  dat_from,
  dat_to,
  blockvariable_from = "mlast",
  blockvariable_to = "mlast",
  idvariable_from = "persid",
  idvariable_to = "persid",
  blocktype = c("bigram distance", "string distance", "numeric", "idf bigram distance",
    "soundex"),
  linktype = c("one:one", "many:one"),
  maxdist = 0.15
)

Arguments

dat_from

The "from" dataset, should be a data.table

dat_to

The "to" dataset, should be a data.table

blockvariable_from

String giving the name of the blocking variable in the "from" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data.

blockvariable_to

String giving the name of the blocking variable in the "to" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data.

idvariable_from

String giving the identifier variable in dat_from.

idvariable_to

String giving the identifier variable in dat_to.

blocktype

Type of blocking: "bigram distance" (default), "string distance", "numeric", "idf bigram distance" or "soundex".

linktype

Should there be no more than one record in each dataset that can be linked (one:one), or is it possible for multiple records in dat_from to be linked to dat_to (many:one)? Defaults to "one:one".

maxdist

Maximum distance (0-1) to consider a record a candidate. Defaults to 0.15 for male surname string distance. If using numeric distance (for instance year of birth), very different values could be needed.

Details

Blocking on multiple variables is currently not supported. Using candidates() repeatedly and merging the results might work.
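The repeated-call workaround mentioned above could look roughly like the sketch below. The two data frames are hand-made stand-ins for the output of two separate candidates() calls, one per blocking variable. The column names persid_from and persid_to follow the suffix rule described under Value, while dist_surname and dist_year are hypothetical and only there to show that extra columns survive the merge.

```r
# Hypothetical stand-in for candidates() blocked on surname.
cand_surname <- data.frame(
  persid_from  = c(1, 1, 2),
  persid_to    = c(1, 2, 3),
  dist_surname = c(0.04, 0.00, 0.12)   # assumed column, for illustration
)

# Hypothetical stand-in for candidates() blocked on year of birth.
cand_year <- data.frame(
  persid_from = c(1, 2, 3),
  persid_to   = c(2, 3, 4),
  dist_year   = c(0, 1, 2)             # assumed column, for illustration
)

# Inner join on the pair of identifiers: only pairs that are
# candidates under both blocking variables are kept.
both <- merge(cand_surname, cand_year, by = c("persid_from", "persid_to"))
print(both)
```

An inner join treats the two blocking variables as a joint requirement. Passing all = TRUE to merge() would instead keep pairs that pass either criterion.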

Because historical records often provide limited information, it is possible to block on string distances. Note that this can become quite slow when there is a large number of records in each dataset (say, tens of thousands).

String distance blocking is done using Jaro-Winkler string distances, which de-emphasise differences at the end of the string. The distance ranges from 0 (perfect match) to 1 (complete mismatch). Set maxdist (see Arguments) accordingly.
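To get a feel for what a given maxdist admits, the following base R sketch computes a Jaro-Winkler distance by hand. It is an illustration only, not the package's own code: the function names jaro_sim and jw_dist are made up here, and the prefix weight p = 0.1 with a four-character prefix cap is the conventional choice, assumed rather than taken from the package. The stringdist package offers the same measure via stringdist(a, b, method = "jw", p = 0.1).

```r
# Jaro similarity: share of matching characters, corrected for transpositions.
jaro_sim <- function(a, b) {
  sa <- strsplit(a, "")[[1]]
  sb <- strsplit(b, "")[[1]]
  la <- length(sa)
  lb <- length(sb)
  if (la == 0 && lb == 0) return(1)
  if (la == 0 || lb == 0) return(0)
  # Characters only count as matching within this window of positions.
  window <- max(0, floor(max(la, lb) / 2) - 1)
  a_hit <- rep(FALSE, la)
  b_hit <- rep(FALSE, lb)
  for (i in seq_len(la)) {
    lo <- max(1, i - window)
    hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_hit[j] && sa[i] == sb[j]) {
        a_hit[i] <- TRUE
        b_hit[j] <- TRUE
        break
      }
    }
  }
  m <- sum(a_hit)
  if (m == 0) return(0)
  # Matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(sa[a_hit] != sb[b_hit]) / 2
  (m / la + m / lb + (m - transpositions) / m) / 3
}

# Jaro-Winkler distance: a shared prefix (up to 4 characters) raises the
# similarity, so differences near the end of the string weigh less.
jw_dist <- function(a, b, p = 0.1) {
  sim <- jaro_sim(a, b)
  sa <- strsplit(a, "")[[1]]
  sb <- strsplit(b, "")[[1]]
  prefix <- 0
  for (k in seq_len(min(length(sa), length(sb), 4))) {
    if (sa[k] != sb[k]) break
    prefix <- prefix + 1
  }
  1 - (sim + prefix * p * (1 - sim))
}

maxdist <- 0.15
d <- mapply(jw_dist, c("jong", "smid", "nauda"), c("jongh", "smit", "veld"))
print(round(d, 3))
print(d <= maxdist)  # which pairs would pass the threshold
```

With these inputs, "jong"/"jongh" and "smid"/"smit" fall below 0.15 while "nauda"/"veld" does not. A difference at the start of the string ("hjong") yields a larger distance than one at the end ("jongh").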

Missing values can currently be returned when no candidate is found for a given record. While these records can never be matched, they are left in to make comparisons of the datasets easier.
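If the unmatched records get in the way downstream, they can be split off after the fact. The sketch below uses a hand-made data frame as a stand-in for a candidates() result. The exact shape of the real output is an assumption here, in particular that an unmatched "from" record shows up with NA in the "_to" columns.

```r
# Hypothetical stand-in for a candidates() result; the last row is a
# "from" record for which no candidate was found.
cand <- data.frame(
  persid_from = c(1, 1, 2, 3),
  mlast_from  = c("jong", "jong", "smid", "nauda"),
  persid_to   = c(1, 2, 3, NA),
  mlast_to    = c("jongh", "jong", "smit", NA)
)

# Rows that can actually be scored as potential links.
linkable <- cand[!is.na(cand$persid_to), ]

# Rows kept only for bookkeeping: records without any candidate.
unmatched <- cand[is.na(cand$persid_to), ]

print(linkable)
print(unmatched)
```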

Value

A dataset containing all candidate pairs, and all columns in dat_from and dat_to. Columns with the same name will get a suffix "_from" or "_to".

Examples

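library(capelinker)
# Two small example datasets that use the default column names
d1 = data.table::data.table(mlast = c("jong", "smid", "nauda"), persid = c(1:3))
d2 = data.table::data.table(mlast = c("jongh", "jong", "smit", "veld"), persid = c(1:4))
# Build candidate pairs with the defaults (blocking on mlast, maxdist = 0.15)
candidates(d1, d2)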

rijpma/capelinker documentation built on Nov. 7, 2024, 3:06 a.m.