candidates: Create candidate links from two datasets.
In rijpma/capelinker: Machine Learning-based Record Linkage for Historical South Africa

candidates

R Documentation

Create candidate links from two datasets.

Description

candidates merges two datasets based on a distance criterion. The resulting dataset can be used to predict links.

Usage

candidates(
  dat_from,
  dat_to,
  blockvariable_from = "mlast",
  blockvariable_to = "mlast",
  idvariable_from = "persid",
  idvariable_to = "persid",
  blocktype = c("bigram distance", "string distance", "numeric", "idf bigram distance",
    "soundex"),
  linktype = c("one:one", "many:one"),
  maxdist = 0.15
)

Arguments

`dat_from`	The "from" dataset, should be a data.table
`dat_to`	The "to" dataset, should be a data.table
`blockvariable_from`	String giving the name of the blocking variable in the "from" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data.
`blockvariable_to`	String giving the name of the blocking variable in the "to" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data.
`idvariable_from`	String giving the identifier variable in dat_from.
`idvariable_to`	String giving the identifier variable in dat_to.
`blocktype`	Type of blocking: bigram distance (default), string distance, numeric, idf bigram distance, or soundex.
`linktype`	Should there be no more than one record in each dataset that can be linked (one:one), or is it possible for multiple records in `dat_from` to be linked to the same record in `dat_to` (many:one)? Defaults to "one:one".
`maxdist`	Maximum distance (0-1) to consider a record a candidate. Defaults to 0.15 for male surname string distance. If using numeric distance (for instance year of birth), very different values could be needed.

Details

Blocking on multiple variables is currently not supported, but it might work to call candidates() repeatedly and merge the results.
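A minimal sketch of how such a merge might look, using base R only. The two tables below are hand-made stand-ins for the results of two separate candidates() calls, not real output; the second blocking variable is hypothetical. The id column names persid_from and persid_to are an assumption based on the suffix rule described under Value.

```r
# Hand-made stand-ins for two candidates() results (assumed column names).
# by_surname: pairs that passed a surname block.
by_surname <- data.frame(
  persid_from = c(1, 1, 2),
  persid_to   = c(1, 2, 3)
)
# by_other: pairs that passed a block on a second, hypothetical variable.
by_other <- data.frame(
  persid_from = c(1, 2, 3),
  persid_to   = c(2, 3, 4)
)

# Keep only the pairs that appear in both candidate sets.
both <- merge(by_surname, by_other, by = c("persid_from", "persid_to"))
print(both)
```

Passing all = TRUE to merge() would instead keep pairs that passed either block, which is the looser of the two readings of "merging the results".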

Because historical records often provide limited information, it is possible to block on string distances. Note that this can become quite slow when there is a large number of records in each dataset (say, tens of thousands).

String distance blocking is done using Jaro-Winkler string distances, which de-emphasise differences at the end of the string. The distance ranges from 0 (perfect match) to 1 (complete mismatch). Set maxdist (see Arguments) accordingly.
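To give a feel for the scale of these distances, here is a plain-R sketch of the standard Jaro-Winkler distance applied to the surnames from the Examples. It is an illustration only and not the package's own code: the helper name jaro_winkler_dist, the prefix weight p = 0.1, and treating maxdist as an inclusive threshold are all assumptions.

```r
# Illustrative Jaro-Winkler distance for two single strings (not package code).
jaro_winkler_dist <- function(a, b, p = 0.1) {
  if (a == b) return(0)
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(1)
  # Characters only count as matching within this window of positions.
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  matched1 <- logical(n1)
  matched2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(1)
  # Matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(s1[matched1] != s2[matched2]) / 2
  jaro <- (m / n1 + m / n2 + (m - transpositions) / m) / 3
  # Winkler adjustment: reward a shared prefix of up to four characters.
  prefix <- 0
  for (k in seq_len(min(4, n1, n2))) {
    if (s1[k] != s2[k]) break
    prefix <- prefix + 1
  }
  1 - (jaro + prefix * p * (1 - jaro))
}

d_from <- c("jong", "smid", "nauda")
d_to   <- c("jongh", "jong", "smit", "veld")
dists <- outer(d_from, d_to, Vectorize(jaro_winkler_dist))
dimnames(dists) <- list(d_from, d_to)

# Pairs at or below the default maxdist of 0.15 (inclusive by assumption).
is_candidate <- dists <= 0.15
print(round(dists, 3))
```

Because of the prefix adjustment, a difference at the end of a name ("jong" versus "jonk") yields a smaller distance than the same difference at the start ("jong" versus "kong").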

Currently, missing values can be returned when no candidate is found for a given record. While these records can never be matched, they are left in to make comparisons of the dataset easier.
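A short sketch of how such rows could be separated before predicting links. The table is a hand-made stand-in for candidates() output, not real output; it assumes that a record without candidates shows up with NA in the "_to" columns and that the id columns are named persid_from and persid_to.

```r
# Hand-made stand-in for a candidates() result (assumed layout).
cand <- data.frame(
  persid_from = c(1, 2, 3),
  mlast_from  = c("jong", "smid", "nauda"),
  persid_to   = c(2, 3, NA),
  mlast_to    = c("jong", "smit", NA)
)

# Rows that can go on to link prediction.
linkable <- cand[!is.na(cand$persid_to), ]

# Records from the "from" data for which no candidate was found.
unmatched <- cand[is.na(cand$persid_to), ]

# Share of "from" records without any candidate.
share_unmatched <- mean(is.na(cand$persid_to))
```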

Value

A dataset containing all candidate pairs, and all columns in dat_from and dat_to. Columns with the same name will get a suffix "_from" or "_to".

Examples

d1 = data.table::data.table(mlast = c("jong", "smid", "nauda"), persid = c(1:3))
d2 = data.table::data.table(mlast = c("jongh", "jong", "smit", "veld"), persid = c(1:4))
candidates(d1, d2)

rijpma/capelinker documentation built on Nov. 7, 2024, 3:06 a.m.
