candidates | R Documentation |
candidates
Merges two datasets based on a distance criterion. The resulting dataset of candidate pairs can be used to predict links.
candidates(
dat_from,
dat_to,
blockvariable_from = "mlast",
blockvariable_to = "mlast",
idvariable_from = "persid",
idvariable_to = "persid",
blocktype = c("bigram distance", "string distance", "numeric", "idf bigram distance",
"soundex"),
linktype = c("one:one", "many:one"),
maxdist = 0.15
)
dat_from |
The "from" dataset, should be a data.table |
dat_to |
The "to" dataset, should be a data.table |
blockvariable_from |
String giving the name of the blocking variable in the "from" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data. |
blockvariable_to |
String giving the name of the blocking variable in the "to" data. Distance between this variable in both datasets determines whether a pair of records is a candidate. Defaults to "mlast", the male surname in the opgaafrollen data. |
idvariable_from |
String giving the identifier variable in dat_from. |
idvariable_to |
String giving the identifier variable in dat_to. |
blocktype |
Type of blocking: "bigram distance" (default), "string distance", "numeric", "idf bigram distance" or "soundex". |
linktype |
Should there be no more than one record in each dataset that can be linked (one:one), or is it possible for multiple records in one dataset to be linked to a single record in the other (many:one)? |
maxdist |
Maximum distance (0-1) to consider a record a candidate. Defaults to 0.15 for male surname string distance. If using numeric distance (for instance year of birth), very different values could be needed. |
Blocking on multiple variables is currently not supported, but using candidates()
repeatedly and merging the results might work.
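The merging step could look roughly like the sketch below. It does not call candidates() itself: cand_surname and cand_year are hypothetical stand-ins for the results of two separate calls (one blocking on surname, one on a second variable), reduced to the identifier columns. The column names persid_from and persid_to assume the default idvariable "persid" and the "_from"/"_to" suffixes described under Value.

```r
# Hypothetical stand-ins for two candidates() results, keeping only the
# identifier columns (assumed names: persid_from, persid_to).
cand_surname <- data.frame(persid_from = c(1, 1, 2),
                           persid_to   = c(1, 2, 3))
cand_year    <- data.frame(persid_from = c(1, 2, 2),
                           persid_to   = c(2, 3, 4))

# Keep only the pairs that are candidates under both blocking variables.
# merge() defaults to an inner join, so unmatched pairs are dropped.
cand_both <- merge(cand_surname, cand_year,
                   by = c("persid_from", "persid_to"))
cand_both
```

An inner join is the strict reading of "blocking on both variables"; passing all = TRUE to merge() would instead keep pairs that pass either criterion.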
Because historical records often provide limited information, it is possible to block on string distances. Note that this can become quite slow when there is a large number of records in each dataset (say, tens of thousands).
String distance blocking is done using Jaro-Winkler string distances, which de-emphasise differences at the end of the string. The distance ranges from 0 (perfect match) to 1 (complete mismatch). Set maxdist (see Arguments) accordingly.
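To get a feel for the scale of these distances before choosing maxdist, the surnames from the example data can be compared directly. This is a minimal sketch that assumes the stringdist package is installed. Whether candidates() uses this package internally, and with which prefix weight p, is an assumption and not something this page states.

```r
# Jaro-Winkler distances between a few surname pairs.
# Assumption: stringdist's "jw" method with prefix weight p = 0.1;
# candidates() may be configured differently.
library(stringdist)

from <- c("jong", "jong",  "smid", "nauda")
to   <- c("jong", "jongh", "smit", "veld")

# Element-wise distances: from[i] is compared with to[i]
jw <- stringdist(from, to, method = "jw", p = 0.1)
names(jw) <- paste(from, to, sep = " ~ ")

# Pairs at or below the threshold would count as candidates
maxdist <- 0.15
is_candidate <- jw <= maxdist
is_candidate
```

Under these assumptions the three near-identical spellings fall below the default maxdist of 0.15, while the unrelated pair does not.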
Currently, missing values can be returned when no candidate is found for a given record. While these records can never be matched, they are left in to make comparisons of the datasets easier.
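If such rows get in the way of later steps, they can be split off before modelling. The sketch below uses a hypothetical stand-in for the output of candidates() and assumes that a record without a candidate shows up with NA in the "_to" columns; check the actual output before relying on this.

```r
# Hypothetical stand-in for candidates() output.
# Assumption: "from" records without a candidate have NA in the "_to" columns.
cand <- data.frame(
  persid_from = c(1, 1, 2, 3),
  mlast_from  = c("jong", "jong", "smid", "nauda"),
  persid_to   = c(1, 2, 3, NA),
  mlast_to    = c("jongh", "jong", "smit", NA)
)

# Records for which no candidate was found
unmatched <- cand[is.na(cand$persid_to), ]

# Candidate pairs that can actually be scored
pairs <- cand[!is.na(cand$persid_to), ]
```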
A dataset containing all candidate pairs, and all columns in dat_from and dat_to. Columns with the same name will get a suffix "_from" or "_to".
d1 = data.table::data.table(mlast = c("jong", "smid", "nauda"), persid = c(1:3))
d2 = data.table::data.table(mlast = c("jongh", "jong", "smit", "veld"), persid = c(1:4))
candidates(d1, d2)