stringSubset: stringSubset
In fastLink: Fast Probabilistic Record Linkage with Missing Data

Description

Removes from the set of candidate matches any observation that has no close match in the other dataset, as judged by a string-distance measure.

Usage

stringSubset(vecA, vecB, similarity.threshold = 0.8, stringdist.method = "jw",
             jw.weight = 0.10, n.cores = NULL)

Arguments

`vecA`	A character or factor vector from dataset A
`vecB`	A character or factor vector from dataset B
`similarity.threshold`	Lower bound on the string-similarity score for a pair to be considered a possible match. An observation with no possible matches above this threshold is discarded from the match. Default is 0.8.
`stringdist.method`	The method to use for calculating string-distance similarity. Possible values are 'jaro' (Jaro Distance), 'jw' (Jaro-Winkler), and 'lv' (Levenshtein). Default is 'jw'.
`jw.weight`	Weight given to the first characters of a string when computing the Jaro-Winkler measure (only used if stringdist.method = "jw"). Default is 0.10.
`n.cores`	Number of cores to parallelize over. Default is NULL.
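To make the role of `jw.weight` concrete, the following is a minimal sketch of the standard Winkler prefix adjustment, which raises a Jaro similarity in proportion to the length of the shared prefix (capped at four characters). The helper `winkler_adjust` and its `max.prefix` argument are hypothetical names used only for illustration; they are not part of fastLink, and the sketch does not claim to reproduce how fastLink computes the measure internally.

```r
# Hypothetical helper (not a fastLink function): apply the Winkler prefix
# adjustment to an already-computed Jaro similarity.
winkler_adjust <- function(a, b, jaro, jw.weight = 0.10, max.prefix = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max.prefix)
  # Length of the shared prefix, capped at max.prefix characters
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  prefix_len <- if (all(same)) n else which(!same)[1] - 1
  # Standard Winkler formula: jaro + l * p * (1 - jaro)
  jaro + prefix_len * jw.weight * (1 - jaro)
}

# "martha" and "marhta" have a Jaro similarity of 17/18 and share the
# three-character prefix "mar".
jaro_sim <- 17 / 18
jw_sim <- winkler_adjust("martha", "marhta", jaro_sim)
```

With the default weight of 0.10, the adjusted score is 17/18 + 3 * 0.10 * (1/18), so a larger `jw.weight` rewards agreement on the leading characters more strongly.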

Value

A list of length two: the first entry is a vector of indices of the observations in dataset A to be included in the match, and the second entry is the corresponding vector for dataset B.
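The subsetting idea described above can be sketched in base R. This is an illustration only, not fastLink's implementation: it uses Levenshtein similarity from `utils::adist` rather than Jaro-Winkler, runs on a single core, treats the threshold as inclusive, and returns a list whose element names (`indsA`, `indsB`) are hypothetical.

```r
# Hypothetical helper (not a fastLink function): keep observations that have
# at least one sufficiently similar string in the other vector.
subset_by_similarity <- function(vecA, vecB, similarity.threshold = 0.8) {
  vecA <- as.character(vecA)
  vecB <- as.character(vecB)
  # Levenshtein distance for every pair (rows: vecA, columns: vecB)
  d <- utils::adist(vecA, vecB)
  # Normalize by the longer string of each pair to get a similarity in [0, 1]
  max_len <- outer(nchar(vecA), nchar(vecB), pmax)
  sim <- 1 - d / max_len
  # Assumption: the threshold is treated as an inclusive lower bound
  close <- sim >= similarity.threshold
  list(
    indsA = which(rowSums(close, na.rm = TRUE) > 0),
    indsB = which(colSums(close, na.rm = TRUE) > 0)
  )
}

namesA <- c("jonathan", "elizabeth", "xavier")
namesB <- c("jonathon", "elisabeth", "peter")
res <- subset_by_similarity(namesA, namesB)
```

Here "xavier" and "peter" have no counterpart above the threshold, so only the first two positions of each vector are retained.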

Examples

## Not run: 
subset_out <- stringSubset(dfA$firstname, dfB$firstname, n.cores = 1)
fl_out <- fastLink(dfA[subset_out$dfA.block == 1,], dfB[subset_out$dfB.block == 1,],
                   varnames = c("firstname", "lastname", "streetname", "birthyear"),
                   n.cores = 1)

## End(Not run)

fastLink documentation built on Nov. 17, 2023, 9:06 a.m.