fastLink | R Documentation |
Run the fastLink algorithm to probabilistically match two datasets.
fastLink(dfA, dfB, varnames, stringdist.match,
stringdist.method, numeric.match, partial.match,
cut.a, cut.p, jw.weight,
cut.a.num, cut.p.num,
priors.obj, w.lambda, w.pi,
address.field, gender.field, estimate.only, em.obj,
dedupe.matches, linprog.dedupe,
reweight.names, firstname.field, cond.indep,
n.cores, tol.em, threshold.match, return.all, return.df, verbose)
dfA |
Dataset A - to be matched to Dataset B |
dfB |
Dataset B - to be matched to Dataset A |
varnames |
A vector of variable names to use for matching. Must be present in both dfA and dfB |
stringdist.match |
A vector of variable names indicating which variables should use string distance matching. Must be a subset of 'varnames' and must not be present in 'numeric.match'. |
stringdist.method |
String distance method for calculating similarity. Options are: "jw" Jaro-Winkler (Default), "dl" Damerau-Levenshtein, "jaro" Jaro, and "lv" Edit |
numeric.match |
A vector of variable names indicating which variables should use numeric matching. Must be a subset of 'varnames' and must not be present in 'stringdist.match'. |
partial.match |
A vector of variable names indicating whether to include a partial matching category for the string distances. Must be a subset of 'varnames' and 'stringdist.match'. |
cut.a |
Lower bound for full string-distance match, ranging between 0 and 1. Default is 0.94 |
cut.p |
Lower bound for partial string-distance match, ranging between 0 and 1. Default is 0.88 |
jw.weight |
Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is .10 |
cut.a.num |
Lower bound for full numeric match. Default is 1 |
cut.p.num |
Lower bound for partial numeric match. Default is 2.5 |
priors.obj |
A list containing priors for auxiliary movers information, as output from calcMoversPriors(). Default is NULL |
w.lambda |
How much weight to give the prior on lambda versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided). |
w.pi |
How much weight to give the prior on pi versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided). |
address.field |
The name of the address field. To be used when 'pi.prior' is included in 'priors.obj'. Default is NULL (no matching variables should have address prior applied). Must be present in 'varnames'. |
gender.field |
The name of the field indicating gender. If provided, the exact-matching gender prior is used in the EM algorithm. Default is NULL (do not implement exact matching on gender). Must be present in 'varnames'. |
estimate.only |
Whether to stop running the algorithm after the EM step (omitting getting the matched indices of dataset A and dataset B). Only the EM object will be returned. Can be used when running the match on a random sample and applying to a larger dataset, or for out-of-sample prediction of matches. Default is FALSE. |
em.obj |
An EM object from a prior run of 'fastLink' or 'emlinkMARmov'. Parameter estimates will be applied to the matching patterns in 'dfA' and 'dfB'. If provided, 'estimate.only' is set to FALSE. Often provided when parameters have been estimated on a smaller sample, and the user wants to apply them to the full dataset. Default is NULL (EM will be estimated from matching patterns in 'dfA' and 'dfB'). |
dedupe.matches |
Whether to dedupe the set of matches returned by the algorithm. Default is TRUE. |
linprog.dedupe |
If deduping matches, whether to use Winkler's linear programming solution to dedupe. Default is FALSE. |
reweight.names |
Whether to reweight the posterior match probabilities by the frequency of individual first names. Default is FALSE. |
firstname.field |
The name of the field indicating first name. Must be provided if reweight.names = TRUE. |
cond.indep |
Whether to estimate the parameters of interest from the Fellegi-Sunter model under conditional independence. Default is TRUE. If set to FALSE, parameter estimates are obtained from a model that allows for dependencies across linkage fields. |
n.cores |
Number of cores to parallelize over. Default is NULL. |
tol.em |
Convergence tolerance for the EM Algorithm. Default is 1e-04. |
threshold.match |
A number between 0 and 1 indicating either the lower bound (if only one number provided) or the range of certainty that the user wants to declare a match. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches, while threshold.match = c(.85, .95) will return all pairs with posterior probability between .85 and .95 as matches. |
return.all |
Whether to return the most likely match for each observation in dfA and dfB. Overrides user setting of |
return.df |
Whether to return the entire dataframe of dfA and dfB instead of just the indices. Default is FALSE. |
verbose |
Whether to print elapsed time for each step. Default is FALSE. |
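The thresholding arguments above (cut.a, cut.p, and threshold.match) can be pictured with a minimal base R sketch. The helper names classify_similarity and select_by_posterior are hypothetical and are not part of fastLink. How values that fall exactly on a boundary are treated is an assumption made here for illustration, not a statement about the package internals.

```r
# Illustrative helpers only; these are NOT functions exported by fastLink.

# Classify similarity scores (between 0 and 1) into agreement levels,
# using cut.a and cut.p as lower bounds for full and partial matches.
# Assumption: a score equal to a bound counts as reaching that level.
classify_similarity <- function(sim, cut.a = 0.94, cut.p = 0.88) {
  stopifnot(cut.p <= cut.a)
  out <- rep("disagree", length(sim))
  out[which(sim >= cut.p)] <- "partial"
  out[which(sim >= cut.a)] <- "agree"
  out[is.na(sim)] <- NA_character_
  out
}

# Select pairs by posterior match probability.
# One number: keep pairs above that lower bound.
# Two numbers: keep pairs inside the range (bounds assumed inclusive above).
select_by_posterior <- function(posterior, threshold.match = 0.85) {
  stopifnot(length(threshold.match) %in% c(1, 2))
  if (length(threshold.match) == 1) {
    which(posterior > threshold.match)
  } else {
    which(posterior > threshold.match[1] & posterior <= threshold.match[2])
  }
}

sim <- c(0.99, 0.90, 0.50)        # made-up similarity scores
levels_out <- classify_similarity(sim)

post <- c(0.99, 0.90, 0.60)       # made-up posterior probabilities
above <- select_by_posterior(post, 0.85)
inside <- select_by_posterior(post, c(0.85, 0.95))
```

With a single threshold the sketch keeps every pair above the bound, while a two-element threshold keeps only the pairs in the middle band, which can be useful for pulling out uncertain pairs for manual review.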
fastLink returns a list of class 'fastLink' containing the following components if calculating matches:
matches |
An nmatches X 2 matrix containing the indices of the successful matches in |
EM |
A list with the output of the EM algorithm, which contains the exact matching patterns and the associated posterior probabilities of a match for each matching pattern. |
patterns |
A matrix with the observed matching patterns for each successfully matched pair. |
nobs.a |
The number of observations in dataset A. |
nobs.b |
The number of observations in dataset B. |
zeta.name |
If reweighting by name, the posterior probability of a match for each match in dataset A and B. |
If only running the EM and not returning the matched indices, fastLink only returns the EM object.
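When matches are calculated, a two-column index matrix like 'matches' can be used to line up the linked records. The following is a minimal base R sketch on mock data, not output from fastLink. It assumes the first column indexes rows of dfA and the second indexes rows of dfB; the description of 'matches' above is truncated, so that layout is an assumption.

```r
# Mock inputs for illustration only; these are not produced by fastLink.
dfA <- data.frame(firstname = c("ann", "bob", "cara"),
                  birthyear = c(1980, 1975, 1990),
                  stringsAsFactors = FALSE)
dfB <- data.frame(firstname = c("bob", "anne"),
                  birthyear = c(1975, 1980),
                  stringsAsFactors = FALSE)

# Assumed layout: column 1 = row index in dfA, column 2 = row index in dfB.
matches <- cbind(c(1, 2), c(2, 1))

# Pull the matched rows from each dataset and label their columns.
a_rows <- dfA[matches[, 1], , drop = FALSE]
b_rows <- dfB[matches[, 2], , drop = FALSE]
names(a_rows) <- paste0(names(a_rows), ".a")
names(b_rows) <- paste0(names(b_rows), ".b")

# One row per matched pair, with fields from both datasets side by side.
linked <- cbind(a_rows, b_rows)
rownames(linked) <- NULL
```

Setting return.df = TRUE, as described in the arguments, asks fastLink to return the data frames rather than only the indices, so a manual step like this is mainly useful for inspecting or post-processing the index matrix.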
Ted Enamorado <ted.enamorado@gmail.com>, Ben Fifield <benfifield@gmail.com>, and Kosuke Imai
## Not run:
fl.out <- fastLink(dfA, dfB,
varnames = c("firstname", "lastname", "streetname", "birthyear"),
n.cores = 1)
## End(Not run)