fastLink: fastLink
In fastLink: Fast Probabilistic Record Linkage with Missing Data

Description

Run the fastLink algorithm to probabilistically match two datasets.

Usage

fastLink(dfA, dfB, varnames, stringdist.match,
         stringdist.method, numeric.match, partial.match,
         cut.a, cut.p, jw.weight,
         cut.a.num, cut.p.num,
         priors.obj, w.lambda, w.pi,
         address.field, gender.field, estimate.only, em.obj,
         dedupe.matches, linprog.dedupe,
         reweight.names, firstname.field, cond.indep,
         n.cores, tol.em, threshold.match, return.all, return.df, verbose)

Arguments

`dfA`	Dataset A - to be matched to Dataset B
`dfB`	Dataset B - to be matched to Dataset A
`varnames`	A vector of variable names to use for matching. Must be present in both dfA and dfB
`stringdist.match`	A vector of variable names indicating which variables should use string distance matching. Must be a subset of 'varnames' and must not be present in 'numeric.match'.
`stringdist.method`	String distance method used to calculate similarity. Options are "jw" (Jaro-Winkler, the default), "dl" (Damerau-Levenshtein), "jaro" (Jaro), and "lv" (Levenshtein edit distance).
`numeric.match`	A vector of variable names indicating which variables should use numeric matching. Must be a subset of 'varnames' and must not be present in 'stringdist.match'.
`partial.match`	A vector of variable names indicating which variables should include a partial matching category for the string distances. Must be a subset of both 'varnames' and 'stringdist.match'.
`cut.a`	Lower bound for full string-distance match, ranging between 0 and 1. Default is 0.94
`cut.p`	Lower bound for partial string-distance match, ranging between 0 and 1. Default is 0.88
`jw.weight`	Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is .10
`cut.a.num`	Lower bound for full numeric match. Default is 1
`cut.p.num`	Lower bound for partial numeric match. Default is 2.5
`priors.obj`	A list containing priors for auxiliary movers information, as output from calcMoversPriors(). Default is NULL
`w.lambda`	How much weight to give the prior on lambda versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided).
`w.pi`	How much weight to give the prior on pi versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided).
`address.field`	The name of the address field. To be used when 'pi.prior' is included in 'priors.obj'. Default is NULL (no matching variables should have address prior applied). Must be present in 'varnames'.
`gender.field`	The name of the field indicating gender. If provided, the exact-matching gender prior is used in the EM algorithm. Default is NULL (do not implement exact matching on gender). Must be present in 'varnames'.
`estimate.only`	Whether to stop after the EM step, skipping retrieval of the matched indices of dataset A and dataset B. Only the EM object is returned. Useful when estimating on a random sample and then applying the estimates to a larger dataset, or for out-of-sample prediction of matches. Default is FALSE.
`em.obj`	An EM object from a prior run of 'fastLink' or 'emlinkMARmov'. Its parameter estimates are applied to the matching patterns in 'dfA' and 'dfB'. If provided, 'estimate.only' is set to FALSE. Typically supplied when parameters have been estimated on a smaller sample and the user wants to apply them to the full dataset. Default is NULL (EM is estimated from the matching patterns in 'dfA' and 'dfB').
`dedupe.matches`	Whether to dedupe the set of matches returned by the algorithm. Default is TRUE.
`linprog.dedupe`	If deduping matches, whether to use Winkler's linear programming solution to dedupe. Default is FALSE.
`reweight.names`	Whether to reweight the posterior match probabilities by the frequency of individual first names. Default is FALSE.
`firstname.field`	The name of the field indicating first name. Must be provided if reweight.names = TRUE.
`cond.indep`	Whether estimates for the parameters of interest are obtained from the Fellegi-Sunter model under conditional independence. Default is TRUE. If set to FALSE, parameter estimates are obtained from a model that allows for dependencies across linkage fields.
`n.cores`	Number of cores to parallelize over. Default is NULL.
`tol.em`	Convergence tolerance for the EM Algorithm. Default is 1e-04.
`threshold.match`	A number between 0 and 1 indicating either the lower bound (if only one number provided) or the range of certainty that the user wants to declare a match. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches, while threshold.match = c(.85, .95) will return all pairs with posterior probability between .85 and .95 as matches.
`return.all`	Whether to return the most likely match for each observation in dfA and dfB. Overrides user setting of `threshold.match` by setting `threshold.match` to 0.0001, and automatically dedupes all matches. Default is FALSE.
`return.df`	Whether to return the entire dataframe of dfA and dfB instead of just the indices. Default is FALSE.
`verbose`	Whether to print elapsed time for each step. Default is FALSE.
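
The cutoff and threshold arguments above can be illustrated with a small base R sketch. It does not call fastLink; the similarity scores and posterior probabilities are made-up values, and treating the bounds as inclusive (`>=`) for `cut.a` and `cut.p` is an assumption made only for this illustration.

```r
# Hypothetical string similarity scores for four record pairs (not fastLink output).
sim <- c(0.99, 0.95, 0.90, 0.70)

cut.a <- 0.94  # documented default lower bound for a full match
cut.p <- 0.88  # documented default lower bound for a partial match

# 2 = full agreement, 1 = partial agreement, 0 = disagreement.
# Inclusive comparison is an assumption of this sketch.
agreement <- ifelse(sim >= cut.a, 2L, ifelse(sim >= cut.p, 1L, 0L))

# Hypothetical posterior match probabilities for the same four pairs.
zeta <- c(0.99, 0.90, 0.86, 0.40)

# threshold.match = .85: pairs with posterior probability greater than .85.
is.match.single <- zeta > 0.85

# threshold.match = c(.85, .95): pairs with posterior probability in that range.
is.match.range <- zeta > 0.85 & zeta < 0.95

print(agreement)
print(is.match.single)
print(is.match.range)
```

The values deliberately avoid the boundaries themselves, since this page does not say whether the bounds are inclusive.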

Value

fastLink returns a list of class 'fastLink' containing the following components if calculating matches:

`matches`	An nmatches X 2 matrix containing the indices of the successful matches in `dfA` in the first column, and the indices of the corresponding successful matches in `dfB` in the second column.
`EM`	A list with the output of the EM algorithm, which contains the exact matching patterns and the associated posterior probabilities of a match for each matching pattern.
`patterns`	A matrix with the observed matching patterns for each successfully matched pair.
`nobs.a`	The number of observations in dataset A.
`nobs.b`	The number of observations in dataset B.
`zeta.name`	If reweighting by name, the posterior probability of a match for each matched pair in datasets A and B.

If only running the EM and not returning the matched indices, fastLink only returns the EM object.
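
As a rough illustration of how the `matches` component can be used, the sketch below builds a mock object with the documented shape (row indices into `dfA` in the first column, row indices into `dfB` in the second) and pulls out the linked rows. The data frames and `mock.out` are hypothetical stand-ins, not real fastLink output.

```r
# Toy datasets; the names and values are invented for this sketch.
dfA <- data.frame(firstname = c("ana", "ben", "cy"), stringsAsFactors = FALSE)
dfB <- data.frame(firstname = c("cy", "anna"), stringsAsFactors = FALSE)

# Mock result mimicking the documented structure of the returned list.
mock.out <- list(
  matches = cbind(c(1, 3), c(2, 1)),  # column 1: rows of dfA, column 2: rows of dfB
  nobs.a  = nrow(dfA),
  nobs.b  = nrow(dfB)
)

# Subset each dataset by its column of indices, keeping data frame structure.
matched.A <- dfA[mock.out$matches[, 1], , drop = FALSE]
matched.B <- dfB[mock.out$matches[, 2], , drop = FALSE]

# Put the linked records side by side.
linked <- data.frame(A = matched.A$firstname, B = matched.B$firstname,
                     stringsAsFactors = FALSE)
print(linked)
```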

Author(s)

Ted Enamorado <ted.enamorado@gmail.com>, Ben Fifield <benfifield@gmail.com>, and Kosuke Imai

Examples

## Not run: 
fl.out <- fastLink(dfA, dfB,
                   varnames = c("firstname", "lastname", "streetname", "birthyear"),
                   n.cores = 1)

## End(Not run)
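
The `estimate.only` and `em.obj` arguments suggest a two-step workflow: estimate the EM parameters on a random sample, then apply them to the full data. The following is a minimal sketch of that idea, assuming the fastLink package is installed and that `data(samplematch)` loads example data frames named `dfA` and `dfB` containing the variables used here. The 30% sample size and the seed are arbitrary choices for illustration.

```r
# Sketch only: runs when the fastLink package is available.
if (requireNamespace("fastLink", quietly = TRUE)) {
  library(fastLink)
  data(samplematch, package = "fastLink")  # assumed to provide dfA and dfB

  vars <- c("firstname", "lastname", "streetname", "birthyear")

  # Draw random samples of each dataset (30% is an arbitrary choice).
  set.seed(1)
  dfA.s <- dfA[sample(nrow(dfA), floor(0.3 * nrow(dfA))), ]
  dfB.s <- dfB[sample(nrow(dfB), floor(0.3 * nrow(dfB))), ]

  # Step 1: stop after the EM step; only the EM object is returned.
  em.sample <- fastLink(dfA.s, dfB.s, varnames = vars,
                        estimate.only = TRUE, n.cores = 1)

  # Step 2: apply the sampled estimates to the full datasets.
  fl.full <- fastLink(dfA, dfB, varnames = vars,
                      em.obj = em.sample, n.cores = 1)

  print(dim(fl.full$matches))
}
```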

fastLink documentation built on Nov. 17, 2023, 9:06 a.m.
