fastLink: fastLink

View source: R/fastLink.R

fastLink R Documentation

fastLink

Description

Run the fastLink algorithm to probabilistically match two datasets.

Usage

fastLink(dfA, dfB, varnames, stringdist.match,
         stringdist.method, numeric.match, partial.match,
         cut.a, cut.p, jw.weight,
         cut.a.num, cut.p.num,
         priors.obj, w.lambda, w.pi,
         address.field, gender.field, estimate.only, em.obj,
         dedupe.matches, linprog.dedupe,
         reweight.names, firstname.field, cond.indep,
         n.cores, tol.em, threshold.match, return.all, return.df, verbose)

Arguments

dfA

Dataset A - to be matched to Dataset B

dfB

Dataset B - to be matched to Dataset A

varnames

A vector of variable names to use for matching. Each name must be present in both dfA and dfB.

stringdist.match

A vector of variable names indicating which variables should use string distance matching. Must be a subset of 'varnames' and must not be present in 'numeric.match'.

stringdist.method

String distance method used to calculate similarity. Options are "jw" (Jaro-Winkler, the default), "dl" (Damerau-Levenshtein), "jaro" (Jaro), and "lv" (Levenshtein edit distance).

numeric.match

A vector of variable names indicating which variables should use numeric matching. Must be a subset of 'varnames' and must not be present in 'stringdist.match'.

partial.match

A vector of variable names indicating which variables should include a partial matching category for the string distances. Must be a subset of 'varnames' and 'stringdist.match'.

cut.a

Lower bound for full string-distance match, ranging between 0 and 1. Default is 0.94

cut.p

Lower bound for partial string-distance match, ranging between 0 and 1. Default is 0.88
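As a rough illustration of how the two string cutoffs partition similarity scores, the base R sketch below sorts a few hypothetical similarity values into full, partial, and no agreement using the documented defaults. The scores are made up, and treating a score exactly at a cutoff as belonging to the higher category is an assumption, not something this page states.

```r
# Hypothetical similarity scores in [0, 1], e.g. Jaro-Winkler similarities.
sim <- c(0.99, 0.94, 0.91, 0.88, 0.50)

cut.a <- 0.94  # documented default lower bound for a full match
cut.p <- 0.88  # documented default lower bound for a partial match

# Assumption: a score at or above a cutoff falls into that category.
agreement <- ifelse(sim >= cut.a, "full",
                    ifelse(sim >= cut.p, "partial", "none"))

print(data.frame(sim, agreement))
```

Raising cut.a shrinks the "full" band, and setting cut.p closer to cut.a shrinks the "partial" band.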

jw.weight

Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is .10

cut.a.num

Cutoff for a full numeric match, compared against the absolute difference between two values. Default is 1

cut.p.num

Cutoff for a partial numeric match, compared against the absolute difference between two values. Default is 2.5
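The numeric cutoffs can be pictured with a similar base R sketch. It assumes the cutoffs are compared against the absolute difference between two values, with smaller differences counting as closer matches. That reading follows from the defaults (1 for a full match, 2.5 for a partial match) but is an assumption here, as are the birth years used.

```r
# Hypothetical birth years: one record from dataset A against four from B.
year.a <- 1980
year.b <- c(1980, 1981, 1982, 1990)

cut.a.num <- 1    # documented default for a full numeric match
cut.p.num <- 2.5  # documented default for a partial numeric match

# Assumption: agreement is judged on the absolute difference.
abs.diff <- abs(year.a - year.b)
agreement <- ifelse(abs.diff <= cut.a.num, "full",
                    ifelse(abs.diff <= cut.p.num, "partial", "none"))

print(data.frame(year.b, abs.diff, agreement))
```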

priors.obj

A list containing priors for auxiliary movers information, as output from calcMoversPriors(). Default is NULL

w.lambda

How much weight to give the prior on lambda versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided).

w.pi

How much weight to give the prior on pi versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior). Default is NULL (no prior information provided).

address.field

The name of the address field. Must be present in 'varnames'. Used when 'pi.prior' is included in 'priors.obj'. Default is NULL (the address prior is not applied to any matching variable).

gender.field

The name of the field indicating gender. If provided, the exact-matching gender prior is used in the EM algorithm. Default is NULL (do not implement exact matching on gender). Must be present in 'varnames'.

estimate.only

Whether to stop after the EM step, skipping the step that retrieves the matched indices of dataset A and dataset B. If TRUE, only the EM object is returned. Useful when estimating parameters on a random sample and applying them to a larger dataset, or for out-of-sample prediction of matches. Default is FALSE.

em.obj

An EM object from a prior run of 'fastLink' or 'emlinkMARmov'. Its parameter estimates are applied to the matching patterns in 'dfA' and 'dfB'. If provided, 'estimate.only' is set to FALSE. Typically supplied when parameters have been estimated on a smaller sample and the user wants to apply them to the full dataset. Default is NULL (the EM parameters are estimated from the matching patterns in 'dfA' and 'dfB').
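The sample-then-apply workflow described for 'estimate.only' and 'em.obj' might look like the sketch below. It assumes the fastLink package is installed and that its bundled 'samplematch' data provides 'dfA' and 'dfB'. The sample size of 150 and the object names 'em.sample' and 'fl.full' are illustrative choices, not recommendations. The code is skipped when the package is unavailable.

```r
vars <- c("firstname", "lastname", "streetname", "birthyear")

if (requireNamespace("fastLink", quietly = TRUE)) {
  library(fastLink)
  data(samplematch)  # assumed to load the example data frames dfA and dfB

  # Step 1: estimate the EM parameters on random subsamples only.
  set.seed(1)
  dfA.s <- dfA[sample(nrow(dfA), 150), ]
  dfB.s <- dfB[sample(nrow(dfB), 150), ]
  em.sample <- fastLink(dfA.s, dfB.s, varnames = vars,
                        estimate.only = TRUE)

  # Step 2: apply the sample estimates to the full datasets.
  fl.full <- fastLink(dfA, dfB, varnames = vars,
                      em.obj = em.sample)
}
```

The same 'varnames' must be used in both calls so that the matching patterns in the full data line up with those used for estimation.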

dedupe.matches

Whether to dedupe the set of matches returned by the algorithm. Default is TRUE.

linprog.dedupe

If deduping matches, whether to use Winkler's linear programming solution to dedupe. Default is FALSE.

reweight.names

Whether to reweight the posterior match probabilities by the frequency of individual first names. Default is FALSE.

firstname.field

The name of the field indicating first name. Must be provided if reweight.names = TRUE.

cond.indep

Whether to estimate the parameters of interest from the Fellegi-Sunter model under conditional independence. Default is TRUE. If set to FALSE, parameter estimates are obtained from a model that allows for dependencies across linkage fields.

n.cores

Number of cores to parallelize over. Default is NULL.

tol.em

Convergence tolerance for the EM Algorithm. Default is 1e-04.

threshold.match

One or two numbers between 0 and 1. A single number sets the lower bound on the posterior probability required to declare a match; a vector of two numbers sets the range of posterior probabilities within which a pair is declared a match. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches, while threshold.match = c(.85, .95) will return all pairs with posterior probability between .85 and .95 as matches.
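The two forms of threshold.match can be illustrated in base R. The helper 'declare_match' below is hypothetical and not part of fastLink, the posterior probabilities are made up, and the use of strict inequalities at the boundaries is an assumption.

```r
# Hypothetical helper: flag which posterior probabilities count as matches.
declare_match <- function(posterior, threshold.match) {
  if (length(threshold.match) == 1) {
    # Single number: lower bound only.
    posterior > threshold.match
  } else {
    # Two numbers: keep pairs strictly inside the range (assumption).
    posterior > threshold.match[1] & posterior < threshold.match[2]
  }
}

posterior <- c(0.99, 0.93, 0.90, 0.70)

single <- declare_match(posterior, 0.85)
ranged <- declare_match(posterior, c(0.85, 0.95))

print(data.frame(posterior, single, ranged))
```

The range form is useful for pulling out borderline pairs, for example for manual review, while the single-number form keeps everything above the bound.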

return.all

Whether to return the most likely match for each observation in dfA and dfB. Overrides user setting of threshold.match by setting threshold.match to 0.0001, and automatically dedupes all matches. Default is FALSE.

return.df

Whether to return the full data frames dfA and dfB instead of just the indices. Default is FALSE.

verbose

Whether to print elapsed time for each step. Default is FALSE.

Value

fastLink returns a list of class 'fastLink' containing the following components if calculating matches:

matches

An nmatches x 2 matrix. The first column contains the indices of the matched observations in dfA; the second column contains the indices of the corresponding observations in dfB.

EM

A list with the output of the EM algorithm, which contains the exact matching patterns and the associated posterior probabilities of a match for each matching pattern.

patterns

A matrix with the observed matching patterns for each successfully matched pair.

nobs.a

The number of observations in dataset A.

nobs.b

The number of observations in dataset B.

zeta.name

If reweighting by name, the posterior probability of a match for each matched pair in datasets A and B.

If only running the EM and not returning the matched indices, fastLink only returns the EM object.
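To show how the 'matches' component described above could be used, the base R sketch below pairs rows from two toy data frames using a hand-written index matrix of the documented shape (indices into dfA in the first column, indices into dfB in the second). The data and the matrix are invented for illustration and do not come from a fastLink run.

```r
# Toy datasets standing in for dfA and dfB.
dfA <- data.frame(firstname = c("ana", "ben", "cy"),
                  birthyear = c(1980, 1975, 1990),
                  stringsAsFactors = FALSE)
dfB <- data.frame(firstname = c("benn", "anna"),
                  birthyear = c(1975, 1980),
                  stringsAsFactors = FALSE)

# Hypothetical matches matrix: column 1 indexes dfA, column 2 indexes dfB.
matches <- cbind(c(1, 2), c(2, 1))

# Line up the matched rows side by side.
linked <- data.frame(
  firstname.a = dfA$firstname[matches[, 1]],
  firstname.b = dfB$firstname[matches[, 2]],
  birthyear.a = dfA$birthyear[matches[, 1]],
  birthyear.b = dfB$birthyear[matches[, 2]],
  stringsAsFactors = FALSE
)

print(linked)
```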

Author(s)

Ted Enamorado <ted.enamorado@gmail.com>, Ben Fifield <benfifield@gmail.com>, and Kosuke Imai

Examples

## Not run: 
library(fastLink)
## dfA and dfB are assumed to be data frames that both contain the columns below
fl.out <- fastLink(dfA, dfB,
                   varnames = c("firstname", "lastname", "streetname", "birthyear"),
                   n.cores = 1)

## End(Not run)

fastLink documentation built on Nov. 17, 2023, 9:06 a.m.