stochastic: Stochastic record linkage.

stochastic    R Documentation

Stochastic record linkage.

Description

Methods for stochastic record linkage following the framework of Fellegi and Sunter.

Usage



## S4 method for signature 'RecLinkData'
fsWeights(rpairs, m = 0.95, u = rpairs$frequencies, cutoff = 1)
## S4 method for signature 'RLBigData'
fsWeights(rpairs, m = 0.95, u = getFrequencies(rpairs),
    cutoff = 1, withProgressBar = (sink.number() == 0))
## S4 method for signature 'RecLinkData'
fsClassify(rpairs, ...)
## S4 method for signature 'RLBigData'
fsClassify(rpairs, threshold.upper, threshold.lower = threshold.upper,
    m = 0.95, u = getFrequencies(rpairs),
    withProgressBar = (sink.number() == 0), cutoff = 1)

Arguments

rpairs

The record pairs to be classified.

threshold.upper

A numeric value between 0 and 1.

threshold.lower

A numeric value between 0 and 1, less than or equal to threshold.upper. Defaults to threshold.upper.

m, u

Numeric vectors. m- and u-probabilities of matching variables, see Details.

withProgressBar

Logical. Whether to display a progress bar.

cutoff

Numeric value. Threshold for converting string comparison values to binary values.

...

Arguments passed to emClassify.

Details

These methods perform stochastic record linkage following the framework of Fellegi and Sunter (see reference).

fsWeights calculates matching weights on an object based on the specified m- and u-probabilities. Each of m and u can be a numeric vector or a single number in the range [0, 1].
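The weight calculation can be illustrated with a minimal base R sketch of the textbook Fellegi-Sunter weights. It is not the package's implementation: the helper fs_weights_sketch, the default u = 0.05 and the rule that comparison values at or above cutoff count as agreement are assumptions made for this illustration.

```r
# Hypothetical helper, not part of RecordLinkage: textbook Fellegi-Sunter
# weights for a matrix of comparison values
# (rows = record pairs, columns = matching variables).
fs_weights_sketch <- function(cmp, m = 0.95, u = 0.05, cutoff = 1) {
  cmp <- as.matrix(cmp)
  k <- ncol(cmp)
  # A single number is recycled to one probability per matching variable.
  m <- rep_len(m, k)
  u <- rep_len(u, k)
  # Assumption: values at or above the cutoff count as agreement (1),
  # everything else as disagreement (0).
  agree <- (cmp >= cutoff) * 1
  w_agree <- log2(m / u)                 # contribution of an agreeing variable
  w_disagree <- log2((1 - m) / (1 - u))  # contribution of a disagreeing variable
  # Sum the per-variable contributions for every record pair.
  drop(agree %*% w_agree + (1 - agree) %*% w_disagree)
}

# Three record pairs compared on three variables; 0.8 is a string
# similarity that falls below the default cutoff of 1.
cmp <- rbind(c(1, 1, 1), c(1, 0.8, 0), c(0, 0, 0))
w <- fs_weights_sketch(cmp, m = 0.95, u = c(0.01, 0.1, 0.2))
print(w)
```

In this sketch, agreement on a variable with a small u-probability (agreement is rare among non-matches) contributes a large positive weight, while disagreement contributes a negative weight.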

fsClassify performs classification based on the calculated weights. All record pairs with weights greater than or equal to threshold.upper are classified as links. Record pairs with weights smaller than threshold.upper and greater than or equal to threshold.lower are classified as possible links. All remaining record pairs are classified as non-links.
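The threshold rule can be sketched in base R as follows. The helper classify_sketch and the labels "L" (link), "P" (possible link) and "N" (non-link) are assumptions chosen for this illustration, not the package's code.

```r
# Hypothetical helper, not part of RecordLinkage: three-way classification
# of weights by an upper and a lower threshold.
classify_sketch <- function(w, threshold.upper,
                            threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  out <- rep("N", length(w))           # default: non-link
  out[w >= threshold.lower] <- "P"     # at or above the lower threshold
  out[w >= threshold.upper] <- "L"     # at or above the upper threshold
  factor(out, levels = c("N", "P", "L"))
}

weights <- c(12.3, 4.0, 0, -7.5)
pred <- classify_sketch(weights, threshold.upper = 5, threshold.lower = 0)
print(table(pred))

# With a single threshold, no pair is classified as a possible link.
pred_single <- classify_sketch(weights, threshold.upper = 0)
```

When threshold.lower is left at its default, the range of possible links is empty and every pair is either a link or a non-link.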

The "RecLinkData" method is a shortcut for emClassify.

The "RLBigData" method checks whether weights are present in the underlying database. If so, classification is based on the existing weights. If not, weights are calculated on the fly during classification but not stored. The latter behaviour might be preferable when a very large dataset is to be classified and disk space is limited. A progress bar is displayed only if weights are calculated on the fly; by default, it is suppressed when output is diverted by sink (e.g. in a Sweave script).

For a general introduction to weight based record linkage, see the vignette "Weight-based deduplication".

Value

fsWeights returns a copy of the object with the calculated weights added. Note that "RLBigData" objects have some reference-style semantics; see clone for more information.

For the "RecLinkData" method, fsClassify returns an S3 object of class "RecLinkResult" that represents a copy of rpairs with the additional element prediction, which stores the classification result.

For the "RLBigData" method, fsClassify returns an S4 object of class "RLResult".

Author(s)

Andreas Borg, Murat Sariyar

References

Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.

See Also

epiWeights

Examples

# generate record pairs
data(RLdata500)
rpairs <- compare.dedup(RLdata500, blockfld=list(1,3,5,6,7), identity=identity.RLdata500)

# calculate weights
rpairs <- fsWeights(rpairs)

# classify and show results
summary(fsClassify(rpairs,0))

RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.