stochastic | R Documentation |
Methods for stochastic record linkage following the framework of Fellegi and Sunter.
## S4 method for signature 'RecLinkData'
fsWeights(rpairs, m = 0.95, u = rpairs$frequencies, cutoff = 1)

## S4 method for signature 'RLBigData'
fsWeights(rpairs, m = 0.95, u = getFrequencies(rpairs), cutoff = 1,
  withProgressBar = (sink.number() == 0))

## S4 method for signature 'RecLinkData'
fsClassify(rpairs, ...)

## S4 method for signature 'RLBigData'
fsClassify(rpairs, threshold.upper, threshold.lower = threshold.upper,
  m = 0.95, u = getFrequencies(rpairs),
  withProgressBar = (sink.number() == 0), cutoff = 1)
rpairs |
The record pairs to be classified. |
threshold.upper |
A numeric value between 0 and 1. |
threshold.lower |
A numeric value between 0 and 1, lower than threshold.upper. |
m, u |
Numeric vectors. m- and u-probabilities of matching variables, see Details. |
withProgressBar |
Logical. Whether to display a progress bar. |
cutoff |
Numeric value. Threshold for converting string comparison values to binary values. |
... |
Arguments passed to emClassify. |
These methods perform stochastic record linkage following the framework of Fellegi and Sunter (see reference).
fsWeights calculates matching weights on an object based on the specified m- and u-probabilities. Each of m and u can be a numeric vector or a single number in the range [0, 1].
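As a point of orientation, the matching weight in the standard Fellegi-Sunter formulation can be written as below. This is a sketch of the textbook form, not a statement of the package's exact computation: the base-2 logarithm, the conditional independence of the matching variables, and the binary agreement indicator \gamma_i (1 if a pair agrees on variable i, 0 otherwise) are assumptions made for illustration.

```latex
w(\gamma) = \sum_{i=1}^{K} \left[ \gamma_i \log_2 \frac{m_i}{u_i}
          + (1 - \gamma_i) \log_2 \frac{1 - m_i}{1 - u_i} \right]
```

For example, with m_i = 0.95 and a hypothetical u_i = 0.01, agreement on variable i contributes about log2(95) ≈ 6.57 to the weight, while disagreement contributes about log2(0.05 / 0.99) ≈ -4.31.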
fsClassify performs classification based on the calculated weights. All record pairs with weights greater than or equal to threshold.upper are classified as links. Record pairs with weights smaller than threshold.upper and greater than or equal to threshold.lower are classified as possible links. All remaining record pairs are classified as non-links.
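The threshold rule above can be sketched in a few lines of base R. The helper classify_by_weight is hypothetical and not part of the package; the labels "N", "P" and "L" for non-links, possible links and links are chosen here for illustration.

```r
# Hypothetical helper illustrating the two-threshold rule; not a package function.
classify_by_weight <- function(w, threshold.upper,
                               threshold.lower = threshold.upper) {
  stopifnot(is.numeric(w), threshold.lower <= threshold.upper)
  out <- rep("N", length(w))        # default: non-link
  out[w >= threshold.lower] <- "P"  # at or above the lower threshold: possible link
  out[w >= threshold.upper] <- "L"  # at or above the upper threshold: link
  factor(out, levels = c("N", "P", "L"))
}

# Example with made-up weights
classify_by_weight(c(-3, 0.5, 2, 7), threshold.upper = 2, threshold.lower = 0)
```

When threshold.lower is left at its default, it equals threshold.upper and no pair is classified as a possible link.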
The "RecLinkData" method is a shortcut for emClassify.
The "RLBigData" method checks if weights are present in the underlying database. If this is the case, classification is based on the existing weights. If not, weights are calculated on the fly during classification, but not stored. The latter behaviour might be preferable when a very large dataset is to be classified and disk space is limited.
A progress bar is displayed only if weights are calculated on the fly. By default, it is suppressed when output is diverted by sink (e.g. in a Sweave script).
For a general introduction to weight-based record linkage, see the vignette "Weight-based deduplication".
fsWeights returns a copy of the object with the calculated weights added. Note that "RLBigData" objects have some reference-style semantics, see clone for more information.
For the "RecLinkData" method, fsClassify returns an S3 object of class "RecLinkResult" that represents a copy of rpairs with an added element prediction, which stores the classification result.
For the "RLBigData" method, fsClassify returns an S4 object of class "RLResult".
Andreas Borg, Murat Sariyar
Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.
epiWeights
library(RecordLinkage)

# generate record pairs
data(RLdata500)
rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7),
  identity = identity.RLdata500)

# calculate weights
rpairs <- fsWeights(rpairs)

# classify and show results
summary(fsClassify(rpairs, 0))