emClassify | R Documentation |
Classifies data pairs to which weights were assigned by emWeights. Classification is based on user-defined thresholds or predefined error rates.
emClassify(rpairs, threshold.upper = Inf,
  threshold.lower = threshold.upper, my = Inf, ny = Inf, ...)

## S4 method for signature 'RecLinkData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
  threshold.lower = threshold.upper, my = Inf, ny = Inf)

## S4 method for signature 'RLBigData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
  threshold.lower = threshold.upper, my = Inf, ny = Inf,
  withProgressBar = (sink.number()==0))
rpairs |
A RecLinkData or RLBigData object with weights assigned by emWeights. |
my |
A probability. Error bound for false positives. |
ny |
A probability. Error bound for false negatives. |
threshold.upper |
A numeric value. Threshold for links. |
threshold.lower |
A numeric value. Threshold for possible links. |
withProgressBar |
Whether to display a progress bar. |
... |
Placeholder for method-specific arguments. |
Two general approaches are implemented. The classical procedure by Fellegi and Sunter (see references) minimizes the number of possible links with given error levels for false links (my) and false non-links (ny).
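The idea behind the error-bound approach can be sketched in base R. This is a simplified, deterministic illustration and not the package's implementation: the function name fs_classify, the toy m- and u-probabilities, the log2 weight and the labels "L", "P" and "N" are all assumptions made for this example. Comparison patterns are sorted by decreasing weight; links are taken from the top while the accumulated u-probability stays within my, and non-links from the bottom while the accumulated m-probability stays within ny.

```r
# Simplified sketch of choosing link / non-link regions from error bounds.
# All names and numbers here are hypothetical, for illustration only.
fs_classify <- function(m, u, my, ny) {
  w   <- log2(m / u)                   # assumed weight: log-likelihood ratio
  ord <- order(w, decreasing = TRUE)   # patterns from highest to lowest weight

  # Error mass of false links if the top k patterns are declared links
  cum_u <- cumsum(u[ord])
  # Error mass of false non-links if pattern k and all below are non-links
  cum_m <- rev(cumsum(rev(m[ord])))

  n_link <- sum(cum_u <= my)           # size of the link region
  n_non  <- sum(cum_m <= ny)           # size of the non-link region

  pred <- rep("P", length(w))          # everything else is a possible link
  pred[seq_len(n_link)] <- "L"
  non_idx <- rev(seq_along(w))[seq_len(n_non)]
  pred[setdiff(non_idx, seq_len(n_link))] <- "N"

  out <- character(length(w))
  out[ord] <- pred                     # restore the original pattern order
  out
}

# Toy example with four comparison patterns
m <- c(0.60, 0.30, 0.08, 0.02)     # P(pattern | true match)
u <- c(0.001, 0.009, 0.09, 0.90)   # P(pattern | true non-match)
fs_result <- fs_classify(m, u, my = 0.05, ny = 0.05)
```

In this toy case the two highest-weight patterns fall into the link region, the lowest into the non-link region, and the remaining one is left as a possible link. The sketch only covers the case where both bounds are supplied.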
The second approach requires thresholds for links and possible links to be set by the user. A pair with weight w is classified as a link if w >= threshold.upper, as a possible link if threshold.upper > w >= threshold.lower, and as a non-link if w < threshold.lower.
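The threshold rule can be written out as a small base R sketch. The helper classify_by_threshold, the example weights and the labels "L", "P" and "N" are hypothetical and chosen only for illustration; the sketch does not call the package.

```r
# Three-way classification of weights by two thresholds (illustrative only).
classify_by_threshold <- function(w, threshold.upper,
                                  threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  ifelse(w >= threshold.upper, "L",          # link
         ifelse(w >= threshold.lower, "P",   # possible link
                "N"))                        # non-link
}

weights <- c(12.4, 3.1, 0.5, -4.2)           # made-up pair weights
pred <- classify_by_threshold(weights, threshold.upper = 5,
                              threshold.lower = 0)
# 12.4 is a link, 3.1 and 0.5 are possible links, -4.2 is a non-link
```

When only one threshold is passed, threshold.lower defaults to threshold.upper and no pair can end up as a possible link.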
If threshold.upper or threshold.lower is given, the threshold-based approach is used; otherwise, if one of the error bounds is given, the Fellegi-Sunter model is used. If only my is supplied, links are chosen to meet the error bound and all other pairs are classified as non-links (the equivalent case holds if only ny is specified). If no arguments other than rpairs are given, a single threshold of 0 is used.
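The precedence described above can be summarized in a short base R sketch. The function choose_approach is hypothetical, and treating an argument as "given" when it is finite is an assumption made for this example; the package may detect supplied arguments differently.

```r
# Which approach applies for a given combination of arguments (sketch only).
# Assumption: an argument counts as "given" if it is not left at Inf.
choose_approach <- function(threshold.upper = Inf,
                            threshold.lower = threshold.upper,
                            my = Inf, ny = Inf) {
  if (is.finite(threshold.upper) || is.finite(threshold.lower)) {
    "threshold-based"            # thresholds take precedence over error bounds
  } else if (is.finite(my) || is.finite(ny)) {
    "Fellegi-Sunter"             # at least one error bound supplied
  } else {
    "single threshold of 0"      # nothing but rpairs supplied
  }
}

choose_approach(threshold.upper = 5, my = 0.01)  # thresholds win
choose_approach(ny = 0.05)                       # error bound only
choose_approach()                                # defaults
```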
For the "RecLinkData" method, an S3 object of class "RecLinkResult" that represents a copy of rpairs with the additional element prediction, which stores the classification result.
For the "RLBigData" method, an S4 object of class "RLResult".
The quality of classification with the Fellegi-Sunter method depends strongly on reasonable estimates of the m- and u-probabilities. The results should be evaluated critically.
Andreas Borg, Murat Sariyar
Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.
getPairs to produce output from which thresholds can be determined conveniently.