emClassify: Weight-based Classification of Data Pairs

emClassify    R Documentation

Weight-based Classification of Data Pairs

Description

Classifies data pairs to which weights were assigned by emWeights, based on user-defined thresholds or predefined error rates.

Usage

emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf, ...)

## S4 method for signature 'RecLinkData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf)

## S4 method for signature 'RLBigData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf,
    withProgressBar = (sink.number() == 0))

Arguments

rpairs

A RecLinkData or RLBigData object with weight information.

my

A probability. Error bound for false positives.

ny

A probability. Error bound for false negatives.

threshold.upper

A numeric value. Threshold for links.

threshold.lower

A numeric value. Threshold for possible links.

withProgressBar

A logical value. Whether to display a progress bar.

...

Placeholder for method-specific arguments.

Details

Two general approaches are implemented. The classical procedure by Fellegi and Sunter (see references) minimizes the number of possible links with given error levels for false links (my) and false non-links (ny).
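The idea behind the error-bound approach can be illustrated with a small base R sketch. It is a simplified illustration under stated assumptions, not the code used by emClassify: the function name fs_classify and the m- and u-probabilities are made up, and boundary cases such as ties are ignored. Patterns are sorted by weight, the top patterns become links as long as their cumulative u-probability stays within my, and the bottom patterns become non-links as long as their cumulative m-probability stays within ny.

```r
# Made-up m- and u-probabilities for four comparison patterns (illustration only).
patterns <- data.frame(
  pattern = c("11", "10", "01", "00"),
  m = c(0.80, 0.10, 0.08, 0.02),
  u = c(0.01, 0.09, 0.10, 0.80),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of RecordLinkage.
fs_classify <- function(p, my, ny) {
  p$weight <- log2(p$m / p$u)
  p <- p[order(p$weight, decreasing = TRUE), ]
  # False-link rate if this pattern and all higher-weighted ones are links.
  false_pos <- cumsum(p$u)
  # False non-link rate if this pattern and all lower-weighted ones are non-links.
  false_neg <- rev(cumsum(rev(p$m)))
  p$class <- "possible link"
  p$class[false_pos <= my] <- "link"
  # If both regions overlapped, the non-link label would win in this sketch.
  p$class[false_neg <= ny] <- "non-link"
  rownames(p) <- NULL
  p
}

result <- fs_classify(patterns, my = 0.05, ny = 0.05)
print(result[, c("pattern", "weight", "class")])
```

With these made-up numbers only the pattern "11" meets the bound for links and only "00" meets the bound for non-links, so the two patterns in between remain possible links.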

The second approach requires thresholds for links and possible links to be set by the user. A pair with weight w is classified as a link if w >= threshold.upper, as a possible link if threshold.upper > w >= threshold.lower and as a non-link if w < threshold.lower.
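The threshold rule can be written out in a few lines of base R. This is a minimal sketch for illustration: classify_by_threshold is a hypothetical helper, not a function of the package, and the weights and class labels are made up.

```r
# Hypothetical helper, not part of RecordLinkage: applies the three-way
# threshold rule to a numeric vector of weights.
classify_by_threshold <- function(w, upper, lower = upper) {
  stopifnot(is.numeric(w), lower <= upper)
  label <- rep("non-link", length(w))
  label[w >= lower] <- "possible link"  # at or above the lower threshold
  label[w >= upper] <- "link"           # at or above the upper threshold
  factor(label, levels = c("non-link", "possible link", "link"))
}

weights <- c(-3.2, 0.5, 4.0, 7.5, 12.1)  # made-up weights
classes <- classify_by_threshold(weights, upper = 7.5, lower = 0.5)
print(data.frame(weight = weights, class = classes))
```

When lower is left at its default, both thresholds coincide and no pair is labelled as a possible link, which mirrors the default threshold.lower = threshold.upper in the usage above.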

If threshold.upper or threshold.lower is given, the threshold-based approach is used. Otherwise, if at least one of the error bounds is given, the Fellegi-Sunter model is used. If only my is supplied, links are chosen to meet the error bound and all other pairs are classified as non-links; the equivalent holds if only ny is specified. If no arguments other than rpairs are given, a single threshold of 0 is used.
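The two ways of calling the function might look as follows. This is a sketch that assumes the RecordLinkage package is installed and uses its bundled RLdata500 example data; the blocking fields, threshold values and error bounds are arbitrary choices for illustration, not recommendations.

```r
# Runs only if RecordLinkage is available.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata500)

  # Build record pairs and estimate weights with the EM algorithm.
  rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                          blockfld = list(1, 5:7))
  rpairs <- emWeights(rpairs)

  # Threshold-based approach: thresholds given, error bounds left at Inf.
  res_thr <- emClassify(rpairs, threshold.upper = 10, threshold.lower = 0)

  # Fellegi-Sunter approach: error bounds given, thresholds left at Inf.
  res_fs <- emClassify(rpairs, my = 0.05, ny = 0.05)

  print(table(res_thr$prediction))
  print(table(res_fs$prediction))
}
```

Comparing the two tables shows how the choice of approach changes the number of links, possible links and non-links for the same weights.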

Value

For the "RecLinkData" method, an S3 object of class "RecLinkResult" that is a copy of rpairs with an additional element prediction, which stores the classification result.

For the "RLBigData" method, an S4 object of class "RLResult".

Note

The classification quality of the Fellegi-Sunter method relies strongly on reasonable estimates of the m- and u-probabilities. The results should be evaluated critically.

Author(s)

Andreas Borg, Murat Sariyar

References

Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.

See Also

getPairs to produce output from which thresholds can be determined conveniently.


RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.