optimalThreshold | R Documentation |
Calculates the optimal threshold for weight-based Record Linkage.
optimalThreshold(rpairs, my = NaN, ny = NaN)

## S4 method for signature 'RecLinkData'
optimalThreshold(rpairs, my = NaN, ny = NaN)

## S4 method for signature 'RLBigData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
rpairs | Record pairs for which to calculate a threshold.
my | A real value in the range [0,1]. Error bound for false positives.
ny | A real value in the range [0,1]. Error bound for false negatives.
Weights must have been calculated for rpairs, for example by emWeights or epiWeights.
The true match status must be known for rpairs; it is usually provided through the identity argument of the compare.* functions.
For the following, it is assumed that all records with weights greater than or
equal to the threshold are classified as links, the remaining as non-links.
If no further arguments are given, a threshold which minimizes the
absolute number of misclassified record pairs is returned. If my is
supplied (ny is ignored in this case), a threshold is picked which
maximizes the number of correctly classified links while keeping the ratio
of false links to the total number of links less than or equal to my.
If only ny is supplied, the number of correct non-links is maximized under the
condition that the ratio of falsely classified non-links to the total number of
non-links does not exceed ny.
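The three selection rules described above can be illustrated with a small base R sketch. This is not the package's implementation: the helper threshold_errors, the toy weights and the match labels are invented for illustration, and "total number of links" is read here as the number of pairs classified as links at a given threshold, which is an assumption.

```r
# Toy data (invented): linkage weights and the known match status of each pair.
weights  <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10)
is_match <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Hypothetical helper: error counts for every candidate threshold.
# A pair is classified as a link if its weight is >= the threshold.
threshold_errors <- function(weights, is_match) {
  candidates <- sort(unique(weights))
  stats <- lapply(candidates, function(t) {
    link <- weights >= t
    data.frame(
      threshold      = t,
      n_links        = sum(link),
      n_nonlinks     = sum(!link),
      false_links    = sum(link & !is_match),
      false_nonlinks = sum(!link & is_match)
    )
  })
  do.call(rbind, stats)
}

errs <- threshold_errors(weights, is_match)

# Error ratios; an empty class is treated as having ratio 0 (assumption).
ratio_fl <- ifelse(errs$n_links > 0, errs$false_links / errs$n_links, 0)
ratio_fn <- ifelse(errs$n_nonlinks > 0, errs$false_nonlinks / errs$n_nonlinks, 0)

# No bounds: minimize the absolute number of misclassified pairs.
best <- errs$threshold[which.min(errs$false_links + errs$false_nonlinks)]

# Bound my on false links: maximize correct links among admissible thresholds.
my <- 0.1
ok_my <- errs[ratio_fl <= my, ]
upper <- ok_my$threshold[which.max(ok_my$n_links - ok_my$false_links)]

# Bound ny on false non-links: maximize correct non-links.
ny <- 0.1
ok_ny <- errs[ratio_fn <= ny, ]
lower <- ok_ny$threshold[which.max(ok_ny$n_nonlinks - ok_ny$false_nonlinks)]

print(c(best = best, upper = upper, lower = lower))
```

For this toy data the unbounded rule and the ny rule both pick 0.5, while the my rule picks 0.8, so pairs with weights between the two would remain undecided.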
Two separate runs of optimalThreshold, one with a value for my and one
with a value for ny, allow for obtaining a lower and an upper threshold
for a three-way classification approach (yielding links, non-links and
possible links).
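As a sketch of the three-way approach, the two thresholds can be passed to epiClassify, which accepts threshold.upper and threshold.lower. The error bounds of 0.05 are arbitrary values chosen for illustration, and ordering the two results with max and min is a precaution assumed here, since nothing above guarantees which run yields the larger value.

```r
library(RecordLinkage)

# Build weighted record pairs and split them into training and validation sets.
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500,
                   strcmp = TRUE, strcmpfun = levenshteinSim)
p <- epiWeights(p)
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)

# One run per error bound (0.05 is an arbitrary illustrative choice).
t_my <- optimalThreshold(l$train, my = 0.05)
t_ny <- optimalThreshold(l$train, ny = 0.05)

# Order the two thresholds defensively before classifying.
upper <- max(t_my, t_ny)
lower <- min(t_my, t_ny)

# Pairs with weights between lower and upper become possible links.
result <- epiClassify(l$valid, threshold.upper = upper, threshold.lower = lower)
summary(result)
```

Looser error bounds bring the two thresholds closer together and shrink the set of possible links that need manual review.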
A numeric value, the calculated threshold.
Andreas Borg, Murat Sariyar
emWeights
emClassify
epiWeights
epiClassify
# create record pairs
data(RLdata500)
p=compare.dedup(RLdata500, identity=identity.RLdata500, strcmp=TRUE,
  strcmpfun=levenshteinSim)

# calculate weights
p=epiWeights(p)

# split record pairs in two sets
l=splitData(dataset=p, prop=0.5, keep.mprop=TRUE)

# get threshold from training set
threshold=optimalThreshold(l$train)

# classify remaining data
summary(epiClassify(l$valid, threshold))