| optimalThreshold | R Documentation |
Calculates the optimal threshold for weight-based Record Linkage.
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RecLinkData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RLBigData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
| rpairs | Record pairs for which to calculate a threshold. |
| my | A real value in the range [0,1]. Error bound for false positives. |
| ny | A real value in the range [0,1]. Error bound for false negatives. |
Weights must have been calculated for rpairs, for example by
emWeights or epiWeights.
The true match status must be known for rpairs; this is usually provided
through the identity argument of the compare.* functions (for example
compare.dedup).
In the following, it is assumed that all record pairs with weights greater than
or equal to the threshold are classified as links and the remaining pairs as
non-links.
If no further arguments are given, the threshold that minimizes the
absolute number of misclassified record pairs is returned. If my is
supplied (ny is ignored in this case), a threshold is picked which
maximizes the number of correctly classified links while keeping the ratio
of false links to the total number of links less than or equal to my.
If only ny is supplied, the number of correct non-links is maximized under the
condition that the ratio of falsely classified non-links to the total number of
non-links does not exceed ny.
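The default behaviour (minimizing the absolute number of misclassified pairs) can be illustrated with a small base R sketch. This is not the package's implementation: the helper name min_error_threshold, the toy weights and the toy match status are all made up for illustration, and the search simply tries every observed weight as a candidate threshold.

```r
# Hypothetical helper, not part of RecordLinkage: brute-force search for the
# threshold that minimizes the number of misclassified record pairs.
min_error_threshold <- function(weights, is_match) {
  # Candidate thresholds: every observed weight, plus Inf ("no links at all")
  candidates <- c(sort(unique(weights)), Inf)
  errors <- vapply(candidates, function(t) {
    is_link <- weights >= t                      # classification rule
    false_links <- sum(is_link & !is_match)      # non-matches classified as links
    false_non_links <- sum(!is_link & is_match)  # matches classified as non-links
    false_links + false_non_links
  }, integer(1))
  # which.min() picks the lowest candidate if several share the minimum
  list(threshold = candidates[which.min(errors)], errors = min(errors))
}

# Toy data (assumed for illustration): weights and known match status
weights  <- c(0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.7, 0.9)
is_match <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

result <- min_error_threshold(weights, is_match)
print(result$threshold)
```

With these toy values, a threshold of 0.6 leaves a single misclassified pair (the match with weight 0.35), which is fewer than any other candidate.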
Two separate runs of optimalThreshold with values for my and
ny respectively allow for obtaining a lower and an upper threshold
for a three-way classification approach (yielding links, non-links and
possible links).
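How a lower and an upper threshold translate into three classes can be sketched in base R as follows. The helper classify_three_way, the class labels and the threshold values are assumptions chosen for illustration; in practice the two thresholds would come from separate optimalThreshold runs and be passed to a classification function of the package.

```r
# Hypothetical helper, not part of RecordLinkage: three-way classification
# of weights given a lower and an upper threshold.
classify_three_way <- function(weights, lower, upper) {
  stopifnot(lower <= upper)
  label <- ifelse(weights >= upper, "link",
                  ifelse(weights >= lower, "possible link", "non-link"))
  factor(label, levels = c("non-link", "possible link", "link"))
}

# Toy weights and thresholds (assumed values for illustration)
weights <- c(0.1, 0.35, 0.5, 0.6, 0.9)
classes <- classify_three_way(weights, lower = 0.35, upper = 0.6)
print(table(classes))
```

Pairs whose weight falls between the two thresholds end up as possible links, which is the group typically left for manual review.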
A numeric value, the calculated threshold.
Andreas Borg, Murat Sariyar
emWeights
emClassify
epiWeights
epiClassify
# load the package that provides the functions and data used below
library(RecordLinkage)
# create record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
                   strcmpfun = levenshteinSim)
# calculate weights
p <- epiWeights(p)
# split record pairs into two sets
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)
# get threshold from training set
threshold <- optimalThreshold(l$train)
# classify remaining data
summary(epiClassify(l$valid, threshold))