epiWeights | R Documentation |
Calculates weights for record pairs based on the EpiLink approach (see references).
epiWeights(rpairs, e = 0.01, f, ...)
## S4 method for signature 'RecLinkData'
epiWeights(rpairs, e = 0.01, f = rpairs$frequencies)
## S4 method for signature 'RLBigData'
epiWeights(rpairs, e = 0.01, f = getFrequencies(rpairs),
withProgressBar = (sink.number()==0))
rpairs | The record pairs for which to compute weights. See details. |
e | Numeric vector. Estimated error rate(s). |
f | Numeric vector. Average frequency of attribute values. |
withProgressBar | Whether to display a progress bar. |
... | Placeholder for method-specific arguments. |
This function calculates weights for record pairs based on the approach used by Contiero et alia in the EpiLink record linkage software (see references).
Since package version 0.3, this is a generic function with methods for S3 objects of class RecLinkData as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".
The weight for a record pair (x^{1},x^{2}) is computed by the formula
\frac{\sum_{i}w_{i}s(x^{1}_{i},x^{2}_{i})}{\sum_{i}w_{i}}
where s(x^{1}_{i},x^{2}_{i}) is the value of a string comparison of records x^{1} and x^{2} in the i-th field and w_{i} is a weighting factor computed by
w_{i}=\log_{2}\frac{1-e_{i}}{f_{i}}
where f_{i} denotes the average frequency of values and e_{i} the estimated error rate for field i.
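As a rough illustration of the formula above, the following base R sketch computes the weight of a single record pair by hand. The error rates, frequencies and comparison values are made-up numbers chosen for illustration, and the logarithm is assumed to apply to the whole quotient (1 - e_i) / f_i; the sketch does not call the package itself.

```r
# Hypothetical values for three fields (not taken from any real data set)
e <- c(0.01, 0.01, 0.01)   # assumed error rates e_i
f <- c(0.001, 0.01, 0.1)   # assumed average value frequencies f_i
s <- c(1.0, 0.8, 0.0)      # assumed string comparison values s_i in [0, 1]

# Per-field weighting factors w_i = log2((1 - e_i) / f_i)
w <- log2((1 - e) / f)

# Weight of the pair: weighted mean of the comparison values
epilink_weight <- sum(w * s) / sum(w)
print(epilink_weight)
```

Fields with rare values (small f_i) get larger weighting factors, so agreement on them moves the pair weight more than agreement on fields with common values.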
String comparison values are taken from the record pairs as they were generated with compare.dedup or compare.linkage. The use of binary patterns is possible, but in general yields poor results.
The average frequency of values is by default taken from the object rpairs. Both the frequency f and the error rate e can be set to a single value, which will be recycled, or to a vector with a distinct value for every field.
The error rate(s) and frequency(ies) must satisfy
e_{i}\leq 1-f_{i}
for all i, otherwise the function fails. Also, some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a warning is issued.
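The condition can be checked before calling the function. The sketch below uses a hypothetical helper, epilink_params_ok, which is not part of the package; it recycles single values the way the text describes and tests e_i <= 1 - f_i field by field. The weighting factor is again assumed to be log2((1 - e_i) / f_i), so a violated condition corresponds to a negative factor.

```r
# Hypothetical helper (not part of RecordLinkage): TRUE for every field
# whose error rate and frequency satisfy e_i <= 1 - f_i.
epilink_params_ok <- function(e, f) {
  n <- max(length(e), length(f))
  e <- rep_len(e, n)   # recycle a single value across all fields
  f <- rep_len(f, n)
  e <= 1 - f
}

# A single error rate recycled over three assumed frequencies
ok <- epilink_params_ok(e = 0.01, f = c(0.001, 0.01, 0.1))

# An assumed combination that violates the condition
bad <- epilink_params_ok(e = 0.6, f = 0.5)
w_bad <- log2((1 - 0.6) / 0.5)   # negative weighting factor

print(ok)
print(bad)
```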
By default, the "RLBigDataDedup" method displays a progress bar unless output is diverted by sink, e.g. when processing a Sweave file.
A copy of rpairs with the weights attached. See the class documentation (RecLinkData, "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.
For the "RLBigData" method, the returned object is only a shallow copy in the sense that it links to the same ff data files as rpairs.
The "RLBigData" method creates a "ffvector" object, for which a disk file is created.
Andreas Borg, Murat Sariyar
P. Contiero et al., The EpiLink record linkage software, in: Methods of Information in Medicine 2005, 44 (1), 66–71.
epiClassify for classification based on EpiLink weights.
emWeights for a different approach for weight calculation.
# load the package that provides epiWeights and the example data
library(RecordLinkage)
# generate record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, strcmp = TRUE, strcmpfun = levenshteinSim,
  identity = identity.RLdata500, blockfld = list("by", "bm", "bd"))
# calculate weights
p <- epiWeights(p)
# classify and show results
summary(epiClassify(p, 0.6))