epiWeights | R Documentation |
Calculates weights for record pairs based on the EpiLink approach (see references).
epiWeights(rpairs, e = 0.01, f, ...)

## S4 method for signature 'RecLinkData'
epiWeights(rpairs, e = 0.01, f = rpairs$frequencies)

## S4 method for signature 'RLBigData'
epiWeights(rpairs, e = 0.01, f = getFrequencies(rpairs),
  withProgressBar = (sink.number()==0))
rpairs | The record pairs for which to compute weights. See details. |
e | Numeric vector. Estimated error rate(s). |
f | Numeric vector. Average frequency of attribute values. |
withProgressBar | Whether to display a progress bar. |
... | Placeholder for method-specific arguments. |
This function calculates weights for record pairs based on the approach used by Contiero et alia in the EpiLink record linkage software (see references).
Since package version 0.3, this is a generic function with methods for S3 objects of class RecLinkData as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".
The weight for a record pair (x1,x2) is computed by the formula
sum_i (w_i * s(x1_i, x2_i)) / sum_i w_i
where s(x1_i, x2_i) is the value of a string comparison of records x1 and x2 in the i-th field and w_i is a weighting factor computed by
w_i = log_2((1 - e_i) / f_i),
where f_i denotes the average frequency of values and e_i the estimated error rate for field i.
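The two formulas can be traced by hand in base R. The following is a minimal sketch, not the package's implementation: the helper name epilink_weight and the example values for s, f and e are hypothetical, and the field weight is read as log_2((1 - e_i) / f_i), the reading that keeps every w_i non-negative whenever e[i] <= 1 - f[i].

```r
# Hypothetical helper: EpiLink-style weight for a single record pair.
#   s: comparison values in [0, 1], one per field
#   f: average frequency of values, one per field
#   e: estimated error rate(s); a single value is recycled over all fields
epilink_weight <- function(s, f, e = 0.01) {
  e <- rep_len(e, length(f))
  w <- log2((1 - e) / f)   # field weights w_i
  sum(w * s) / sum(w)      # weighted mean of the comparison values
}

# Assumed example values: two fields, the first with rarer values.
f <- c(0.25, 0.5)

full_match    <- epilink_weight(c(1, 1), f, e = 0)  # agreement on both fields
partial_match <- epilink_weight(c(1, 0), f, e = 0)  # agreement on field 1 only

full_match
partial_match
```

With e = 0 the field weights are 2 and 1, so full agreement yields 1 and agreement on the rarer first field alone yields 2/3: fields with less frequent values contribute more to the result.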
String comparison values are taken from the record pairs as they were generated with compare.dedup or compare.linkage. The use of binary patterns is possible, but in general yields poor results.
The average frequency of values is by default taken from the object rpairs. Both the frequency f and the error rate e can be set to a single value, which will be recycled, or to a vector with a distinct value for every field.
The error rate(s) and frequency(ies) must satisfy e[i] <= 1-f[i] for all i, otherwise the function fails. Also, some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a warning is issued.
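The precondition can be checked before calling the function. This is a sketch in base R under the stated constraint only; check_epi_params is a hypothetical helper, not part of the package, and the package's own error message may differ.

```r
# Hypothetical pre-check mirroring the documented condition e[i] <= 1 - f[i].
check_epi_params <- function(e, f) {
  e <- rep_len(e, length(f))   # recycle a single error rate over all fields
  ok <- e <= 1 - f
  if (!all(ok)) {
    stop("e[i] <= 1 - f[i] violated for field(s): ",
         paste(which(!ok), collapse = ", "))
  }
  invisible(TRUE)
}

# Assumed example values: passes for all three fields.
check_epi_params(e = 0.01, f = c(0.1, 0.5, 0.9))

# Fails for the second field, because 0.2 > 1 - 0.9.
result <- try(check_epi_params(e = 0.2, f = c(0.1, 0.9)), silent = TRUE)
inherits(result, "try-error")
```

The condition is equivalent to (1 - e[i]) / f[i] >= 1, that is, to a non-negative field weight.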
By default, the "RLBigData" method displays a progress bar unless output is diverted by sink, e.g. when processing a Sweave file.
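The default value of withProgressBar is the expression sink.number()==0. The sketch below, in base R only, shows how that expression behaves while output is diverted; the helper name show_progress and the temporary file are illustrative assumptions.

```r
# Same expression as the documented default of withProgressBar.
show_progress <- function() sink.number() == 0

tmp <- tempfile()
sink(tmp)                          # divert output to a file
while_diverted <- show_progress()  # evaluated with one sink active
sink()                             # restore the previous output target
unlink(tmp)

while_diverted                     # FALSE: a sink was active
```

Passing withProgressBar explicitly overrides this default in either direction.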
A copy of rpairs with the weights attached. See the class documentation (RecLinkData, "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.
For the "RLBigData" method, the returned object is only a shallow copy in the sense that it links to the same ff data files and database file as rpairs.
The "RLBigData" method creates an "ffvector" object, for which a disk file is created.
Andreas Borg, Murat Sariyar
P. Contiero et al., The EpiLink record linkage software, in: Methods of Information in Medicine 2005, 44 (1), 66–71.
epiClassify for classification based on EpiLink weights.
emWeights for a different approach to weight calculation.
# load the package that provides the functions and data used below
library(RecordLinkage)

# generate record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, strcmp = TRUE, strcmpfun = levenshteinSim,
                   identity = identity.RLdata500,
                   blockfld = list("by", "bm", "bd"))

# calculate weights
p <- epiWeights(p)

# classify and show results
summary(epiClassify(p, 0.6))