epiWeights: Calculate EpiLink weights

epiWeights    R Documentation

Calculate EpiLink weights

Description

Calculates weights for record pairs based on the EpiLink approach (see references).

Usage

epiWeights(rpairs, e = 0.01, f, ...)

## S4 method for signature 'RecLinkData'
epiWeights(rpairs, e = 0.01, f = rpairs$frequencies)

## S4 method for signature 'RLBigData'
epiWeights(rpairs, e = 0.01, f = getFrequencies(rpairs),
    withProgressBar = (sink.number()==0))

Arguments

rpairs

The record pairs for which to compute weights. See details.

e

Numeric vector. Estimated error rate(s).

f

Numeric vector. Average frequency of attribute values.

withProgressBar

Logical. Whether to display a progress bar.

...

Placeholder for method-specific arguments.

Details

This function calculates weights for record pairs based on the approach used by Contiero et al. in the EpiLink record linkage software (see references).

Since package version 0.3, this is a generic function with methods for S3 objects of class RecLinkData as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".

The weight for a record pair (x1,x2) is computed by the formula

sum_i (w_i * s(x1_i, x2_i)) / sum_i w_i

where s(x1_i, x2_i) is the value of a string comparison of records x1 and x2 in the i-th field and w_i is a weighting factor computed by

w_i = log_2((1 - e_i) / f_i),

where f_i denotes the average frequency of values and e_i the estimated error rate for field i.
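As a rough illustration of the two formulas above, the following base R sketch computes the field weights and the resulting weight for a single record pair. It reads the field weight as log_2((1 - e_i) / f_i). The helper epilink_weight is hypothetical and not part of the package, and all frequencies, error rates and comparison values are made-up numbers.

```r
# Hypothetical helper (not part of RecordLinkage): EpiLink-style weight
# for one record pair.
#   s: string comparison values in [0, 1], one per field
#   e: estimated error rate(s)
#   f: average frequency of values, one per field
epilink_weight <- function(s, e, f) {
  w <- log2((1 - e) / f)   # field weights w_i
  sum(w * s) / sum(w)      # weighted mean of the comparison values
}

# Made-up values for three fields. e = 0 is chosen only so that the
# field weights come out as whole numbers; a single value is recycled.
f <- c(0.25, 0.125, 0.5)
e <- 0
s <- c(1, 0.5, 0)

w <- log2((1 - e) / f)            # field weights: 2, 3, 1
pair_weight <- epilink_weight(s, e, f)
print(pair_weight)                # (2*1 + 3*0.5 + 1*0) / 6 = 0.5833...
```

Because the result is a weighted mean of comparison values between 0 and 1, it also lies between 0 and 1 as long as all field weights are non-negative.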

String comparison values are taken from the record pairs as they were generated with compare.dedup or compare.linkage. The use of binary patterns is possible, but in general yields poor results.

The average frequency of values is by default taken from the object rpairs. Both the frequency f and the error rate e can be set to a single value, which will be recycled, or to a vector with a distinct value for every field.

The error rate(s) and frequency(ies) must satisfy e[i] <= 1-f[i] for all i, otherwise the function fails. Also, some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a warning is issued.
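The condition e[i] <= 1-f[i] can be checked before calling the function. The sketch below is a hypothetical pre-check in base R, not the package's own validation code, and the frequencies are made-up numbers. The condition amounts to (1 - e[i]) / f[i] >= 1, so a field that violates it would get a negative weight under the log_2((1 - e_i) / f_i) reading of the weight formula.

```r
# Hypothetical pre-check (not part of RecordLinkage): stop if any field
# violates e[i] <= 1 - f[i].
check_epi_params <- function(e, f) {
  e <- rep_len(e, length(f))   # recycle a single error rate across fields
  ok <- e <= 1 - f
  if (!all(ok)) {
    stop("e[i] <= 1 - f[i] violated for field(s): ",
         paste(which(!ok), collapse = ", "))
  }
  invisible(TRUE)
}

f <- c(0.002, 0.1, 0.95)       # made-up average frequencies

# Passes silently: both fields satisfy the condition.
check_epi_params(0.01, f[1:2])

# Fails for the third field, because 0.1 > 1 - 0.95.
msg <- tryCatch(check_epi_params(0.1, f),
                error = function(err) conditionMessage(err))
print(msg)
```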

By default, the "RLBigData" method displays a progress bar unless output is diverted by sink, e.g. when processing a Sweave file.

Value

A copy of rpairs with the weights attached. See the class documentation (RecLinkData, "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.

For the "RLBigData" method, the returned object is only a shallow copy in the sense that it links to the same ff data files and database file as rpairs.

Side effects

The "RLBigData" method creates an "ffvector" object, for which a disk file is created.

Author(s)

Andreas Borg, Murat Sariyar

References

P. Contiero et al., The EpiLink record linkage software, in: Methods of Information in Medicine 2005, 44 (1), 66–71.

See Also

epiClassify for classification based on EpiLink weights. emWeights for a different approach to weight calculation.

Examples

library(RecordLinkage)

# generate record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, strcmp = TRUE, strcmpfun = levenshteinSim,
  identity = identity.RLdata500, blockfld = list("by", "bm", "bd"))

# calculate weights
p <- epiWeights(p)

# classify with a threshold of 0.6 and show results
summary(epiClassify(p, 0.6))

RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.