stochastic: Stochastic record linkage.


Description

Methods for stochastic record linkage following the framework of Fellegi and Sunter.

Usage

## S4 method for signature 'RecLinkData'
fsWeights(rpairs, m = 0.95, u = rpairs$frequencies, cutoff = 1)
## S4 method for signature 'RLBigData'
fsWeights(rpairs, m=0.95, u=getFrequencies(rpairs),
    cutoff=1, withProgressBar = (sink.number()==0))
## S4 method for signature 'RecLinkData'
fsClassify(rpairs, ...)
## S4 method for signature 'RLBigData'
fsClassify(rpairs, threshold.upper, threshold.lower=threshold.upper, 
  m=0.95, u=getFrequencies(rpairs), withProgressBar = (sink.number()==0), cutoff=1)

Arguments

rpairs

The record pairs to be classified.

threshold.upper

A numeric value on the weight scale. Record pairs with weights greater than or equal to this threshold are classified as links.

threshold.lower

A numeric value on the weight scale, not greater than threshold.upper. Defaults to threshold.upper, in which case no possible links are reported.

m, u

Numeric vectors. m- and u-probabilities of matching variables, see Details.

withProgressBar

Logical. Whether to display a progress bar.

cutoff

Numeric value. Threshold for converting string comparison values to binary values.

...

Arguments passed to emClassify.
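
To illustrate the cutoff argument, here is a minimal sketch in base R of how string comparison values could be turned into binary agreement indicators. The helper binarize is hypothetical and not part of RecordLinkage, and the rule "a value at or above the cutoff counts as agreement" is an assumption about the conversion rather than something this page states.

```r
# Hypothetical helper, not part of RecordLinkage.
binarize <- function(sim, cutoff = 1) {
  sim <- as.matrix(sim)
  # Assumption: a similarity at or above the cutoff counts as agreement (1),
  # anything below it as disagreement (0).
  (sim >= cutoff) * 1
}

# Two record pairs compared on three fields (made-up similarity values).
sim <- rbind(c(1.00, 0.92, 0.40),
             c(0.85, 1.00, 1.00))

strict  <- binarize(sim)                # default cutoff = 1: only exact agreement
relaxed <- binarize(sim, cutoff = 0.9)  # near matches also count as agreement
```

With the default cutoff of 1, only comparison values equal to 1 count as agreement; lowering the cutoff lets near matches count as well.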

Details

These methods perform stochastic record linkage following the framework of Fellegi and Sunter (see References).

fsWeights calculates matching weights for the record pairs in rpairs from the specified m- and u-probabilities. Each of m and u can be either a single number or a numeric vector; all values must lie in the range [0, 1].
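
The weight calculation can be sketched in base R as follows. This is an illustration of the general Fellegi-Sunter weight for binary comparison patterns under conditional independence, not the package's own code. The helper fs_weight is hypothetical, the use of base-2 logarithms is an assumption, and the m- and u-values are made up.

```r
# Hypothetical helper, not part of RecordLinkage.
# patterns: matrix of 0/1 agreement indicators, one row per record pair.
# m, u:     single numbers or vectors with one entry per column of patterns.
fs_weight <- function(patterns, m, u) {
  patterns <- as.matrix(patterns)
  k <- ncol(patterns)
  m <- rep_len(m, k)
  u <- rep_len(u, k)
  agree    <- log2(m / u)              # contribution of an agreeing field
  disagree <- log2((1 - m) / (1 - u))  # contribution of a disagreeing field
  as.vector(patterns %*% agree + (1 - patterns) %*% disagree)
}

patterns <- rbind(c(1, 1, 1),   # full agreement
                  c(1, 0, 1),   # one field disagrees
                  c(0, 0, 0))   # full disagreement
w <- fs_weight(patterns, m = 0.95, u = c(0.01, 0.05, 0.1))
```

Agreement on a field with a small u-probability (rare chance agreement) adds a large positive amount to the weight, while disagreement on any field lowers it, so w decreases from the first row to the last.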

fsClassify performs classification based on the calculated weights. All record pairs with weights greater than or equal to threshold.upper are classified as links. Record pairs with weights smaller than threshold.upper and greater than or equal to threshold.lower are classified as possible links. All remaining record pairs are classified as non-links.
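
The threshold rule can be sketched in base R. The helper classify_by_threshold is hypothetical and not part of RecordLinkage; the labels N, P and L follow the classification table in the example output, and the weights are made up.

```r
# Hypothetical helper, not part of RecordLinkage.
classify_by_threshold <- function(w, threshold.upper,
                                  threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  pred <- rep("N", length(w))        # default: non-link
  pred[w >= threshold.lower] <- "P"  # at or above the lower threshold: possible link
  pred[w >= threshold.upper] <- "L"  # at or above the upper threshold: link
  factor(pred, levels = c("N", "P", "L"))
}

w <- c(-12.5, -1, 0, 3.2, 15)
res <- classify_by_threshold(w, threshold.upper = 3, threshold.lower = -1)
# res is N, P, P, L, L
```

When threshold.lower is left at its default, the middle band is empty and every pair is either a link or a non-link.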

The "RecLinkData" method is a shortcut for emClassify.

The "RLBigData" method checks whether weights are present in the underlying database. If so, classification is based on the existing weights. If not, weights are calculated on the fly during classification but not stored. The latter behaviour might be preferable when a very large dataset is to be classified and disk space is limited. A progress bar is displayed only if weights are calculated on the fly; by default it is also suppressed when output is diverted by sink (e.g. in a Sweave script).

For a general introduction to weight based record linkage, see the vignette "Weight-based deduplication".

Value

fsWeights returns a copy of the object with the calculated weights added. Note that "RLBigData" objects have some reference-style semantics; see clone for more information.

For the "RecLinkData" method, fsClassify returns an S3 object of class "RecLinkResult": a copy of rpairs with an additional element prediction, which stores the classification result.

For the "RLBigData" method, fsClassify returns an S4 object of class "RLResult".

Author(s)

Andreas Borg, Murat Sariyar

References

Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.

See Also

epiWeights

Examples

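# load the package and the example data
library(RecordLinkage)
data(RLdata500)

# generate record pairs, blocking on each of fields 1, 3, 5, 6 and 7 in turn
rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7),
  identity = identity.RLdata500)

# calculate weights with the default m- and u-probabilities
rpairs <- fsWeights(rpairs)

# classify with threshold.upper = 0 and show results
summary(fsClassify(rpairs, 0))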

Example output


Deduplication Data Set

500 records 
18643 record pairs 

50 matches
18593 non-matches
0 pairs with unknown status


Weight distribution:

[-24,-22] (-22,-20] (-20,-18] (-18,-16] (-16,-14] (-14,-12] (-12,-10]  (-10,-8] 
     9737      3626      4305         0         0       352       362       163 
  (-8,-6]   (-6,-4]   (-4,-2]    (-2,0]     (0,2]     (2,4]     (4,6]     (6,8] 
       29         0         5         8        10         0         0         0 
   (8,10]   (10,12]   (12,14]   (14,16]   (16,18]   (18,20] 
       33         7         3         0         2         1 

56 links detected 
0 possible links detected 
18587 non-links detected 

alpha error: 0.040000
beta error: 0.000430
accuracy: 0.999464


Classification table:

           classification
true status     N     P     L
      FALSE 18585     0     8
      TRUE      2     0    48
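
The error measures in the summary can be reproduced from the classification table. The sketch below uses base R only. The definitions are inferred from the numbers shown above rather than taken from the package source: alpha error as the share of true matches classified as non-links, beta error as the share of true non-matches classified as links.

```r
# Classification table copied from the example output above.
tab <- matrix(c(18585, 0, 8,
                    2, 0, 48),
              nrow = 2, byrow = TRUE,
              dimnames = list(true = c("FALSE", "TRUE"),
                              classification = c("N", "P", "L")))

alpha    <- tab["TRUE", "N"] / sum(tab["TRUE", ])    # 2 / 50 = 0.04
beta     <- tab["FALSE", "L"] / sum(tab["FALSE", ])  # 8 / 18593
accuracy <- (tab["FALSE", "N"] + tab["TRUE", "L"]) / sum(tab)  # 18633 / 18643
```

The column sums also match the summary: 8 + 48 = 56 links and 18585 + 2 = 18587 non-links.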

RecordLinkage documentation built on Aug. 25, 2020, 5:07 p.m.