Methods for stochastic record linkage following the framework of Fellegi and Sunter.
## S4 method for signature 'RecLinkData'
fsWeights(rpairs, m = 0.95, u = rpairs$frequencies, cutoff = 1)

## S4 method for signature 'RLBigData'
fsWeights(rpairs, m = 0.95, u = getFrequencies(rpairs), cutoff = 1,
  withProgressBar = (sink.number() == 0))

## S4 method for signature 'RecLinkData'
fsClassify(rpairs, ...)

## S4 method for signature 'RLBigData'
fsClassify(rpairs, threshold.upper, threshold.lower = threshold.upper,
  m = 0.95, u = getFrequencies(rpairs),
  withProgressBar = (sink.number() == 0), cutoff = 1)
rpairs: The record pairs to be classified.
threshold.upper: A numeric value between 0 and 1.
threshold.lower: A numeric value between 0 and 1, lower than threshold.upper.
m, u: Numeric vectors. m- and u-probabilities of matching variables, see Details.
withProgressBar: Logical. Whether to display a progress bar.
cutoff: Numeric value. Threshold for converting string comparison values to binary values.
...: Arguments passed to emClassify.
These methods perform stochastic record linkage following the framework of Fellegi and Sunter (see reference).
fsWeights calculates matching weights on an object based on the
specified m- and u-probabilities. Each of m and u can be a
numeric vector or a single number in the range [0, 1].
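The weight calculation can be illustrated in base R. The following is a minimal sketch, assuming the usual Fellegi-Sunter formulation in which each attribute contributes log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement; the helper fs_weight_sketch, the example patterns and the u values are hypothetical and are not part of the package, whose internal implementation may differ.

```r
# Hypothetical helper for illustration only; not part of the RecordLinkage API.
fs_weight_sketch <- function(patterns, m = 0.95, u, cutoff = 1) {
  patterns <- as.matrix(patterns)
  n_attr <- ncol(patterns)
  # Recycle a single number to one probability per attribute
  m <- rep_len(m, n_attr)
  u <- rep_len(u, n_attr)
  # Convert (possibly fuzzy) comparison values to binary agreement (1/0)
  agree <- (patterns >= cutoff) * 1
  # Assumption of this sketch: missing comparisons count as disagreement
  agree[is.na(agree)] <- 0
  w_agree <- log2(m / u)                  # contribution of an agreeing attribute
  w_disagree <- log2((1 - m) / (1 - u))   # contribution of a disagreeing attribute
  # Sum the per-attribute contributions for every record pair
  drop(agree %*% w_agree + (1 - agree) %*% w_disagree)
}

# Three made-up comparison patterns over three attributes
patterns <- rbind(c(1, 1, 1), c(1, 0, 1), c(0, 0, 0))
u <- c(0.1, 0.05, 0.2)
weights <- fs_weight_sketch(patterns, m = 0.95, u = u)
print(weights)
```

In this sketch a pair that agrees on every attribute receives the highest weight, and each disagreement lowers it, which is the ordering the classification step relies on.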
fsClassify performs classification based on the calculated weights.
All record pairs with weights greater than or equal to
threshold.upper are classified as links. Record pairs with
weights smaller than
threshold.upper and greater than or equal to
threshold.lower are classified as possible links. All remaining
record pairs are classified as non-links.
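The threshold rule can be written out in a few lines of base R. This is a minimal sketch of the rule as described here, not the package's own code; classify_sketch and the example weights are hypothetical, and the labels "L", "P" and "N" are used here for links, possible links and non-links.

```r
# Hypothetical helper for illustration only; not part of the RecordLinkage API.
classify_sketch <- function(weights, threshold.upper,
                            threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  # Start with every pair as a non-link
  prediction <- rep("N", length(weights))
  # Pairs at or above the lower threshold become possible links ...
  prediction[weights >= threshold.lower] <- "P"
  # ... and pairs at or above the upper threshold become links
  prediction[weights >= threshold.upper] <- "L"
  factor(prediction, levels = c("N", "P", "L"))
}

example_weights <- c(-5, 3, 12)
# Two thresholds: the middle pair falls between them
two_thresholds <- classify_sketch(example_weights, threshold.upper = 10,
                                  threshold.lower = 0)
# One threshold: no pair can be classified as a possible link
one_threshold <- classify_sketch(example_weights, threshold.upper = 0)
print(two_thresholds)
print(one_threshold)
```

Because threshold.lower defaults to threshold.upper, supplying only one threshold yields a pure link / non-link decision.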
The "RecLinkData" method is a shortcut for emClassify.
The "RLBigData" method checks if weights are
present in the underlying database. If this is the case, classification
is based on the existing weights. If not, weights are calculated on the fly
during classification, but not stored. The latter behaviour might be preferable
when a very large dataset is to be classified and disk space is limited.
A progress bar is displayed only if
weights are calculated on the fly and, by default, only when output is not
diverted by sink (e.g. in a Sweave script).
For a general introduction to weight-based record linkage, see the vignette "Weight-based deduplication".
fsWeights returns a copy of the object with the calculated weights
added. Note that "RLBigData" objects have some
reference-style semantics; see clone for more information.
For the "RecLinkData" method, fsClassify returns an S3 object of class
"RecLinkResult" that represents a copy of rpairs with the additional element
rpairs$prediction, which stores the classification result.
For the "RLBigData" method, fsClassify returns an S4 object.
Andreas Borg, Murat Sariyar
Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.
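A typical call sequence might look like the following sketch. It uses the RLdata500 example data and the compare.dedup function from the RecordLinkage package; the choice of blocking fields and the threshold of 0 are assumptions made for illustration, so the result is not guaranteed to match any output shown on this page. The code is skipped if the package is not installed.

```r
# Sketch only: blocking fields and threshold are illustrative assumptions.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)

  # Generate record pairs from the bundled example data
  data(RLdata500)
  rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7),
                          identity = identity.RLdata500)

  # Calculate weights with the default m and the stored frequencies as u
  rpairs <- fsWeights(rpairs)

  # Classify with a single threshold and show a summary of the result
  result <- fsClassify(rpairs, threshold.upper = 0)
  summary(result)
}
```

With a single threshold, no record pair is classified as a possible link; pass threshold.lower as well to obtain a band of possible links for clerical review.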
Deduplication Data Set

500 records
18643 record pairs

50 matches
18593 non-matches
0 pairs with unknown status

Weight distribution:

[-24,-22] (-22,-20] (-20,-18] (-18,-16] (-16,-14] (-14,-12] (-12,-10]  (-10,-8]
     9737      3626      4305         0         0       352       362       163
  (-8,-6]   (-6,-4]   (-4,-2]    (-2,0]     (0,2]     (2,4]     (4,6]     (6,8]
       29         0         5         8        10         0         0         0
   (8,10]   (10,12]   (12,14]   (14,16]   (16,18]   (18,20]
       33         7         3         0         2         1

56 links detected
0 possible links detected
18587 non-links detected

alpha error: 0.040000
beta error: 0.000430
accuracy: 0.999464

Classification table:

           classification
true status     N     P     L
      FALSE 18585     0     8
      TRUE      2     0    48