Scalelink: Record linkage via scaling algorithm


Description

Scalelink is an R package that performs 'probabilistic' linkage of two data files using a scaling procedure.

Details

As large data sets derived from administrative and other sources become more widely available, demand is growing for linking them successfully to provide rich sources of data for further analysis. Because the quality of the identifiers used for linkage varies, existing approaches are often based on 'probabilistic' models, which rely on a number of assumptions and can be computationally demanding.

This package implements the method proposed in Goldstein, H., Harron, K. and Cortina-Borja, M. (2017). That paper suggests a new approach to classifying record pairs in linkage, based on weights (scores) derived using a scaling algorithm. The proposed method does not rely on training data, is computationally fast, requires only moderate amounts of storage and has intuitive appeal.
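To make "classifying record pairs" concrete, the following base-R sketch shows the pairwise comparison that linkage scoring starts from: every record in one file is compared with every record in the other, identifier by identifier. It is an illustration only and does not implement the scaling algorithm. The data and the names `foi`, `ldf`, `pairs` and `n_agree` are made up for this example, and the unweighted agreement count stands in for the weights that the package derives.

```r
# Two tiny illustrative files sharing the same identifier columns (made-up data).
foi <- data.frame(Day = c(1, 15), Month = c(2, 7),
                  Year = c(1980, 1975), Sex = c(1, 2))
ldf <- data.frame(Day = c(15, 1, 3), Month = c(7, 2, 2),
                  Year = c(1975, 1981, 1980), Sex = c(2, 1, 1))

# Enumerate every (FOI record, LDF record) pair.
pairs <- expand.grid(foi_row = seq_len(nrow(foi)),
                     ldf_row = seq_len(nrow(ldf)))

# For each pair, record whether each identifier agrees (TRUE) or not (FALSE).
foi_m <- as.matrix(foi)[pairs$foi_row, , drop = FALSE]
ldf_m <- as.matrix(ldf)[pairs$ldf_row, , drop = FALSE]
agreement <- foi_m == ldf_m

# A crude, unweighted summary: the number of agreeing identifiers per pair.
# (Assumption: this count is NOT the Scalelink score, only a stand-in.)
pairs$n_agree <- rowSums(agreement)
print(pairs)
```

In this toy example the second FOI record agrees with the first LDF record on all four identifiers. A weighted score, such as the one this package estimates, would let some identifiers count for more than others instead of treating every agreement equally.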

References

Scalelink

Goldstein, H., Charlton, C.M.J. (2017) Scalelink: A Package to link data via scaling.

Paper

Goldstein, H., Harron, K. and Cortina-Borja, M. (2017). A scaling approach to record linkage. Statistics in Medicine. DOI: 10.1002/sim.7287

Maintainer

Chris Charlton c.charlton@bristol.ac.uk

Author(s)

Charlton, C.M.J. and Goldstein, H. (2017). Centre for Multilevel Modelling, University of Bristol.

Examples

library(Scalelink)

## Set the number of CPU cores to use (omit to use all available)
RcppParallel::setThreadOptions(numThreads = 2)

data(FOI, package = "Scalelink")
data(LDFCOMP, package = "Scalelink")

idcols <- c("Day", "Month", "Year", "Sex")
result <- calcScores(FOI[, idcols], LDFCOMP[, idcols])

print(result$scores)

## The Scalelink package provides two examples using synthetic data:
## one with complete data and one containing missing values

## Not run: 
## For a list of demo titles
demo(package = 'Scalelink')

## To run a demo
demo(Example1)

## Using your own data
## Suppose you have the following files in your working directory:
## FOI:
##   A space-delimited file of interest (NFOI x PFOI), where NFOI is the number of records
## IDENTIFIERS_FOI:
##   A space-delimited file containing a row vector of length PFOI, with a 1 marking each identifier column
## LDF:
##   A space-delimited linking data file (NLDF x PLDF), where NLDF is the number of records
## IDENTIFIERS_LDF:
##   A space-delimited file containing a row vector of length PLDF, with a 1 marking each identifier column

## Then you can calculate scores as follows:
FOI <- read.table("FOI")
LDF <- read.table("LDF")
IDENTIFIERS_FOI <- read.table("IDENTIFIERS_FOI")
IDENTIFIERS_LDF <- read.table("IDENTIFIERS_LDF")
result <- calcScores(FOI[, which(IDENTIFIERS_FOI == 1)],
                     LDF[, which(IDENTIFIERS_LDF == 1)],
                     missing.value = -9.999e+029)

## To view the scores:
print(round(result$scores, 2))

## To view the A* matrix:
print(result$astar)

## End(Not run)
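
The file layout described under "Using your own data" can be tried out without real data. The sketch below uses base R only: it writes a small made-up data file and its identifier indicator file to a temporary directory, reads them back, and selects the identifier columns. The column contents and the names `foi_demo`, `flags`, `id_cols` and `foi_ids` are hypothetical, and the files are assumed to have no header row, which matches the default behaviour of `read.table`.

```r
# Write small illustrative files to a temporary directory (made-up data).
tmp <- tempfile("scalelink_demo")
dir.create(tmp)
foi_demo <- data.frame(id = 1:3, Day = c(1, 15, 3),
                       Month = c(2, 7, 2), outcome = c(0, 1, 0))
write.table(foi_demo, file.path(tmp, "FOI"),
            row.names = FALSE, col.names = FALSE)

# Row vector with a 1 marking each identifier column (here: columns 2 and 3).
write.table(matrix(c(0, 1, 1, 0), nrow = 1), file.path(tmp, "IDENTIFIERS_FOI"),
            row.names = FALSE, col.names = FALSE)

# Read the files back and keep only the identifier columns.
FOI_in <- read.table(file.path(tmp, "FOI"))
flags <- read.table(file.path(tmp, "IDENTIFIERS_FOI"))
id_cols <- which(unlist(flags) == 1)
foi_ids <- FOI_in[, id_cols, drop = FALSE]
print(foi_ids)
```

The same steps applied to the linking data file would give the second argument for `calcScores`. Using `drop = FALSE` keeps the result a data frame even when only one column is flagged as an identifier.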

Scalelink documentation built on May 2, 2019, 8:29 a.m.