compare: Compare Records

compare R Documentation

Compare Records

Description

Builds comparison patterns of record pairs for deduplication or linkage.

Usage

compare.dedup (dataset, blockfld = FALSE, phonetic = FALSE, 
  phonfun = soundex, strcmp = FALSE, strcmpfun = jarowinkler, exclude = FALSE,
  identity = NA, n_match = NA, n_non_match = NA)

compare.linkage (dataset1, dataset2, blockfld = FALSE, 
  phonetic = FALSE, phonfun = soundex, strcmp = FALSE, 
  strcmpfun = jarowinkler, exclude = FALSE, identity1 = NA, identity2 = NA,
  n_match = NA, n_non_match = NA)

Arguments

dataset

Table of records to be deduplicated. Either a data frame or a matrix.

dataset1, dataset2

Two data sets to be linked.

blockfld

Blocking field definition. Either a list of integer or character vectors, each giving the columns of one blocking definition, or FALSE to disable blocking. See details and examples.

phonetic

Determines usage of a phonetic code. If FALSE, no phonetic code will be used; if TRUE, the phonetic code will be used for all columns; if a numeric or character vector is given, the phonetic code will be used for the specified columns.

phonfun

Function for phonetic code. See details.

strcmp

Determines usage of a string metric. Takes the same values as phonetic.

strcmpfun

User-defined function for string metric. See details.

exclude

Columns to be excluded. A numeric or character vector specifying the columns which should be excluded from comparison.

identity, identity1, identity2

Optional numerical vectors for identifying matches and non-matches. In a deduplication process, two records dataset[i,] and dataset[j,] are a true match if and only if identity[i]==identity[j]. In a linkage process, two records dataset1[i,] and dataset2[j,] are a true match if and only if identity1[i]==identity2[j].

n_match, n_non_match

Number of desired matches and non-matches in the result.

Details

These functions build record pairs and then comparison patterns, by which the pairs are later classified as links or non-links. They make up the initial stage of a record linkage process, after any normalization of the data. The two functions reflect two general scenarios: compare.dedup works on a single data set which is to be deduplicated, while compare.linkage is intended for linking two data sets together.

Data sets are represented as data frames or matrices (typically of type character), each row representing one record, each column representing one field or attribute (like first name, date of birth...). Row names are not retained in the record pairs. If an identifier other than row number is needed, it should be supplied as a designated column and excluded from comparison (see note on exclude below).

Each element of blockfld specifies a set of columns in which two records must agree to be included in the output. Each blocking definition in the list is applied individually, and the resulting sets of pairs are combined by a union operation. If blockfld is FALSE, no blocking is performed, which leads to a large number of record pairs (n*(n-1)/2 for deduplication, where n is the number of records; n1*n2 for linkage of data sets with n1 and n2 records).
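The union semantics of blockfld can be illustrated in base R. The following is a minimal sketch for the deduplication case, not the package's implementation; the helper block_pairs and the toy records are hypothetical.

```r
# Toy records as a character matrix (hypothetical values).
records <- cbind(
  fname = c("ANNA", "ANNA", "BERT", "BERT", "CARL"),
  lname = c("MAYER", "MEYER", "KOCH", "KOCH", "KOCH"),
  by    = c("1970", "1970", "1980", "1981", "1980")
)

# Hypothetical helper: keep the pairs (i < j) that agree on all columns
# of at least one blocking definition.
block_pairs <- function(data, blockfld) {
  all_pairs <- t(combn(nrow(data), 2))           # every unordered pair
  keep <- rep(FALSE, nrow(all_pairs))
  for (cols in blockfld) {
    left  <- data[all_pairs[, 1], cols, drop = FALSE]
    right <- data[all_pairs[, 2], cols, drop = FALSE]
    agree <- rowSums(left == right) == length(cols)
    keep  <- keep | agree                        # union over definitions
  }
  all_pairs[keep, , drop = FALSE]
}

n <- nrow(records)
n_all <- n * (n - 1) / 2                         # pairs without blocking

# Block on first name, or on last name together with year of birth.
blocked <- block_pairs(records, list(1, c(2, 3)))
```

In this toy example, blocking keeps only the pairs that share a first name or share both last name and year of birth, instead of all n_all pairs.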

As an alternative to blocking, a fixed number of n_match matches and n_non_match non-matches can be drawn if identity (for deduplication) or identity1 and identity2 (for linkage) are supplied. This is relevant for generating training sets for the supervised classifiers (see trainSupv).
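How an identity vector determines the match status of a pair, and how a fixed number of matches and non-matches might be drawn from it, can be sketched in base R. This is an illustration under assumptions, not the package's sampling procedure; the helper draw_pairs and the identity values are hypothetical.

```r
set.seed(1)  # reproducible sampling for this sketch

# Hypothetical identity vector: records 1 and 4 describe the same entity,
# as do records 2 and 5.
identity <- c(10, 20, 30, 10, 20, 40)

pairs <- t(combn(length(identity), 2))           # all unordered pairs
is_match <- identity[pairs[, 1]] == identity[pairs[, 2]]

# Hypothetical helper: draw a fixed number of matches and non-matches
# without replacement.
draw_pairs <- function(pairs, is_match, n_match, n_non_match) {
  m  <- which(is_match)
  nm <- which(!is_match)
  idx <- c(m[sample.int(length(m), n_match)],
           nm[sample.int(length(nm), n_non_match)])
  cbind(pairs[idx, , drop = FALSE], is_match = as.integer(is_match[idx]))
}

training <- draw_pairs(pairs, is_match, n_match = 2, n_non_match = 5)
```

The result holds one row per drawn pair, with the record indices in the first two columns and the known match status in the third.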

Fields can be excluded from the linkage process by supplying their column index in the vector exclude, which is especially useful for external identifiers. Excluded fields can still be used for blocking, including blocking on their phonetic code.

Phonetic codes and string similarity measures are supported for enhanced detection of misspellings. Applying a phonetic code yields a binary value, where 1 denotes equality of the generated phonetic codes. A string comparator yields a similarity value in the range [0,1].

String comparison is not allowed on a field for which a phonetic code is generated. For phonetic encoding functions included in the package, see phonetics. For the included string comparators, see jarowinkler and levenshteinSim.

Please note that phonetic code and string metrics can slow down the generation of comparison patterns significantly.

User-defined functions for phonetic code and string comparison can be supplied via the arguments phonfun and strcmpfun. phonfun is expected to have a single character argument (the string to be transformed) and must return a character value with the encoded string.

strcmpfun must have as arguments the two strings to be compared and return a similarity value in the range [0,1], with 0 denoting the lowest and 1 denoting the highest degree of similarity. Both functions must be fully vectorized to work on matrices.
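A pair of user-defined functions meeting these requirements might look as follows. This is a minimal sketch in base R: my_phonfun and my_strcmpfun are hypothetical names, the phonetic code is a toy scheme (upper case, vowels dropped after the first letter) and the string metric is a normalized Levenshtein similarity based on utils::adist. Missing values are not handled.

```r
# Hypothetical phonetic function: one character argument, returns the
# encoded strings with the same shape (vector or matrix) as the input.
my_phonfun <- function(str) {
  s <- toupper(str)
  code <- paste0(substr(s, 1, 1), gsub("[AEIOU]", "", substring(s, 2)))
  dim(code) <- dim(str)                # paste0 drops dims, restore them
  code
}

# Hypothetical string comparator: element-wise similarity in [0,1],
# 1 for identical strings, 0 for maximally different ones.
my_strcmpfun <- function(str1, str2) {
  d <- mapply(function(a, b) drop(utils::adist(a, b)),
              str1, str2, USE.NAMES = FALSE)
  len <- pmax(nchar(str1), nchar(str2))
  sim <- ifelse(len == 0, 1, 1 - d / len)   # two empty strings count as equal
  dim(sim) <- dim(str1)                # keep matrix shape
  sim
}

codes <- my_phonfun(c("Meyer", "Mayer", "Koch"))
sims  <- my_strcmpfun(c("MEYER", "KOCH"), c("MAYER", "KOCH"))
```

Such functions would then be supplied as phonfun = my_phonfun and strcmpfun = my_strcmpfun. The element-wise mapply call keeps the sketch short but is slow on large inputs.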

Value

An object of class RecLinkData with the following components:

data

Copy of the records, converted to a data frame.

pairs

Generated comparison patterns.

frequencies

For each column included in pairs, the average frequency of values (reciprocal of number of distinct values).
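The definition of frequencies given above (reciprocal of the number of distinct values per column) can be reproduced in base R. This is a sketch on hypothetical toy data; how the package treats missing values is not covered here.

```r
# Toy records (hypothetical values).
records <- data.frame(
  fname = c("ANNA", "ANNA", "BERT", "CARL"),
  by    = c("1970", "1970", "1980", "1980"),
  stringsAsFactors = FALSE
)

# Average frequency of values per column: 1 / number of distinct values.
avg_freq <- sapply(records, function(col) 1 / length(unique(col)))
```

Here fname has three distinct values and by has two, so a column with fewer distinct values receives a higher average frequency.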

Author(s)

Andreas Borg, Murat Sariyar

See Also

RecLinkData for the format of returned objects.

Examples

data(RLdata500)
data(RLdata10000)

# deduplication without blocking, use string comparator on names
## Not run: rpairs=compare.dedup(RLdata500,strcmp=1:4)

# linkage with blocking on first component of first name (column 1) and
# day of birth (column 7), use phonetic code on first components of
# first and last name
## Not run: rpairs=compare.linkage(RLdata500,RLdata10000,blockfld=c(1,7),phonetic=c(1,3))

# deduplication with blocking on either first component of first name
# (column 1) or complete date of birth (columns 5 to 7), use string
# comparator on all fields, include identity information
## Not run: rpairs=compare.dedup(RLdata500, identity=identity.RLdata500, strcmp=TRUE,
  blockfld=list(1,c(5,6,7)))
## End(Not run)

# Draw 100 matches and 10000 non-matches
## Not run: rpairs=compare.dedup(RLdata10000,identity=identity.RLdata10000,n_match=100,
  n_non_match=10000)
## End(Not run)

RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.