RLBigDataDedup | R Documentation |
These are constructors which initialize a record linkage setup for
big datasets, either deduplication of one dataset (RLBigDataDedup)
or linkage of two datasets (RLBigDataLinkage).
RLBigDataDedup(dataset, identity = NA, blockfld = list(),
  exclude = numeric(0), strcmp = numeric(0), strcmpfun = "jarowinkler",
  phonetic = numeric(0), phonfun = "soundex")

RLBigDataLinkage(dataset1, dataset2, identity1 = NA, identity2 = NA,
  blockfld = list(), exclude = numeric(0), strcmp = numeric(0),
  strcmpfun = "jarowinkler", phonetic = numeric(0), phonfun = "soundex")
dataset, dataset1, dataset2 |
Table of records to be deduplicated or linked. Either a data frame or a matrix. |
identity, identity1, identity2 |
Optional vectors (converted to factors) for identifying true matches and non-matches. In a deduplication process, two records |
blockfld |
Blocking field definition. A numeric or character vector or a list of several such vectors, corresponding to column numbers or names. See details and examples. |
exclude |
Columns to be excluded. A numeric or character vector corresponding to columns of dataset, or of dataset1 and dataset2, which should be excluded from comparison. |
strcmp |
Determines usage of string comparison. If |
strcmpfun |
Character string representing the string comparison function. Possible values are "jarowinkler" and "levenshtein". |
phonetic |
Determines usage of phonetic code. Used in the same manner as strcmp. |
phonfun |
Character string representing the phonetic function. Currently, only "soundex" is supported. |
These functions act as constructors for the S4 classes
"RLBigDataDedup" and "RLBigDataLinkage".
They make up the initial stage in a record linkage process using
large data sets (>= 1,000,000 record pairs), possibly after
normalizing the data. Two general scenarios are reflected by the two
functions: RLBigDataDedup works on a single data set which is to be
deduplicated, while RLBigDataLinkage is intended for linking two data
sets together. Their usage follows the functions compare.dedup and
compare.linkage, which are recommended for smaller amounts of data,
e.g. training sets.
Datasets are represented as data frames or matrices (typically of type
character), each row representing one record, each column representing one
attribute (like first name, date of birth, ...). Row names are not
retained in the record pairs. If an identifier other than row number is
needed, it should be supplied as a designated column and excluded from
comparison (see note on exclude below).
In case of RLBigDataLinkage, the two datasets must have the same number
of columns and it is assumed that their column classes and semantics match.
If present, the column names of dataset1 are assigned to dataset2
in order to enforce a matching format. Therefore, column names used in
blockfld or other arguments refer to dataset1.
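The renaming step described above can be pictured in base R as follows. This is only a sketch with made-up data and column names; it is not taken from the package source.

```r
# Two toy tables with the same number of columns but different column names.
# All names and values here are invented for illustration.
dataset1 <- data.frame(fname = c("ANNA", "PETER"), by = c(1970, 1980),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(first_name = c("ANNA", "PETRA"), year = c(1970, 1981),
                       stringsAsFactors = FALSE)

# Both tables must have the same number of columns.
stopifnot(ncol(dataset1) == ncol(dataset2))

# Column names of dataset1 are imposed on dataset2, so a name such as
# "fname" in blockfld refers to the same column position in both tables.
if (!is.null(colnames(dataset1))) {
  colnames(dataset2) <- colnames(dataset1)
}

# Position of the blocking column, now resolvable in dataset2 as well.
block_col <- match("fname", colnames(dataset2))
```

Because names are taken from dataset1 only, a column that is named differently in dataset2 has to be addressed by its dataset1 name or by its index.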
Each element of blockfld specifies a set of columns in which two
records must agree to be included in the output. Each blocking definition in
the list is applied individually, and the resulting sets of record pairs
are combined by a union operation.
If blockfld is FALSE, no blocking will be performed,
which leads to a large number of record pairs
(n*(n-1)/2 where n is the number of records).
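The blocking rule can be illustrated in base R without the package. The following is a minimal sketch and not the package's implementation: the helper names block_pairs and blocked_pairs, as well as the toy data, are assumptions made for this example.

```r
# Toy records; column names and values are made up for this illustration.
records <- data.frame(
  fname = c("ANNA", "ANNA", "PETER", "PETRA", "ANNA"),
  lname = c("MEIER", "MAIER", "MEIER", "MEIER", "SCHMIDT"),
  by    = c(1970, 1970, 1980, 1980, 1970),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered pairs (i < j) of rows that agree
# on every column listed in `cols`.
block_pairs <- function(data, cols) {
  idx <- t(combn(nrow(data), 2))            # all n * (n - 1) / 2 pairs
  agree <- rep(TRUE, nrow(idx))
  for (col in cols) {
    agree <- agree & data[idx[, 1], col] == data[idx[, 2], col]
  }
  idx[agree, , drop = FALSE]
}

# Hypothetical helper: apply each blocking definition on its own,
# then combine the resulting pair sets by a union.
blocked_pairs <- function(data, blockfld) {
  pairs <- do.call(rbind, lapply(blockfld, function(cols) block_pairs(data, cols)))
  pairs <- unique(pairs)
  pairs <- pairs[order(pairs[, 1], pairs[, 2]), , drop = FALSE]
  colnames(pairs) <- c("id1", "id2")
  pairs
}

# Number of pairs without blocking (deduplication of one data set).
n_all <- nrow(records) * (nrow(records) - 1) / 2

# Pairs that agree on last name, or on both first name and year of birth.
pairs <- blocked_pairs(records, list("lname", c("fname", "by")))
```

In this toy case the two blocking definitions together keep 6 of the 10 possible pairs. Enumerating all pairs first is only workable for tiny inputs, which is the reason a database-backed approach is used for large data.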
Fields can be excluded from the linkage process by supplying their column
index in the vector exclude, which is especially useful for
external identifiers. Excluded fields can still be used for
blocking, also with phonetic code.
Phonetic codes and string similarity measures are supported for enhanced detection of misspellings. Applying a phonetic code leads to binary similarity values, where 1 denotes equality of the generated phonetic code. A string comparator leads to a similarity value in the range [0,1]. Using string comparison on a field for which a phonetic code is generated is possible, but issues a warning.
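The difference between the two kinds of similarity values can be sketched in base R. This is an illustration under stated assumptions, not the code used by the package: toy_key is a crude, made-up stand-in for a real phonetic code such as Soundex, and lev_sim uses one common normalization of the Levenshtein distance (1 minus distance divided by the longer string length).

```r
# Hypothetical stand-in for a phonetic code (NOT Soundex): keep the first
# letter, drop vowels and H/W/Y from the rest, collapse repeated letters.
toy_key <- function(x) {
  x <- toupper(x)
  first <- substr(x, 1, 1)
  rest <- gsub("[AEIOUYHW]", "", substr(x, 2, nchar(x)))
  rest <- gsub("(.)\\1+", "\\1", rest, perl = TRUE)
  paste0(first, rest)
}

# Binary similarity: 1 if the phonetic keys are equal, 0 otherwise.
phonetic_sim <- function(a, b) as.numeric(toy_key(a) == toy_key(b))

# String similarity in [0, 1], based on the Levenshtein distance
# computed by utils::adist (assumed normalization, see lead-in).
lev_sim <- function(a, b) {
  1 - drop(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

p_same <- phonetic_sim("MEIER", "MAIER")   # keys agree
p_diff <- phonetic_sim("MEIER", "MILLER")  # keys differ
s_near <- lev_sim("MEIER", "MAIER")        # one substitution in five letters
s_same <- lev_sim("MEIER", "MEIER")        # identical strings
```

A phonetic code can only report agreement or disagreement, whereas a string comparator grades how close two spellings are.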
In contrast to the compare.* functions, phonetic coding and string
comparison are not carried out in R, but by database functions. Supported
functions are "soundex" for phonetic coding and "jarowinkler" and
"levenshtein" for string comparison. See the documentation for their
R equivalents (phonetic functions, string comparison) for further information.
An object of class "RLBigDataDedup" or "RLBigDataLinkage",
depending on the called function.
The RSQLite database driver is initialized via dbDriver("SQLite"),
and a connection is established and stored in the returned object. Extension
functions for phonetic code and string comparison are loaded into the database.
The records in dataset, or in dataset1 and dataset2, are stored in tables
"data", or "data1" and "data2", respectively, and
indices are created on all columns involved in blocking.
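The kind of database setup described here can be sketched with DBI and RSQLite directly. This is an approximation under assumptions, not the package's actual code: it assumes both packages are installed, it connects with dbConnect(RSQLite::SQLite(), ...) rather than dbDriver, and the index name idx_data_fname, the toy data, and the self-join query are made up for illustration. Only the table name "data" is taken from the text above.

```r
library(DBI)

# Toy records; column names and values are invented for this example.
records <- data.frame(
  fname = c("ANNA", "ANNA", "PETER"),
  lname = c("MEIER", "MAIER", "MEIER"),
  stringsAsFactors = FALSE
)

# In-memory SQLite database holding the records in a table called "data".
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "data", records)

# Index on the column used for blocking (index name is hypothetical).
dbExecute(con, "CREATE INDEX idx_data_fname ON data (fname)")

# Candidate pairs that agree on the blocking column, each pair listed once.
pairs <- dbGetQuery(con, "
  SELECT t1.rowid AS id1, t2.rowid AS id2
  FROM data t1 JOIN data t2
    ON t1.fname = t2.fname AND t1.rowid < t2.rowid
")

dbDisconnect(con)
```

The index lets the database resolve the equality condition of the join without scanning every pair of rows, which is what makes blocking on large tables practical.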
Andreas Borg, Murat Sariyar
"RLBigDataDedup", "RLBigDataLinkage",
compare.dedup, compare.linkage,
the vignette "Classes for record linkage of big data sets".
data(RLdata500)
data(RLdata10000)

# deduplication without blocking, use string comparator on names
rpairs <- RLBigDataDedup(RLdata500, strcmp = 1:4)

# linkage with blocking on first name and year of birth, use phonetic
# code on first components of first and last name
rpairs <- RLBigDataLinkage(RLdata500, RLdata10000, blockfld = c(1, 7),
  phonetic = c(1, 3))

# deduplication with blocking on either last name or complete date of birth,
# use string comparator on all fields, include identity information
rpairs <- RLBigDataDedup(RLdata500, identity = identity.RLdata500,
  strcmp = TRUE, blockfld = list(1, c(5, 6, 7)))