View source: R/create_comparison_data.R
create_comparison_data | R Documentation |
Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.
create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)
records |
A |
types |
A |
breaks |
A |
file_sizes |
A |
duplicates |
A |
verbose |
A |
The purpose of this function is to construct comparison vectors for each pair
of records. In order to construct these vectors, one needs to specify the
types and breaks arguments. The types argument specifies how each field should
be compared, and the breaks argument specifies how to discretize these
comparisons.
Currently, the types argument supports three types of field comparisons:
binary, absolute difference, and the normalized Levenshtein distance. Please
contact the package maintainer if you need a new type of comparison to be
supported.
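To make the three comparison types concrete, here is a minimal base R sketch of what each raw comparison might look like for a single pair of values. The helper names (compare_binary, compare_numeric, compare_string) are hypothetical and not part of the package. Coding binary disagreement as 1 and normalizing the Levenshtein distance by the length of the longer string are assumptions made for illustration; the package may define these details differently.

```r
# Hypothetical helpers for illustration only; not the package's internals.

# Binary comparison: 0 if the two values agree, 1 if they disagree (assumed coding).
compare_binary <- function(a, b) as.numeric(a != b)

# Absolute difference for numeric fields; the result lies in [0, Inf).
compare_numeric <- function(a, b) abs(a - b)

# Levenshtein distance scaled to [0, 1] by the longer string's length
# (this normalization is an assumption).
compare_string <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(0)  # two empty strings are treated as identical
  as.numeric(utils::adist(a, b)) / longest
}

compare_binary("F", "M")
compare_numeric(1980, 1983)
compare_string("smith", "smyth")
```

These raw values are what the breaks argument then discretizes into disagreement levels.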
The breaks argument should be a list, with one element for each field. If a
field is being compared with a binary comparison, i.e. types[f] = "bi", then
the corresponding element of breaks should be NA, i.e. breaks[[f]] = NA. If a
field is being compared with a numeric or string comparison, then the
corresponding element of breaks should be a vector of cut points used to
discretize the comparisons. In more detail, suppose you pass in cut points
breaks[[f]] = c(cut_1, ..., cut_L). These cut points discretize the range of
the comparisons into L + 1 intervals:
I_0 = (-\infty, cut_1], I_1 = (cut_1, cut_2], ..., I_L = (cut_L, \infty).
The raw comparisons, which lie in [0, \infty) for numeric comparisons and
[0, 1] for string comparisons, are then replaced with indicators of which
interval the comparisons lie in. The interval I_0 corresponds to the lowest
level of disagreement for a comparison, while the interval I_L corresponds to
the highest level of disagreement for a comparison.
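The discretization described above can be sketched in base R with cut(), whose default right-closed intervals match the (cut_l, cut_{l+1}] convention. This is only an illustration on made-up raw comparison values, not the package's own implementation; the variable names are chosen for this example.

```r
# Cut points as used in the examples for string fields.
cuts <- c(0, 0.25, 0.5)

# Made-up raw normalized Levenshtein comparisons for five record pairs.
raw <- c(0, 0.2, 0.25, 0.4, 0.9)

# Pad the cut points with -Inf and Inf to obtain the L + 1 intervals
# I_0, ..., I_L; cut() returns the index of the interval (1 to L + 1).
level <- cut(raw, breaks = c(-Inf, cuts, Inf), labels = FALSE)

# One TRUE/FALSE indicator column per interval, one row per record pair.
indicators <- outer(level, seq_len(length(cuts) + 1), "==")
indicators
```

Each row of indicators has exactly one TRUE. With these cut points, only an exact match (a raw comparison of 0) falls in the lowest disagreement level.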
A list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in
the corresponding row of comparisons. The rows are sorted in ascending order
according to the first column, with ties broken according to the second column
in ascending order. For any given row, the first column is less than the
second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
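The ordering of record_pairs, and the exclusion of pairs within files that are assumed to have no duplicates, can be sketched in base R as follows. The toy file_sizes and duplicates values are made up, and the construction is an illustration rather than the package's own code.

```r
# Toy setup: file 1 has 2 records and no duplicates, file 2 has 3 records
# and may contain duplicates.
file_sizes <- c(2, 3)
duplicates <- c(0, 1)

# File membership of each record, with the files stacked in order.
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# All pairs (i, j) with i < j, sorted by the first then the second column.
all_pairs <- t(combn(sum(file_sizes), 2))

# Drop pairs where both records sit in the same no-duplicate file.
first_file <- file_labels[all_pairs[, 1]]
second_file <- file_labels[all_pairs[, 2]]
excluded <- first_file == second_file & duplicates[first_file] == 0
record_pairs <- as.data.frame(all_pairs[!excluded, , drop = FALSE])
record_pairs
```

Of the 10 possible pairs among 5 records, only the pair (1, 2) is dropped here, because both records belong to file 1.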
comparisons
A logical matrix, where each row contains the comparisons for the record pair
in the corresponding row of record_pairs. Comparisons are in the same order as
the columns of records, and each field is represented by L + 1 columns of
TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for
the field based on breaks.
K
The number of files, assumed to be of class numeric.
file_sizes
A numeric vector of length K, indicating the size of each file.
duplicates
A numeric vector of length K, indicating which files are assumed to have
duplicates. duplicates[k] should be 1 if file k has duplicates, and
duplicates[k] should be 0 if file k has no duplicates. If any files do not
have duplicates, we strongly recommend that the largest such file is organized
to be the first file.
field_levels
A numeric vector indicating the number of disagreement levels for each field.
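As a rough illustration, the number of disagreement levels per field can be derived from types and breaks. The sketch below uses the types and breaks from the examples on this page. It assumes that a binary field has two levels (agree or disagree) and that every other field has one more level than it has cut points; it is not the package's own code.

```r
types <- c("bi", "lv", "lv", "lv", "lv", "bi", "bi")
breaks <- list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA)

# Binary fields: 2 levels (assumed). Other fields: L cut points give L + 1 levels.
field_levels <- ifelse(types == "bi", 2, lengths(breaks) + 1)
field_levels

# Under these assumptions, the comparisons matrix has one column per
# field-level combination.
sum(field_levels)
```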
file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates
which file record i is in.
fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair
(k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp
A logical matrix that indicates which record pairs belong to which file pairs.
rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file
pair fp, and is FALSE otherwise. Note that fp is given by the labeling in
fp_matrix.
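The relationship between fp_matrix and rp_to_fp can be sketched in base R as follows. The particular numbering of file pairs used here (filling the upper triangle column by column) is an assumption made for illustration and may differ from the labeling the package produces. For simplicity, all files are treated as possibly containing duplicates, so every record pair is kept.

```r
K <- 3
file_sizes <- c(2, 1, 2)  # made-up file sizes

# Symmetric matrix of file-pair labels 1, ..., K * (K + 1) / 2 (assumed ordering).
fp_matrix <- matrix(0L, K, K)
fp_matrix[upper.tri(fp_matrix, diag = TRUE)] <- seq_len(K * (K + 1) / 2)
fp_matrix[lower.tri(fp_matrix)] <- t(fp_matrix)[lower.tri(fp_matrix)]

# File membership of each record, and all record pairs (i, j) with i < j.
file_labels <- rep(seq_len(K), times = file_sizes)
record_pairs <- t(combn(sum(file_sizes), 2))

# File-pair label of each record pair, looked up from the files of its records.
pair_fp <- fp_matrix[cbind(file_labels[record_pairs[, 1]],
                           file_labels[record_pairs[, 2]])]

# rp_to_fp[fp, rp] is TRUE when record pair rp belongs to file pair fp.
rp_to_fp <- outer(seq_len(max(fp_matrix)), pair_fp, "==")
dim(rp_to_fp)
```

Every column of rp_to_fp contains exactly one TRUE, since each record pair belongs to exactly one file pair.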
ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that
indicates how many record pairs there are with a given disagreement level for
a given field, for each file pair.
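The counts stored in ab can be illustrated on a toy example in base R. The inputs below are made up, and the layout of the resulting vector (all columns of comparisons for file pair 1, then for file pair 2, and so on) is an assumption; the package may order the entries differently.

```r
# Toy comparisons for 4 record pairs: a 2-level field followed by a 3-level field.
comparisons <- rbind(
  c(TRUE,  FALSE, TRUE,  FALSE, FALSE),
  c(FALSE, TRUE,  FALSE, TRUE,  FALSE),
  c(TRUE,  FALSE, FALSE, FALSE, TRUE),
  c(TRUE,  FALSE, TRUE,  FALSE, FALSE)
)

# Hypothetical file-pair label of each record pair, with K = 2 files,
# so there are K * (K + 1) / 2 = 3 file pairs.
pair_fp <- c(1, 1, 2, 3)
n_fp <- 3
rp_to_fp <- outer(seq_len(n_fp), pair_fp, "==")

# counts[fp, j] = number of record pairs in file pair fp with indicator j TRUE.
counts <- (rp_to_fp * 1) %*% (comparisons * 1)

# Flatten file pair by file pair (assumed layout).
ab <- as.integer(t(counts))
ab
```

The length of ab is ncol(comparisons) times the number of file pairs, matching the formula above.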
file_sizes_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data
is used.
ab_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data
is used.
labels
NA. This element is not NA when reduce_comparison_data is used.
pairs_to_keep
NA. This element is not NA when reduce_comparison_data is used.
cc
0. This element is non-zero when reduce_comparison_data is used.
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 [arXiv]
## Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))