dup_data | R Documentation |
A dataset containing 867 simulated records from 3 files with no duplicate records in each file.
dup_data
A list with three elements:
A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
The size of each file.
The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
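The layout described above can be sketched with a small stand-in object. This is a toy illustration, not the packaged data: the element name `records` and the field names `name` and `zip` are assumptions made for the example, while `file_sizes` and `IDs` follow the names used on this page. It also assumes the records are stacked in file order, so that the first `file_sizes[1]` rows belong to the first file.

```r
# Toy stand-in with the same three-element layout (hypothetical values).
toy <- list(
  records = data.frame(                    # element name `records` is assumed
    name = c("ann", "bob", "anne", "cy", "bob"),
    zip  = c("1", "2", "1", "3", "2"),
    stringsAsFactors = FALSE
  ),
  file_sizes = c(2L, 2L, 1L),              # records per file
  IDs = c(10L, 7L, 10L, 3L, 7L)            # arbitrary labels; equal label = same entity
)

# One partition label per record across all files.
n_records <- sum(toy$file_sizes)
stopifnot(nrow(toy$records) == n_records, length(toy$IDs) == n_records)

# File of origin for each record, assuming rows are stacked in file order.
file_index <- rep(seq_along(toy$file_sizes), times = toy$file_sizes)

# "No duplicates within a file" means no label repeats inside any one file.
# anyDuplicated() returns 0 when a vector has no repeated values.
no_within_file_dups <- all(tapply(toy$IDs, file_index, anyDuplicated) == 0)
```

The same checks could be applied to the real object after `data(dup_data)`, provided the assumed element name and row ordering hold.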
Extracted from the datasets used in the simulation study of Aleshin-Guendel & Sadinle (2022). The datasets were generated using code from Peter Christen's group (https://dmm.anu.edu.au/geco/index.php).
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 [arXiv]
data(dup_data)
# There are 500 entities represented in the records
length(unique(dup_data$IDs))
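Beyond counting entities, the partition labels can also be tabulated to see how many records each entity has. The following is a minimal sketch on a hypothetical label vector `ids`, used here in place of `dup_data$IDs` so that it runs without the package installed.

```r
# Hypothetical partition labels standing in for dup_data$IDs.
ids <- c(10L, 7L, 10L, 3L, 7L)

# Number of distinct entities (same idea as the example above).
n_entities <- length(unique(ids))

# Records per entity, then how many entities have each cluster size.
cluster_sizes <- table(ids)
size_distribution <- table(cluster_sizes)
```

Here `size_distribution` is indexed by cluster size, so an entry named "2" counts the entities that appear in exactly two records.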