Duplicates are found in c14bazAAR::mark_duplicates() by comparing labnrs: only dates with exactly equal labnrs are considered duplicates. Duplicate groups are numbered (starting from 0), and these numbers are linked to the individual dates in the new column duplicate_group.
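The grouping rule can be illustrated with a minimal base-R sketch. This is not the package's implementation: the vector labnr is made-up toy data, and assigning NA to dates without a duplicate is an assumption made for this illustration.

```r
# Toy lab numbers (hypothetical data, not taken from any database)
labnr <- c("lab-1", "lab-1", "lab-1", "lab-2", "lab-2", "lab-3")

# Lab numbers that occur more than once, in order of first appearance
dup_labnrs <- unique(labnr[duplicated(labnr)])

# match() returns 1-based positions; subtracting 1 numbers the groups from 0.
# Assumption: dates without any duplicate receive NA.
duplicate_group <- match(labnr, dup_labnrs) - 1L

duplicate_group
```

Only exact string equality is used here, so "lab-1" and "Lab-1" would end up in different groups.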
While c14bazAAR::mark_duplicates() only finds duplicates, c14bazAAR::remove_duplicates() removes them with one of three strategies, selected by the values of the arguments preferences and supermerge:
Option 1: By merging all dates in a duplicate_group. All variables that are not equal within the duplicate group are set to NA. This is the default option.
Option 2: By selecting individual database entries in a duplicate_group according to a trust hierarchy defined by the parameter preferences. If there are duplicates within one database, the first occurrence in the table (top down) is selected. All databases not mentioned in preferences are dropped.
Option 3: Like Option 2, but the different datasets in a duplicate_group are merged column by column to create a superdataset with a maximum of information. The column sourcedb is dropped in this case to indicate that multiple databases have been merged. Data citation is a lot more difficult with this option. It is activated with supermerge = TRUE.
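The three strategies can be sketched in base R on a single duplicate group. This is a rough illustration under stated assumptions, not the code c14bazAAR runs: the helpers agree_or_na() and first_non_na() are hypothetical, NA is treated as a value that conflicts with any non-NA value in Option 1, and Option 3 is assumed to take the first non-NA value in preference order.

```r
# One duplicate group (same labnr), mirroring the toy data used in the examples
group <- data.frame(
  sourcedb = c("A", "B"),
  labnr    = c("lab-2", "lab-2"),
  c14age   = c(NA, 2200),
  c14std   = c(10, 20),
  stringsAsFactors = FALSE
)
preferences <- c("A", "B")

# Option 1: keep a value only if all entries agree, otherwise NA.
# Assumption: NA counts as a conflicting value.
agree_or_na <- function(col) if (length(unique(col)) == 1) col[1] else NA
option1 <- as.data.frame(lapply(group, agree_or_na), stringsAsFactors = FALSE)

# Option 2: keep the entry from the most trusted database.
# which.min() returns the first minimum (top-down) and ignores databases
# that are not listed in preferences (their rank is NA).
rank <- match(group$sourcedb, preferences)
option2 <- group[which.min(rank), ]

# Option 3: merge column by column in preference order, then drop sourcedb.
# Assumption: the first non-NA value in preference order wins.
first_non_na <- function(col) c(col[!is.na(col)], NA)[1]
ranked <- group[order(rank, na.last = NA), ]  # na.last = NA removes unlisted databases
option3 <- as.data.frame(
  lapply(ranked[names(ranked) != "sourcedb"], first_non_na),
  stringsAsFactors = FALSE
)
```

In this sketch, Option 2 keeps the entry from database A even though its c14age is missing, whereas Option 3 fills that gap with the value from database B.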
The option log adds a new column duplicate_remove_log that documents the variety of values provided by all databases for a duplicated date.
c14bazAAR::remove_duplicates() needs the column duplicate_group and calls c14bazAAR::mark_duplicates() if it is missing.
mark_duplicates(x)
## Default S3 method:
mark_duplicates(x)
## S3 method for class 'c14_date_list'
mark_duplicates(x)
remove_duplicates(x, preferences = NULL, supermerge = FALSE, log = TRUE)
## Default S3 method:
remove_duplicates(x, preferences = NULL, supermerge = FALSE, log = TRUE)
## S3 method for class 'c14_date_list'
remove_duplicates(x, preferences = NULL, supermerge = FALSE, log = TRUE)
x | an object of class c14_date_list
preferences | character vector with the order of source databases by which the deduplication should be executed. For example, if preferences = c("radon", "calpal") and a certain date appears in radon and euroevol, then only the radon entry remains. Default: NULL. With preferences = NULL, all conflicting information in individual columns of one duplicated date is set to NA (Option 1). See Options 2 and 3.
supermerge | logical. Should the duplicated datasets be merged on the column level? Default: FALSE. See Option 3.
log | logical. If log = TRUE, an additional column is added that contains a string documenting all variants of the information for one date from all conflicting databases. Default: TRUE.
an object of class c14_date_list with the additional column duplicate_group (mark_duplicates()) or duplicate_remove_log (remove_duplicates() with log = TRUE)
library(c14bazAAR)
library(magrittr)
test_data <- tibble::tribble(
~sourcedb, ~labnr, ~c14age, ~c14std,
"A", "lab-1", 1100, 10,
"A", "lab-1", 2100, 20,
"B", "lab-1", 3100, 30,
"A", "lab-2", NA, 10,
"B", "lab-2", 2200, 20,
"C", "lab-3", 1300, 10
) %>% as.c14_date_list()
# mark duplicates
test_data %>% mark_duplicates()
# remove duplicates with option 1:
test_data %>% remove_duplicates()
# remove duplicates with option 2:
test_data %>% remove_duplicates(
  preferences = c("A", "B")
)
# remove duplicates with option 3:
test_data %>% remove_duplicates(
  preferences = c("A", "B"),
  supermerge = TRUE
)