damage_gold_standard: Generate a linkage file by damaging the gold standard file.

View source: R/gen_linkage_file.R

Description

damage_gold_standard damages the gold_standard file to produce a linkage file. The damage actions are instructed by the error flags in syn_error_occurrence. These actions are:

  1. missing: assign 'NA' to the flagged data point;

  2. del: randomly delete one character from the flagged data point;

  3. trans_char: randomly transpose two neighbouring characters on the flagged data point;

  4. trans_date: randomly transpose the day and the month of a date on the flagged data point;

  5. insert: randomly insert one character into the flagged data point;

  6. typo: randomly assign a typo error to the flagged data point;

  7. ocr: randomly assign an OCR error to the flagged data point;

  8. pho: randomly assign a phonetic error to the flagged data point;

  9. variant: randomly assign a name variant to the flagged data point.
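As an illustration of what the character-level and date actions listed above do, here is a minimal base R sketch. The helper names (delete_char, transpose_chars, insert_char, transpose_date) are hypothetical and are not the sdglinkage implementations; the ISO "YYYY-MM-DD" date format and the use of lowercase letters for insertion are also assumptions made for the example.

```r
# Hypothetical helpers illustrating four of the damage actions.
# They are NOT the functions used inside sdglinkage.

# del: remove one randomly chosen character.
delete_char <- function(x) {
  n <- nchar(x)
  if (n < 1) return(x)
  i <- sample.int(n, 1)
  paste0(substr(x, 1, i - 1), substr(x, i + 1, n))
}

# trans_char: swap two neighbouring characters.
transpose_chars <- function(x) {
  n <- nchar(x)
  if (n < 2) return(x)
  i <- sample.int(n - 1, 1)
  chars <- strsplit(x, "")[[1]]
  chars[c(i, i + 1)] <- chars[c(i + 1, i)]
  paste(chars, collapse = "")
}

# insert: add one random lowercase letter at a random position.
insert_char <- function(x) {
  n <- nchar(x)
  i <- sample.int(n + 1, 1) - 1  # characters kept before the insertion
  paste0(substr(x, 1, i), sample(letters, 1), substr(x, i + 1, n))
}

# trans_date: swap the day and the month of a "YYYY-MM-DD" string
# (the date format is an assumption for this sketch).
transpose_date <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  paste(parts[1], parts[3], parts[2], sep = "-")
}

set.seed(1)
delete_char("margaret")
transpose_chars("margaret")
insert_char("margaret")
transpose_date("1985-03-12")  # "1985-12-03"
```

The random actions produce a different result on each call unless a seed is set; only trans_date is deterministic for a given input.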

Usage

damage_gold_standard(gold_standard, syn_error_occurrence)

Arguments

gold_standard

A data frame of the gold standard dataset, see add_variable.

syn_error_occurrence

A data frame of one-hot encoded error flags, see bn_flag_inference.
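To make the idea of one-hot encoded error flags concrete, the sketch below builds a small flag table by hand and applies the 'missing' action manually. The column names and values are illustrative assumptions; the actual columns produced by bn_flag_inference may be named and laid out differently.

```r
# Hypothetical gold standard: three records, two variables.
gold <- data.frame(
  age       = c(34, 51, 27),
  firstname = c("Ann", "Bob", "Cy"),
  stringsAsFactors = FALSE
)

# Hypothetical flag table: one row per record, one 0/1 column per
# (variable, error type) pair. Column names are illustrative only.
flags <- data.frame(
  age_missing       = c(0, 1, 0),
  firstname_variant = c(1, 0, 0)
)

# Applying the 'missing' action by hand: flagged ages become NA.
damaged <- gold
damaged$age[flags$age_missing == 1] <- NA
damaged$age  # 34 NA 27
```

Each flag column marks which records receive a given error on a given variable, so the flag table must have one row per record in the gold standard.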

Value

A list of two data frames: i) the linkage_file, which has the same dimensions as the gold_standard but with some of the variables damaged; ii) the error_log, which records the damage made to the linkage file.

Examples

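library(sdglinkage)
# Add two error flags (age_missing, firstname_variant) to the first 50 adult records.
adult_with_flag <- add_random_error(adult[1:50, ], prob = c(0.97, 0.03), "age_missing")
adult_with_flag <- add_random_error(adult_with_flag, prob = c(0.65, 0.35), "firstname_variant")
# Split the flagged data; the training set is used to learn the network.
adult_with_flag <- split_data(adult_with_flag, 70)
# Evidence constraints applied when generating synthetic data.
bn_evidence <- "age >=18 & capital_gain>=0 & capital_loss >=0 &
                hours_per_week>=0 & hours_per_week<=100"
# Learn a Bayesian network with hill-climbing ("hc") and generate synthetic data.
bn_learn <- gen_bn_learn(adult_with_flag$training_set, "hc", bn_evidence)
dataset_smaller_version <- bn_learn$gen_data
# Drop the flag columns to keep only the synthetic variables.
syn_dependent <- dataset_smaller_version[, !grepl("flag", colnames(dataset_smaller_version))]
# Add a firstname variable that depends on gender and age.
gold_standard <- add_variable(syn_dependent, "firstname", country = "uk",
                              gender_dependency = TRUE, age_dependency = TRUE)
# Infer the error flags for the synthetic data from the fitted model.
syn_error_occurrence <- bn_flag_inference(dataset_smaller_version, bn_learn$fit_model)
# Damage the gold standard; the result is a list holding the linkage file and the error log.
linkage_file <- damage_gold_standard(gold_standard, syn_error_occurrence)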

sdglinkage documentation built on April 27, 2020, 5:09 p.m.