compare_sdg: Compare the performance of generators.
In sdglinkage: Synthetic Data Generation for Linkage Methods Development


compare_sdg compares the predictive performance of models trained on synthetic data with that of models trained on real data.

compare_sdg(
  learner,
  measurement,
  target_var,
  real_dataset,
  generated_data1,
  generated_data2 = NA,
  generated_data3 = NA,
  generated_data4 = NA,
  generated_data5 = NA,
  generated_data6 = NA
)

`learner`	A learner object from `makeLearners`.
`measurement`	A list of performance measurements for `benchmark`.
`target_var`	A string of the response variable name.
`real_dataset`	A list of data frames with a training_set data frame and a testing_set data frame. You can get this list from `split_data`.
`generated_data1`	A data frame of synthetic data 1.
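`generated_data2`	Optional. A data frame of synthetic data 2. Defaults to `NA`.
`generated_data3`	Optional. A data frame of synthetic data 3. Defaults to `NA`.
`generated_data4`	Optional. A data frame of synthetic data 4. Defaults to `NA`.
`generated_data5`	Optional. A data frame of synthetic data 5. Defaults to `NA`.
`generated_data6`	Optional. A data frame of synthetic data 6. Defaults to `NA`.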

This function returns the measured performance of predictive models trained on the synthetic data. We assume that good-quality synthetic data allow us to draw the same analytic conclusions as the real data. Hence, we compare two settings: machine learning models trained on the synthetic data and tested on real data, and the same models trained and tested on real data.

The output is a benchmark object. It compares the predictive performance of the selected models trained on the real training data with that of the same models trained on the generated data, with both validated on the real testing data.
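The comparison described above can be sketched in base R. This is a toy illustration under stated assumptions, not the package's implementation: the "real" data are simulated, the "synthetic" data are a hypothetical stand-in produced by bootstrapping and jittering the real training rows, and `holdout_accuracy` is a helper name introduced here for illustration only.

```r
# Toy illustration only: the "real" data are simulated, and the "synthetic"
# data are a stand-in for the output of a data generator.
set.seed(42)

n <- 400
real <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
real$y <- factor(ifelse(real$x1 + 0.5 * real$x2 + rnorm(n, sd = 0.5) > 0,
                        "pos", "neg"))

# Split the real data into a training set and a testing set (70/30).
train_idx  <- sample(n, size = round(0.7 * n))
real_train <- real[train_idx, ]
real_test  <- real[-train_idx, ]

# Hypothetical stand-in for a generator: bootstrap the real training rows
# and add a little noise to the numeric columns.
synthetic <- real_train[sample(nrow(real_train), replace = TRUE), ]
synthetic$x1 <- synthetic$x1 + rnorm(nrow(synthetic), sd = 0.1)
synthetic$x2 <- synthetic$x2 + rnorm(nrow(synthetic), sd = 0.1)

# Accuracy of a logistic regression fitted on `train` and evaluated on `test`.
holdout_accuracy <- function(train, test) {
  fit  <- glm(y ~ x1 + x2, data = train, family = binomial())
  prob <- predict(fit, newdata = test, type = "response")
  pred <- ifelse(prob > 0.5, levels(train$y)[2], levels(train$y)[1])
  mean(pred == test$y)
}

# Both models are scored on the same real testing set.
acc_real      <- holdout_accuracy(real_train, real_test)
acc_synthetic <- holdout_accuracy(synthetic, real_test)
c(real = acc_real, synthetic = acc_synthetic)
```

If the two accuracies are close, the synthetic data support roughly the same predictive conclusions as the real data, which is the assumption the function is built on.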

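library(mlr)
library(sdglinkage)

# Keep a subset of the columns of the adult data set; income is the response.
adult_data <- adult[c('age', 'race', 'sex', 'capital_gain', 'capital_loss',
                      'hours_per_week', 'income')]

# Split the first 100 rows into training and testing sets.
adult_data <- split_data(adult_data[1:100, ], 70)

# Generate synthetic data from a Bayesian network learned on the training set.
bn_learn <- gen_bn_learn(adult_data$training_set, "hc")

# Learners and performance measures to compare.
lrns <- makeLearners(c("rpart", "logreg"), type = "classif", predict.type = "prob")
measurements <- list(acc, ber)

bmr <- compare_sdg(lrns,
                   measurement = measurements,
                   target_var = "income",
                   real_dataset = adult_data,
                   generated_data1 = bn_learn$gen_data)

# Label the results by the data set each model was trained on.
names(bmr$results) <- c("real_dataset", "bn_learn")
bmr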
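
Assuming the returned benchmark object is an mlr `BenchmarkResult` (the description above only calls it a benchmark object), its aggregated measures can be pulled into a data frame with mlr's `getBMRAggrPerformances`. The sketch below builds a small stand-in benchmark on mlr's built-in `iris.task` so that it runs without sdglinkage; `bmr_demo` and `perf` are illustrative names, and the stand-in is not the output of `compare_sdg`.

```r
library(mlr)

set.seed(1)

# Stand-in benchmark: one learner, one built-in task, a simple holdout split.
lrn      <- makeLearner("classif.rpart")
rdesc    <- makeResampleDesc("Holdout")
bmr_demo <- benchmark(lrn, iris.task, rdesc,
                      measures = list(acc, ber), show.info = FALSE)

# One row per task/learner pair, one column per aggregated measure.
perf <- getBMRAggrPerformances(bmr_demo, as.df = TRUE)
perf
```

Under the same assumption, calling the function on `bmr` from the example above would give one row per training data set and learner, which makes the real and synthetic results easy to compare side by side.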