Synthetic Data Generation and Evaluation
In sdglinkage: Synthetic Data Generation for Linkage Methods Development

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = '../')

library(sdglinkage)
set.seed(1234)

In this vignette, we show how to use sdglinkage to generate synthetic datasets and how to compare the quality of the synthetic data produced by different generators.

Assumption:
We have a real dataset and we would like to generate a synthetic version of it.
Aim:
To generate a synthetic dataset using different approaches.
To give a visual comparison of the generated synthetic data and the real dataset.
To compare the predictive performance of models trained on the generated synthetic data with that of models trained on the real dataset.

Here we use the 'Adult' dataset as an example. The Adult dataset was extracted from the US Census database in 1994; it contains 48,842 individual records with 13 personal variables. It is often used for a prediction task: determining whether a person makes over $50,000 a year given personal information. Here we use 70% of the data for training and the rest for evaluation.

adult_data <- split_data(adult, 70)
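For readers who want to see what a 70/30 split amounts to, here is a minimal base R sketch on a toy data frame. The helper `split_sketch`, the element name `holdout_set`, and the toy data are hypothetical; the package's `split_data` may differ in details such as how rows are shuffled or how the evaluation part is named.

```r
# Hypothetical helper: keep a percentage of rows for training.
# An illustration only, not the implementation of split_data().
split_sketch <- function(df, train_pct) {
  n_train <- floor(nrow(df) * train_pct / 100)
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(training_set = df[train_idx, , drop = FALSE],
       holdout_set  = df[-train_idx, , drop = FALSE])
}

set.seed(1234)
toy <- data.frame(age = sample(18:90, 200, replace = TRUE),
                  sex = sample(c("Female", "Male"), 200, replace = TRUE))
toy_split <- split_sketch(toy, 70)
nrow(toy_split$training_set)  # 140 rows for training
nrow(toy_split$holdout_set)   # 60 rows for evaluation
```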

Generator 1: Bayesian Networks (BN) Learned by Structure Learning Algorithm

First, we need to define some constraints/evidence so that we only generate synthetic data that is realistic. For example, as the name of the dataset 'Adult' suggests, the age of all individuals should be >= 18, capital_gain and capital_loss should be non-negative, and hours_per_week should lie between 0 and 100.

bn_evidence <- "age >=18 & capital_gain>=0 & capital_loss >=0 & hours_per_week>=0 & hours_per_week<=100"
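One way to read such an evidence string is as a row filter: a sampled record is kept only if the whole expression evaluates to TRUE. The sketch below applies that idea with base R to a small, made-up data frame. It illustrates the concept only and makes no claim about how the package enforces evidence internally.

```r
# Toy "sampled" records, some of which violate the constraints.
sampled <- data.frame(
  age            = c(25, 12, 40, 67),
  capital_gain   = c(0, 500, -10, 2000),
  capital_loss   = c(0, 0, 0, 100),
  hours_per_week = c(40, 20, 38, 120)
)

evidence <- "age >= 18 & capital_gain >= 0 & capital_loss >= 0 & hours_per_week >= 0 & hours_per_week <= 100"

# Evaluate the expression with the data frame columns in scope.
keep <- eval(parse(text = evidence), envir = sampled)

# Rows 2, 3 and 4 fail on age, capital_gain and hours_per_week respectively.
realistic <- sampled[keep, , drop = FALSE]
realistic
```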

We use hill-climbing (hc) as our structure learning algorithm, so that both the structure and the parameters of our BN are learned from the training data.

bn_learn <- gen_bn_learn(adult_data$training_set, "hc", bn_evidence)

This is the structure of the learned BN:

plot_bn(bn_learn$structure)

This is the synthetic data sampled from the learned BN:

head(bn_learn$gen_data)

Generator 2: BNs Learned from Expert Knowledge and Data

Here we elicit the dependencies between the variables in the dataset from an expert (in this example the expert is the author, purely for illustration).

bn_structure <- "[native_country][income][age|marital_status:education][sex][race|native_country][marital_status|race:sex][relationship|marital_status][education|sex:race][occupation|education][workclass|occupation][hours_per_week|occupation:workclass][capital_gain|occupation:workclass:income][capital_loss|occupation:workclass:income]"

We learn the parameters of the elicited BN using maximum likelihood estimation and sample synthetic data subject to the previously defined constraints/evidence.

bn_elicit <- gen_bn_elicit(adult_data$training_set, bn_structure, bn_evidence)
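For discrete nodes, maximum likelihood estimation reduces to relative frequencies, and sampling proceeds from parents to children. The following base R sketch shows this for a hypothetical two-node network `[sex][job|sex]` with made-up data. The variable names are illustrative, and the sketch is not the package's implementation.

```r
set.seed(1234)
# Toy training data for a two-node network [sex][job|sex].
train <- data.frame(
  sex = c("F", "F", "F", "M", "M", "M", "M", "M"),
  job = c("clerk", "clerk", "tech", "tech", "tech", "tech", "clerk", "tech")
)

# Maximum likelihood estimates are relative frequencies.
p_sex <- prop.table(table(train$sex))                          # P(sex)
p_job_given_sex <- prop.table(table(train$sex, train$job), 1)  # P(job | sex)

# Ancestral sampling: draw the parent first, then the child given the parent.
n <- 1000
syn_sex <- sample(names(p_sex), n, replace = TRUE, prob = as.numeric(p_sex))
syn_job <- vapply(syn_sex, function(s) {
  sample(colnames(p_job_given_sex), 1, prob = p_job_given_sex[s, ])
}, character(1), USE.NAMES = FALSE)
synthetic <- data.frame(sex = syn_sex, job = syn_job)

p_job_given_sex  # each row sums to 1
```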

This is the structure of the elicited BN:

plot_bn(bn_elicit$structure)

This is the synthetic data sampled from the elicited BN:

head(bn_elicit$gen_data)

Generator 3: Classification and Regression Tree (CART)

Here we use the previously elicited structure to define the sequence in which a classification or regression tree is fitted for each variable.

cart_elicit <- gen_cart(adult_data$training_set, bn_structure)

This is the synthetic data generated from the elicited CART:

head(cart_elicit$gen_data)
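The idea of sequential tree-based synthesis can be sketched with the rpart package on toy data: the first variable is resampled from its marginal distribution, and each later variable is drawn from a tree fitted on the variables that precede it. The two-variable sequence and the simulated data below are assumptions made for illustration; `gen_cart` itself may work differently.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(1234)
n <- 500
sex <- factor(sample(c("Female", "Male"), n, replace = TRUE))
# In this toy data, workclass depends on sex.
workclass <- factor(ifelse(runif(n) < ifelse(sex == "Female", 0.7, 0.3),
                           "Private", "Gov"))
real <- data.frame(sex = sex, workclass = workclass)

# Step 1: resample the first variable in the sequence from its marginal.
syn <- data.frame(sex = sample(real$sex, n, replace = TRUE))

# Step 2: fit a classification tree for the next variable given earlier ones,
tree <- rpart(workclass ~ sex, data = real, method = "class")

# then draw each synthetic value from the predicted class probabilities.
probs <- predict(tree, newdata = syn, type = "prob")
syn$workclass <- factor(
  apply(probs, 1, function(p) sample(colnames(probs), 1, prob = p)),
  levels = levels(real$workclass)
)
head(syn)
```

Sampling from the predicted probabilities, rather than taking the most likely class, is what keeps the variability of the original variable in the synthetic column.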

This gives a comparison of the synthetic data with the real data from the training set.

compare_cart(adult_data$training_set, cart_elicit$fit_model, c("age", "workclass", "sex"))

Evaluation of the Synthetic Data Generated by These Generators

We compare the synthetic data generated by these three generators with the real data from the training set.

Here is a discrete variable:

plot_compared_sdg(target_var = "race", training_set = adult_data$training_set,
                   syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
                   generated_data1 = cart_elicit$gen_data,
                   generated_data2 = bn_learn$gen_data,
                   generated_data3 = bn_elicit$gen_data)
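The same kind of comparison for a discrete variable can also be made numerically by tabulating category proportions in the real and synthetic data. The base R sketch below uses made-up values and a hypothetical helper `prop_by_level`; the total variation distance is added here as one convenient summary and is not something the package is known to report.

```r
# Toy real and synthetic columns for one discrete variable.
real_race <- c("White", "White", "Black", "White", "Asian", "White", "Black", "White")
syn_race  <- c("White", "Black", "Black", "White", "Black", "White", "Asian", "White")

# Hypothetical helper: proportion of each level in a character vector.
prop_by_level <- function(x, lv) vapply(lv, function(l) mean(x == l), numeric(1))

lv <- sort(unique(c(real_race, syn_race)))
props <- rbind(real      = prop_by_level(real_race, lv),
               synthetic = prop_by_level(syn_race, lv))
round(props, 3)

# Total variation distance: 0 means identical proportions, 1 means disjoint.
tvd <- 0.5 * sum(abs(props["real", ] - props["synthetic", ]))
tvd
```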

Here is a continuous variable:

plot_compared_sdg(target_var = "age", training_set = adult_data$training_set,
                   syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
                   generated_data1 = cart_elicit$gen_data,
                   generated_data2 = bn_learn$gen_data,
                   generated_data3 = bn_elicit$gen_data)

We assume that good-quality synthetic data would allow us to draw the same analytic conclusions as real data. Hence, we compare the predictive performance of several machine learning algorithms trained on synthetic data and tested on real data against the same algorithms trained and tested on real data. The prediction task uses the variable 'income': determining whether a person makes over $50,000 a year given personal information.

library(mlr)
lrns <- makeLearners(c("rpart", "logreg"), type = "classif",
                     predict.type = "prob")
# lrns <- makeLearners(c("rpart", "logreg", "randomForest"), type = "classif",
#                      predict.type = "prob")
measurements <- list(acc, ber, f1, auc)
bmr <- compare_sdg(lrns, measurement = measurements, target_var = "income",
                      real_dataset = adult_data,
                      generated_data1 = cart_elicit$gen_data,
                      generated_data2 = bn_learn$gen_data,
                      generated_data3 = bn_elicit$gen_data)
names(bmr$results) <- c("Real_dataset", "CART_elicit", "BN_learn", "BN_elicit")

In this example, models trained on data from CART_elicit and BN_learn both achieve predictive performance very similar to that of models trained on the real dataset.
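The train-on-synthetic, test-on-real comparison used in this evaluation can be sketched in base R with a logistic regression. Everything below is an assumption made for illustration: the simulator `make_data` stands in for both the real data and a generator, and accuracy is the only measure computed.

```r
set.seed(1234)
# Hypothetical simulator standing in for both "real" and "synthetic" data.
make_data <- function(n) {
  age <- round(runif(n, 18, 80))
  hours <- round(runif(n, 10, 70))
  p <- plogis(-6 + 0.05 * age + 0.08 * hours)
  data.frame(age = age, hours = hours,
             income = factor(ifelse(runif(n) < p, ">50K", "<=50K"),
                             levels = c("<=50K", ">50K")))
}
real_train <- make_data(1000)
real_test  <- make_data(500)
synthetic  <- make_data(1000)  # pretend this came from a generator

# Fit on the given training data, always score on the real test set.
accuracy_on_real_test <- function(train) {
  fit <- glm(income ~ age + hours, data = train, family = binomial)
  prob_high <- predict(fit, newdata = real_test, type = "response")
  pred <- ifelse(prob_high > 0.5, ">50K", "<=50K")
  mean(pred == real_test$income)
}

acc_real <- accuracy_on_real_test(real_train)  # train real, test real
acc_syn  <- accuracy_on_real_test(synthetic)   # train synthetic, test real
c(real = acc_real, synthetic = acc_syn)
```

If the two accuracies are close, the synthetic data preserved the relationships this model relies on; a large gap would suggest the generator lost them.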