ganGenerativeData-package: Generate generative data for a data source

ganGenerativeData-package    R Documentation

Generate generative data for a data source

Description

Generative Adversarial Networks are applied to generate generative data for a data source. A generative model consisting of a generator network and a discriminator network is trained. During iterative training, the distribution of the generated data converges to that of the data source.
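For orientation, the training objective of a generative adversarial network as formulated in the Goodfellow et al. reference listed under References can be written as below. Here G is the generator, D the discriminator, p_data the distribution of the data source and p_z the noise distribution fed to the generator. This is the objective from the cited paper; the page does not state which exact loss the package implements, so it should be read as background rather than as a description of the package internals.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```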

Generated data can be written to a file during training, or after training has finished in a separate generation step. The first method accumulates generative data using a dynamic model; the second method generates generative data using a static model.

Inserted images show two-dimensional projections of generative data for the iris dataset:

[Images: gd34d.png, gd12d.png, gd34ddv.png, gd12ddv.png]

Details

The API includes functions for the topics "definition of data source" and "generation of generative data". The main function of the first topic is dsCreateWithDataFrame(), which creates a data source from a passed data frame. The main functions of the second topic are gdTrain(), which trains a generative model for a data source, and gdGenerate(), which uses a trained generative model to generate generative data. Additionally, a software service for accelerated training of generative models is available.

1. Definition of data source

dsCreateWithDataFrame() Create a data source with passed data frame.

dsActivateColumns() Activate columns in a data source in order to include them in training of generative models. By default columns are active.

dsDeactivateColumns() Deactivate columns in a data source in order to exclude them from training of generative models. Note that the training function in the package supports only columns of R-class numeric, R-type double. Columns of any other type have to be deactivated. The training function in the software service for accelerated training of generative models supports columns of any type.

dsGetActiveColumnNames() Get names of active columns of a data source.

dsGetInactiveColumnNames() Get names of inactive columns of a data source.

dsWrite() Write created data source including settings of active columns to a file in binary format. This file will be used as input in functions of topic "generation of generative data".

dsRead() Read a data source from a file that was written with dsWrite().

dsGetNumberOfRows() Get number of rows in a data source.

dsGetRow() Get a row in a data source.
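
Because the training function in the package accepts only columns of R-class numeric, it can help to work out which columns of a data frame need to be deactivated before calling dsDeactivateColumns(). The following is a minimal base R sketch for the iris data frame; the variable name nonNumeric is made up for this illustration, and passing the result on to dsDeactivateColumns() is shown only as a comment, following the call pattern used in the Examples section.

```r
# Indices of columns whose class is not exactly "numeric" (which implies
# R-type double). Factor, integer and character columns are all flagged.
nonNumeric <- which(!vapply(iris,
                            function(x) identical(class(x), "numeric"),
                            logical(1)))
print(nonNumeric)

# Assumed follow-up step, as in the package examples (not run here):
# dsCreateWithDataFrame(iris)
# dsDeactivateColumns(unname(nonNumeric))
```

For iris only the factor column Species is flagged, which matches the index 5 that is deactivated in the Examples section.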

2. Training of generative model and generation of generative data

gdTrainParameters() Specify parameters for training of generative model.

gdTrain() Read a data source from a file, train a generative model that generates generative data for the data source in iterative training steps, write trained generative model and generated data in training steps to a file in binary format.

gdGenerateParameters() Specify parameters for generation of generative data.

gdGenerate() Read a generative model and a data source from a file, generate generative data for the data source and write generated data to a file in binary format.

gdCalculateDensityValues() Read generative data from a file, calculate density values and write generative data with density values to original file.

gdRead() Read generative data and data source from specified files.

gdPlotParameters() Specify plot parameters for generative data.

gdPlotDataSourceParameters() Specify plot parameters for data source.

gdPlotProjection() Create an image file containing two-dimensional projections of generative data and data source.

gdGetNumberOfRows() Get number of rows in generative data.

gdGetRow() Get a row in generative data.

gdCalculateDensityValue() Calculate density value for a data record.

gdDensityValueQuantile() Calculate density value quantile for a percent value.

gdDensityValueInverseQuantile() Calculate inverse density value quantile for a density value.

gdKNearestNeighbors() Search for k nearest neighbors in generative data.

gdComplete() Complete incomplete data record.

gdWriteSubset() Write subset of generative data.
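
The ideas behind the quantile, inverse quantile and nearest neighbor functions listed above can be illustrated with base R alone. The sketch below is not the package's implementation: the density values in dv are made-up numbers, the numeric iris columns stand in for generative data, and plain Euclidean distance is assumed for the neighbor search. Whether the package uses the same quantile definition, percent scale or distance measure is not stated on this page.

```r
# Made-up density values standing in for those of generated rows.
dv <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# Quantile for a percent value: the density value at the 50 percent level.
q50 <- quantile(dv, probs = 50 / 100, names = FALSE)

# Inverse quantile: percentage of values less than or equal to a given
# density value, taken from the empirical distribution function.
p <- 100 * ecdf(dv)(0.3)

# k nearest neighbors of a record by Euclidean distance.
x <- as.matrix(iris[, 1:4])
record <- c(5.1, 3.5, 1.4, 0.2)
d <- sqrt(rowSums(sweep(x, 2, record)^2))  # distance of every row to record
nn <- order(d)[1:3]                        # row indices of the 3 closest rows

print(q50)
print(p)
print(nn)
```

The record used here equals the first row of iris, so that row is returned as the closest neighbor with distance zero.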

3. Software service for accelerated training of generative models

gdServiceTrain() Send a request to software service to train a generative model.

gdServiceGetGenerativeData() Get generated generative data from software service.

gdServiceGetGenerativeModel() Get trained generative model from software service.

gdServiceGetStatus() Get status of generated job from software service.

gdServiceDelete() Delete generated job from software service.
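
Since a training job on the software service runs asynchronously, a caller typically has to check the job status repeatedly before fetching results. The sketch below shows one way to do that with a small polling helper. The helper waitForStatus(), its parameters, and the status "PENDING" used in the demonstration are hypothetical; it is also an assumption that gdServiceGetStatus() returns the status as a character string that can be compared with "TRAINED".

```r
# Hypothetical helper: call getStatus() until it returns the target status
# or the maximum number of attempts is reached. Returns TRUE on success.
waitForStatus <- function(getStatus, target = "TRAINED",
                          interval = 30, maxTries = 20) {
  for (i in seq_len(maxTries)) {
    if (identical(getStatus(), target)) {
      return(TRUE)
    }
    Sys.sleep(interval)  # wait before asking again
  }
  FALSE
}

# Self-contained demonstration with a fake status function that reports
# "TRAINED" on its third call ("PENDING" is a made-up placeholder status).
calls <- 0
fakeStatus <- function() {
  calls <<- calls + 1
  if (calls >= 3) "TRAINED" else "PENDING"
}
done <- waitForStatus(fakeStatus, interval = 0)
print(done)

# Assumed usage against the service (not run here):
# waitForStatus(function() gdServiceGetStatus(url, accessKey, jobId))
```

Passing the status lookup in as a function keeps the helper independent of the service, so it can be tried out without a URL or access key.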

Author(s)

Werner Mueller

Maintainer: Werner Mueller <werner.mueller5@chello.at>

References

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014), "Generative Adversarial Nets", <arXiv:1406.2661v1>

Examples

# Environment used for execution of examples:

# Operating system: Ubuntu 22.04.1
# Compiler: g++ 11.3.0 (supports C++17 standard)
# R applications: R 4.1.2, RStudio 2022.02.2
# Installed packages: 'Rcpp' 1.0.10, 'tensorflow' 2.11.0,
# 'ganGenerativeData' 2.1.3

# Package 'tensorflow' provides an interface to the machine learning framework
# TensorFlow. To complete the installation, the function install_tensorflow()
# has to be called.
## Not run: 
library(tensorflow)
install_tensorflow()
## End(Not run)

# Generate generative data for the iris dataset

# Load library
library(ganGenerativeData)

# 1. Definition of data source for the iris dataset

# Create a data source with iris data frame.
dsCreateWithDataFrame(iris)

# Deactivate the column with name Species and index 5 in order to exclude it
# from training of the generative model.
dsDeactivateColumns(c(5))

# Get the active column names: Sepal.Length, Sepal.Width, Petal.Length,
# Petal.Width.
dsGetActiveColumnNames()

# Write the data source including settings of active columns to file
# "ds.bin" in binary format.
## Not run: 
dsWrite("ds.bin")
## End(Not run)

# 2. Training of generative model and generation of generative data for the iris
# data source

# Read data source from file "ds.bin", train a generative model in iterative
# training steps (tests used between 10000 and 50000 iterations; this example
# uses 1000), write trained generative model and generated data in training
# steps to files "gm.bin" and "gd.bin".
## Not run: 
gdTrain("gm.bin", "gd.bin", "ds.bin", c(1, 2),
        gdTrainParameters(numberOfTrainingIterations = 1000))
## End(Not run)

# Read generative data from file "gd.bin", calculate density values and
# write generative data with density values to original file.
## Not run: 
gdCalculateDensityValues("gd.bin")
## End(Not run)

# Read generative data from file "gd.bin" and data source from "ds.bin". The
# data read in is accessed by the following function calls.
## Not run: 
gdRead("gd.bin", "ds.bin")
## End(Not run)

# Create an image showing two-dimensional projections of generative data and
# data source for column indices 3, 4 and write it to file "gd34d.png".
## Not run: 
gdPlotProjection("gd34d.png",
                 "Generative Data for the Iris Dataset",
                 c(3, 4),
                 gdPlotParameters(50),
                 gdPlotDataSourceParameters(100))
## End(Not run)

# Create an image showing two-dimensional projections of generative data and
# data source for column indices 3, 4 with density value threshold 0.38 and
# write it to file "gd34ddv.png".
## Not run: 
gdPlotProjection("gd34ddv.png",
                 "Generative Data with a Density Value Threshold for the Iris Dataset",
                 c(3, 4),
                 gdPlotParameters(50, c(0.38), c("red", "green")),
                 gdPlotDataSourceParameters(100))
## End(Not run)

# Get number of rows in generative data
## Not run: 
gdGetNumberOfRows()
## End(Not run)

# Get row with index 1000 in generative data
## Not run: 
gdGetRow(1000)
## End(Not run)

# Calculate density value for a data record
## Not run: 
gdCalculateDensityValue(list(6.1, 2.6, 5.6, 1.4))
## End(Not run)

# Calculate density value quantile for 50 percent
## Not run: 
gdDensityValueQuantile(50)
## End(Not run)

# Calculate inverse density value quantile for density value 0.5
## Not run: 
gdDensityValueInverseQuantile(0.5)
## End(Not run)

# Search for k nearest neighbors for a data record 
## Not run: 
gdKNearestNeighbors(list(5.1, 3.5, 1.4, 0.2), 3)
## End(Not run)

# Complete incomplete data record containing an NA value
## Not run: 
gdComplete(list(5.1, 3.5, 1.4, NA))
## End(Not run)

# Write subset containing 50 percent of randomly selected rows of
# generative data
## Not run: 
gdRead("gd.bin")
gdWriteSubset("gds.bin", 50)
## End(Not run)

# 3. Usage of software service for accelerated training of a generative
# model

# Initialize variables for URL and access key.
## Not run: 
url <- "http://xxx.xxx.xxx.xxx/gdService"
accessKey <- "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
## End(Not run)

# Send a request to software service to train a generative model for a data
# source. A job id will be returned.
## Not run: 
trainParameters <- gdTrainParameters(numberOfTrainingIterations = 10000,
                                     numberOfInitializationIterations = 2500)
jobId <- gdServiceTrain(url, accessKey, "gmService.bin", "gdService.bin",
                        "ds.bin", trainParameters)
## End(Not run)

# Get status of generated job from software service. When the job has been
# processed successfully, the status is set to TRAINED.
## Not run: 
gdServiceGetStatus(url, accessKey, jobId)
## End(Not run)

# Get generated generative data from software service for processed job
## Not run: 
gdServiceGetGenerativeData(url, accessKey, jobId, "gdService.bin")
## End(Not run)

# Get trained generative model from software service for processed job
## Not run: 
gdServiceGetGenerativeModel(url, accessKey, jobId, "gmService.bin")
## End(Not run)

ganGenerativeData documentation built on Oct. 7, 2024, 5:09 p.m.