knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The conjurer package generates synthetic data that resembles real data. Its functions generate distributions parametrically: the data generation remains random, but the user defines the constraints on that randomness. This controlled randomness makes it possible to generate multiple data distributions that simulate both realistic and deliberately unrealistic data. This paper describes the functions and their usage in more detail than the package manual. Each function is presented in its own subsection, with an overview of its purpose and examples with source code.
The function buildNum generates a continuous data distribution. In the context of this package, continuous refers to the float data type, not to continuity in the signal-processing sense. Although the generated values are floats, they can be rounded to simulate a discrete data distribution. At its core, the function uses a modified form of a sine curve, so the dispersion of the data can be skewed on purpose. The dispersion is controlled by the parameter disp, which takes a value between (-pi/2) and (pi/2). To make the data more realistic, the parameter outliers can be set to 1. Note that with outliers enabled, some values may fall outside the requested range, i.e. outside st and en. This functionality can be used to generate univariate distributions.
The following code illustrates the process of generating continuous data with and without outliers.
#invoke library
library("conjurer")
set.seed(123)
continuousData <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 0)
continuousDataOutlier <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 1)
par(mfrow=c(1,2))
plot(continuousData)
plot(continuousDataOutlier)
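Rounding the float output to simulate discrete data can be sketched as follows. This is a minimal illustration rather than an excerpt from the package documentation: the wider range (st = 0, en = 100) is an arbitrary choice, and the runif() fallback is a stand-in, used only so that the snippet still runs when conjurer is not installed.

```r
set.seed(123)

# Generate continuous values. The runif() branch is a stand-in (an assumption
# for illustration) so the sketch runs even without conjurer installed.
if (requireNamespace("conjurer", quietly = TRUE)) {
  continuousValues <- conjurer::buildNum(n = 10, st = 0, en = 100, disp = (pi/3), outliers = 0)
} else {
  continuousValues <- runif(10, min = 0, max = 100)
}

# Round to whole numbers to simulate a discrete distribution
discreteValues <- round(continuousValues)
print(discreteValues)
```

Rounding to a fixed number of decimals, for example round(continuousValues, 2), would give a coarser grid of values in the same way.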
The function buildNames generates string data. It uses a probabilistic distribution of letter sequences. Unlike more advanced algorithms such as conditional random fields, it takes a more basic approach: the probability of a letter given the letter preceding it. To this end, the function takes a data frame of strings from which the posterior probabilities are estimated. Because generation is based on these probabilities, the data frame needs to be large enough that all possible permutations of letters are present. If no data frame is provided, a default data frame containing a predetermined set of baby names is used.
The following code illustrates generating letter sequences based on the default data frame provided in the package, as well as on mocked-up data consisting of three short parts of a fictitious genome sequence.
#invoke library
library("conjurer")
set.seed(123)
buildNames(numOfNames = 3, minLength = 5, maxLength = 7)
d <- data.frame(first_column = c("ATGACGAGAGAGAGCA", "ATGACGAGAGAGCAGAGA", "TACTGCTCTCTCGTAAATCG"))
buildNames(dframe = d, numOfNames = 3, minLength = 5, maxLength = 5)
Note: Because the data frame of genome sequences is small, the package issues a warning that there is not enough training data.
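The idea of choosing each letter according to the letter preceding it can be sketched in base R. The code below is an illustration of that general technique (a first-order transition table), not the implementation used inside the package; the names transitionProbs and generateSequence are hypothetical.

```r
# Training strings: the same mocked-up genome fragments used above
trainingStrings <- c("ATGACGAGAGAGAGCA", "ATGACGAGAGAGCAGAGA", "TACTGCTCTCTCGTAAATCG")

# Collect every adjacent (previous letter, next letter) pair
pairs <- do.call(rbind, lapply(strsplit(trainingStrings, ""), function(ch) {
  data.frame(prev = head(ch, -1), nxt = tail(ch, -1), stringsAsFactors = FALSE)
}))

# Count the pairs and normalise each row to estimate P(next letter | previous letter)
transitionProbs <- prop.table(table(pairs$prev, pairs$nxt), margin = 1)

# Build one sequence by repeatedly sampling the next letter from the row of the current one
generateSequence <- function(len, probs) {
  out <- character(len)
  out[1] <- sample(rownames(probs), 1)
  for (i in seq_len(len - 1)) {
    out[i + 1] <- sample(colnames(probs), 1, prob = probs[out[i], ])
  }
  paste(out, collapse = "")
}

set.seed(123)
generatedSequence <- generateSequence(5, transitionProbs)
print(generatedSequence)
```

With so little training data, many transitions have a probability of zero and can never be generated, which is one way to see why a larger data frame is needed.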
The function buildId generates alphanumeric identifiers. In its current state, an identifier is a string prefix followed by an incrementing number. It can be used as a unique identifier of an element or, in a database schema, as the primary key of a table.
The following code illustrates the process of generating a unique specimen id for a given number of elements.
#invoke library
library("conjurer")
buildId(numOfItems = 3, prefix = "specID")
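To show how such identifiers can serve as a key, the following base R sketch builds prefix-plus-counter strings and attaches them to a small data frame. The helper makeIds is hypothetical, and the exact format produced by buildId (for example, any zero padding) may differ from what is shown here.

```r
# Hypothetical helper: a prefix followed by an incrementing number.
# The real buildId output format may differ (this is an assumption).
makeIds <- function(numOfItems, prefix) {
  paste0(prefix, seq_len(numOfItems))
}

specimens <- data.frame(
  id = makeIds(numOfItems = 3, prefix = "specID"),
  weight = c(1.2, 0.8, 1.5),  # made-up values, for illustration only
  stringsAsFactors = FALSE
)

# A primary key must be unique across all rows
keyIsUnique <- !anyDuplicated(specimens$id)
print(specimens)
```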
The function buildPattern generates a sequence, i.e. data that follows a predetermined pattern. It can be thought of as an intuitive form of a finite state automaton or a regular expression. A pattern is built as a probabilistic combination of parts.
The following code illustrates generating patterns of IP addresses and phone numbers. Each part is drawn according to the corresponding probabilities given in probs; where the code passes an empty c(), no probabilities are specified for that part.
#invoke library
library("conjurer")
set.seed(123)
parts <- list(c(172), c("."), c(16:31), c("."), c(0:255), c("."), c(0:255))
probs <- list(c(), c(), c(), c(), c(), c(), c())
buildPattern(n = 5, parts = parts, probs = probs)
parts <- list(c("+11","+44","+64"), c("-"), c(491,324,211), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(), c(0.30,0.60,0.10), c())
buildPattern(n = 5, parts = parts, probs = probs)
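The notion of a pattern as a probabilistic combination of parts can be sketched in base R as follows. The function patternSketch is hypothetical and is not the package's implementation; in particular, treating an empty c() as equal weights is an assumption made for this sketch.

```r
patternSketch <- function(n, parts, probs) {
  pieces <- lapply(seq_along(parts), function(i) {
    part <- parts[[i]]
    prob <- probs[[i]]  # NULL (from an empty c()) is treated as equal weights: an assumption
    # Sample positions rather than values so that a single-element part is returned as-is
    as.character(part[sample.int(length(part), n, replace = TRUE, prob = prob)])
  })
  # Concatenate the i-th draw of every part into the i-th pattern
  do.call(paste0, pieces)
}

set.seed(123)
ipParts <- list(c(172), c("."), c(16:31), c("."), c(0:255), c("."), c(0:255))
ipProbs <- list(c(), c(), c(), c(), c(), c(), c())
ipAddresses <- patternSketch(n = 5, parts = ipParts, probs = ipProbs)
print(ipAddresses)
```

Sampling positions with sample.int() avoids a known pitfall of sample(): when given a single number such as 172, it samples from 1:172 instead of returning 172.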
The function buildHierarchy generates hierarchical data, i.e. graph data with a tree structure. The tree is built from the number of levels and splits, and is then presented in the form of a data frame.
The following code illustrates the process of generating a tree with 2 splits at each node and a depth of three levels.
#invoke library
library("conjurer")
buildHierarchy(splits = 2, numOfLevels = 3)
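One way to picture a tree stored as a data frame is to write one row per root-to-leaf path, with one column per level. The base R sketch below does this under the assumption that every node has the same number of splits; the function hierarchySketch, the column names, and the node labels are hypothetical, and the layout returned by buildHierarchy may differ.

```r
hierarchySketch <- function(splits, numOfLevels) {
  # Enumerate every root-to-leaf path as a combination of child indices
  choices <- rep(list(seq_len(splits)), numOfLevels)
  names(choices) <- paste0("level", seq_len(numOfLevels))
  idx <- expand.grid(choices)

  # Label each node by the path leading to it, e.g. "1.2" is child 2 of node "1"
  labels <- idx
  for (k in seq_len(numOfLevels)) {
    labels[[k]] <- do.call(paste, c(unname(as.list(idx[seq_len(k)])), sep = "."))
  }

  # Sort so that siblings appear next to each other
  labels <- labels[do.call(order, unname(as.list(labels))), , drop = FALSE]
  rownames(labels) <- NULL
  labels
}

tree <- hierarchySketch(splits = 2, numOfLevels = 3)
print(tree)
```

In this representation, two splits over three levels give 2, 4, and 8 distinct nodes at levels one, two, and three respectively.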
The function buildPareto maps data elements to each other, that is, it links variables. Such a mapping supports multiple use cases, such as building a data frame from a set of variables or building the data distribution of one variable in relation to another.
The following code illustrates the process of generating a mapping between two factors such that 30 percent of one factor is linked to 70 percent of another factor.
#invoke library
library("conjurer")
set.seed(123)
f1 <- factor(c(1:10))
f2 <- factor(letters[1:12], labels = "f")
buildPareto(factor1 = f1, factor2 = f2, pareto = c(70,30))
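The kind of skewed mapping described above can be illustrated with a deterministic base R sketch. It is not the algorithm used by buildPareto: the function paretoSketch is hypothetical, and choosing the first levels of factor1 as the heavily linked group is an arbitrary assumption.

```r
paretoSketch <- function(factor1, factor2, pareto = c(80, 20)) {
  l1 <- levels(factor1)
  l2 <- levels(factor2)
  nHead1 <- max(1, round(pareto[2] / 100 * length(l1)))  # small share of factor1
  nHead2 <- round(pareto[1] / 100 * length(l2))          # large share of factor2
  head1 <- l1[seq_len(nHead1)]
  tail1 <- l1[-seq_len(nHead1)]
  # Cycle the head of factor1 over the large share of factor2, and the rest over the remainder
  assigned <- c(rep_len(head1, nHead2), rep_len(tail1, length(l2) - nHead2))
  data.frame(factor1 = assigned, factor2 = l2, stringsAsFactors = FALSE)
}

f1 <- factor(c(1:10))
f2 <- factor(letters[1:12], labels = "f")
mapping <- paretoSketch(factor1 = f1, factor2 = f2, pareto = c(70, 30))
print(mapping)
```

Here 3 of the 10 levels of f1 (30 percent) are linked to 8 of the 12 levels of f2 (70 percent of 12, rounded), and the remaining 7 levels of f1 share the other 4.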