fakeR: Simulates Data from a Data Frame of Different Variable Types

Motivation

In response to concerns about anonymity and user privacy when releasing datasets for public use, fakeR is a package that lets users simulate data from an existing dataset. It supports datasets of various variable types, including datasets containing categorical and quantitative variables as well as datasets of clustered time series observations. The package functions can also preserve a similar structure of missingness when the existing dataset contains missing values.

One potential workflow for anonymization using this package is to simulate fake data from the existing dataset and release the fake data to the public. Others can then run analyses on the fake data and privately share their scripts, which the data owner reruns on the real dataset. This procedure protects the anonymity of the individuals while allowing the analyses to be run on the real data for accurate end results. The amount of information from the original dataset shared in the simulated version can be specified, ranging from the approximate distribution, including covariances between variables, down to the variable type only, with the data encoded as random numbers. Further research is currently being done to test and analyze this method.

Examples

Simulate from time-independent data frame of multiple types

library(datasets)
library(fakeR)
library(stats)

# single column of an unordered, string factor
state_df <- data.frame(division=state.division)
# character variable
state_df$division <- as.character(state_df$division)
# numeric variable
state_df$area <- state.area
# factor variable
state_df$region <- state.region
state_sim <- simulate_dataset(state_df)

Notice how the function prints the variable types it detects while it is generating the simulated data.

head(state_df)
head(state_sim)

Note that the multivariate normal assumption used to generate numeric and ordered factor data is not always appropriate for the original data.
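One way to gauge whether the normality assumption is reasonable for a numeric column is a quick Shapiro-Wilk test before simulating. The sketch below uses only base R on the `state.area` column from the example above. The log transform is an illustrative choice, not something fakeR applies or recommends.

```r
library(datasets)
library(stats)

# Shapiro-Wilk test on the raw state areas; a small p-value suggests
# the values are unlikely to come from a normal distribution.
area_check <- shapiro.test(state.area)
print(area_check)

# Assumption for illustration only: a log transform is one common way
# to reduce right skew; compare the test statistic after transforming.
log_area_check <- shapiro.test(log(state.area))
print(log_area_check)
```

If a column departs strongly from normality, the simulated values for that column may look quite different from the original, so it is worth inspecting the simulated output before releasing it.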

Simulate from time-independent data frame with missingness & independence between variables

df <- mtcars
# change one of the variable types to an unordered factor
df$carb <- as.factor(df$carb)
# change another variable type to an ordered factor
df$gear <- as.ordered(as.factor(df$gear))
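# set every value in the second row to missing
df[2,] <- NA
# stealth.level=2 simulates the variables independently of one another;
# use.miss=TRUE carries the missingness over to the simulated data
sim_df <- simulate_dataset(df, stealth.level=2, ignore='mpg', use.miss=TRUE)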
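To see how closely the missingness carries over, the share of missing values per column can be compared between the original and simulated data frames. The helper `missing_share` below is a hypothetical name introduced for this sketch and is not part of fakeR. The block itself uses only base R, and the fakeR call is left as a comment.

```r
# Hypothetical helper (not part of fakeR): share of missing values per column.
missing_share <- function(d) colMeans(is.na(d))

# Rebuild the example data: mtcars with the second row set to missing.
df <- mtcars
df[2, ] <- NA
orig_miss <- missing_share(df)
print(round(orig_miss, 3))

# With fakeR installed, the same helper could be applied to the simulated data:
# sim_df <- simulate_dataset(df, use.miss=TRUE)
# round(missing_share(sim_df) - orig_miss, 3)
```

Differences near zero would indicate that the simulated data has roughly the same proportion of missing values in each column as the original.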

Simulate from time-dependent data frame

## time series dataframe
tree_ring <- data.frame(treering)
tree_ring$year <- seq_len(nrow(tree_ring))
sim_tree_ring <- simulate_dataset_ts(tree_ring, 
                                     cluster="treering", 
                                     time.variable="year")

# original series
plot(tree_ring$year, tree_ring$treering, type='l',
     main=paste("Original","Normalized ring width"),
     ylab="Ring width", xlab="Year index")
# simulated series (assumes the simulated data frame keeps the original column names)
plot(sim_tree_ring$year, sim_tree_ring$treering, type='l',
     main=paste("Simulated","Normalized ring width"),
     ylab="Ring width", xlab="Year index")