sim_from_dat: Simulate new data from a dataset's quantiles
In akcochrane/ACmisc: Aaron Cochrane's miscellaneous functions

sim_from_dat

R Documentation

Simulate new data from a dataset's quantiles

Description

Simulate new data from a dataset's quantiles

Usage

sim_from_dat(
  nSim,
  datIn,
  minPerClust = 100,
  nTries = 100,
  groups = NA,
  simplify = TRUE
)

Arguments

`nSim`	Number of simulated observations
`datIn`	Data to simulate from
`minPerClust`	[Optional] Minimum number of observations per cluster. If the data should be simulated separately from empirically defined subsets of the dataset, these subsets can be identified through k-means clustering. See Details.
`nTries`	Number of simulated datasets to generate. Only one (i.e., the one with the rank correlation closest to the original dataset's) will be returned.
`groups`	[Optional] A vector used to split the data, so that each group is simulated separately. Because each group's share of the total data is rounded, the final number of simulated values may differ slightly from `nSim`.
`simplify`	If `groups` is supplied, the simulated data are merged into a single data frame by default. Set `simplify = FALSE` to return the simulated data as a list instead.
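The rounding effect described for `groups` can be illustrated with a small worked example. This is only a sketch: it assumes that each group's number of simulated rows is obtained by rounding its proportional share of `nSim`, which may not match the exact rule used inside `sim_from_dat`. The variable names are hypothetical.

```r
# Assumption: each group's simulated size is round(nSim * group proportion).
n_sim  <- 250
groups <- iris$Species

# Each species makes up one third of the 150 rows in iris
props <- table(groups) / length(groups)

# 250 / 3 = 83.33..., which rounds down to 83 for every species
n_per_group <- round(n_sim * props)

n_per_group
sum(n_per_group)  # 249, one fewer than the 250 requested
```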

Details

Given an existing dataset, a new dataset is simulated from the quantiles of the original variables while attempting to match the rank correlation structure of the original.
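The general idea can be sketched in base R as follows. This is not the package's implementation: `sim_sketch` is a hypothetical helper that assumes a Gaussian-copula style approach (correlated normal draws mapped through each variable's empirical quantiles) and an assumed selection rule (the smallest sum of squared differences between Spearman matrices). It uses a Cholesky factor rather than mvtnorm, and it only approximately reproduces the target rank correlations.

```r
# Hypothetical helper, not part of ACmisc: simulate n_sim rows from the
# numeric columns of `dat` using empirical quantiles and rank correlations.
sim_sketch <- function(n_sim, dat, n_tries = 20) {
  # Keep numeric columns and complete rows only
  num <- dat[, vapply(dat, is.numeric, logical(1)), drop = FALSE]
  num <- num[stats::complete.cases(num), , drop = FALSE]

  target <- stats::cor(num, method = "spearman")
  chol_target <- chol(target)  # requires a positive-definite matrix

  best <- NULL
  best_err <- Inf
  for (i in seq_len(n_tries)) {
    # Correlated standard normals: independent draws times the Cholesky factor
    z <- matrix(stats::rnorm(n_sim * ncol(num)), nrow = n_sim) %*% chol_target
    u <- stats::pnorm(z)  # map each column to the unit interval

    # Push each column through the matching variable's empirical quantiles
    cand <- as.data.frame(mapply(
      function(x, p) stats::quantile(x, probs = p, names = FALSE),
      num, as.data.frame(u), SIMPLIFY = FALSE
    ))

    # Assumed criterion: squared distance between rank-correlation matrices
    err <- sum((stats::cor(cand, method = "spearman") - target)^2)
    if (err < best_err) {
      best <- cand
      best_err <- err
    }
  }
  best
}

set.seed(1)
sim <- sim_sketch(250, iris)

# Differences between simulated and original Spearman correlations
round(cor(sim, method = "spearman") - cor(iris[, 1:4], method = "spearman"), 2)
```

Because every simulated value is drawn from the original variable's quantiles, the simulated columns stay within the observed range of the data.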

It may be inappropriate to assume that rank correlations are homogeneous across an entire dataset; subsets of the data may show different patterns than the full dataset (e.g., Simpson's paradox). This can be addressed in two ways: by specifying the groups explicitly (through the `groups` argument) or by estimating groups empirically with k-means clustering. The two can also be combined, in which case empirical clustering is applied within each explicitly defined group.

If `minPerClust` is less than the number of observations in `datIn` (or less than the size of an explicitly defined group), k-means clustering (R's default `kmeans`) is run iteratively to find the maximum number of clusters for which the smallest cluster contains at least `minPerClust` observations. If at least two clusters satisfy this criterion, the simulation from quantiles and rank correlations is run separately for each cluster, and the results are concatenated in the returned data frame. In general, the smaller the value of `minPerClust`, (1) the closer the simulated data will be to the original data (including undesirable noise or other idiosyncrasies), and (2) the longer the run time will be. Because correlations are unlikely to be reliable with small numbers of observations, values of `minPerClust` below 20 are not recommended.
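One way to picture the cluster search is the sketch below. `find_clusters` is a hypothetical helper, not a function from ACmisc. It assumes the search increases the number of clusters one at a time and stops at the first value for which the smallest cluster falls below the minimum; the `max_k` cap and the `nstart` value are also assumptions made for illustration.

```r
# Hypothetical helper, not part of ACmisc: return cluster labels for the
# largest k at which every k-means cluster has >= min_per_clust members.
find_clusters <- function(x, min_per_clust, max_k = 10) {
  x <- as.matrix(x)
  best <- rep(1L, nrow(x))  # fall back to treating the data as one cluster

  for (k in 2:max_k) {
    # k clusters of the required size cannot fit in the data
    if (k * min_per_clust > nrow(x)) break

    fit <- stats::kmeans(x, centers = k, nstart = 10)

    # Stop as soon as the smallest cluster is too small
    if (min(table(fit$cluster)) < min_per_clust) break
    best <- fit$cluster
  }
  best
}

set.seed(1)
cl <- find_clusters(iris[, 1:4], min_per_clust = 40)
table(cl)  # cluster sizes; every cluster has at least 40 members
```

Each resulting cluster would then be simulated separately and the pieces row-bound together, in line with the description above.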

Requires the mvtnorm package.

Examples


library(ACmisc)

# Note that non-numeric variables (here, Species) will be dropped
new_iris_1 <- sim_from_dat(250, iris)

# For the iris dataset, this will simulate from separate empirically defined
# clusters to better reflect the clustered nature of the original data
new_iris_2 <- sim_from_dat(250, iris, minPerClust = 40)

# Alternatively, simulate separately within each explicitly defined group,
# here the iris Species
new_iris_3 <- sim_from_dat(250, iris, groups = iris$Species)

akcochrane/ACmisc documentation built on Nov. 24, 2024, 11:22 a.m.