sim_from_dat: Simulate new data from a dataset's quantiles

View source: R/sim_from_dat.R

sim_from_dat R Documentation

Simulate new data from a dataset's quantiles

Description

Simulate new data from a dataset's quantiles

Usage

sim_from_dat(
  nSim,
  datIn,
  minPerClust = 100,
  nTries = 100,
  groups = NA,
  simplify = TRUE
)

Arguments

nSim

Number of simulated observations

datIn

Data to simulate from

minPerClust

[Optional] Minimum number of observations per cluster. If you want to simulate from empirically-defined subsets of the dataset, these subsets can be identified through k-means clustering. See Details.

nTries

Number of simulated datasets to generate. Only one (i.e., the one with the rank correlation closest to the original dataset's) will be returned.

groups

If desired, a vector can be supplied to split the data so that each group is simulated separately. Because each group's share of the total is rounded, the final number of simulated values may differ slightly from nSim.

simplify

If groups is supplied, the simulated data are merged into a single data frame by default. To return the simulated data as a list instead, set simplify = FALSE.
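
The rounding effect described for groups can be illustrated with a little arithmetic. This is only a sketch: it assumes that each group's share of nSim is rounded independently, which the documentation implies but does not spell out, and the names n_sim and n_per_group are illustrative.

```r
# Assumption: each group's share of the requested total is rounded on its own.
groups <- iris$Species   # three groups of 50 observations each
n_sim <- 250

# Proportion of the data in each group, scaled to n_sim and rounded
n_per_group <- round(n_sim * table(groups) / length(groups))

n_per_group        # 83 simulated rows for each species
sum(n_per_group)   # 249, one fewer than the 250 requested
```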

Details

Given an existing dataset's rank correlation structure and the quantiles of its variables, a new dataset is simulated from those quantiles while attempting to match the rank correlations of the original.
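
One common way to do this kind of simulation is to draw correlated normal values, convert them to probabilities, and read those probabilities off each variable's empirical quantiles. The sketch below is not the package's code. sketch_sim_quantiles and its arguments are hypothetical names, it uses base R's chol() and rnorm() rather than mvtnorm so that it has no dependencies, and it assumes the Spearman matrix can be used directly as the correlation of the latent normal draws.

```r
# Hypothetical helper, not the package's implementation.
sketch_sim_quantiles <- function(n, dat, n_tries = 20) {
  # Keep numeric columns only (non-numeric variables are dropped)
  dat <- dat[, vapply(dat, is.numeric, logical(1)), drop = FALSE]
  target <- cor(dat, method = "spearman")

  # Assumption: the rank correlation matrix is positive definite and is used
  # as-is for the latent normal draws.
  R <- chol(target)

  best <- NULL
  best_err <- Inf
  for (i in seq_len(n_tries)) {
    # Correlated standard-normal draws, mapped to probabilities in (0, 1)
    z <- matrix(rnorm(n * ncol(dat)), nrow = n) %*% R
    u <- as.data.frame(pnorm(z))

    # Read each probability off the matching variable's empirical quantiles
    sim <- as.data.frame(mapply(
      function(x, p) quantile(x, probs = p, names = FALSE),
      dat, u, SIMPLIFY = FALSE
    ))

    # Keep the attempt whose rank correlations are closest to the original's
    err <- mean(abs(cor(sim, method = "spearman") - target))
    if (err < best_err) {
      best <- sim
      best_err <- err
    }
  }
  best
}

set.seed(1)
sim_iris <- sketch_sim_quantiles(250, iris)
```

The loop over n_tries mirrors the role described for nTries: several candidate datasets are generated and only the one closest to the original rank correlations is kept.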

Sometimes it may be inappropriate to assume that rank correlations are homogeneous across an entire dataset. Instead, there may be subsets of the data that show different patterns than the full dataset (e.g., Simpson's paradox). This can be addressed in two ways: by specifying the groups explicitly (through the argument groups) or by empirically estimating groups with k-means clustering. The two can also be combined, with empirical clustering applied within each explicitly-defined group.

If minPerClust is less than the number of observations in datIn (or less than the size of an explicitly-defined group), then k-means clustering (R's default kmeans) is used iteratively to determine the maximum number of clusters for which the minimum cluster size is at least minPerClust. If at least two clusters satisfy this criterion, the simulation from quantiles and rank correlations is completed separately for each identified cluster, and the results are concatenated in the returned data frame. In general, the smaller the value of minPerClust, (1) the closer the simulated data will be to the original data (including undesirable noise or other idiosyncrasies), and (2) the longer the run time will be. Because correlations are unlikely to be reliable with small numbers of observations, setting minPerClust below 20 is not advisable.
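
The iterative search for the number of clusters might look something like the sketch below. It is an illustration under stated assumptions rather than the package's code: sketch_max_clusters, max_k and nstart = 10 are hypothetical choices, and the search simply stops at the first k whose smallest cluster falls below the threshold.

```r
# Hypothetical helper, not the package's implementation.
sketch_max_clusters <- function(dat, min_per_clust, max_k = 10) {
  x <- as.matrix(dat[, vapply(dat, is.numeric, logical(1)), drop = FALSE])
  best <- rep(1L, nrow(x))   # fall back to a single cluster

  for (k in 2:max_k) {
    # k clusters of the required minimum size cannot fit in the data
    if (k * min_per_clust > nrow(x)) break

    # Assumption: several random starts, to make the solution more stable
    fit <- kmeans(x, centers = k, nstart = 10)

    # Simplification: stop at the first k that violates the size criterion
    if (min(fit$size) < min_per_clust) break
    best <- fit$cluster
  }
  best
}

set.seed(1)
clusters <- sketch_max_clusters(iris, min_per_clust = 40)
table(clusters)   # cluster sizes, each at least 40
```

The resulting labels could then be used to split the data, for example with split(iris, clusters), so that each subset is simulated separately and the results are bound together.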

Requires the mvtnorm package.

See Also

sim_dat accomplishes something similar, although that function resamples directly from the original data while this one resamples from the quantiles of the original data.

Examples


# Note that non-numeric variables will be dropped
new_iris_1 <- sim_from_dat(250, iris)

# For the iris dataset, this will simulate from separate empirically-defined
# clusters to better reflect the clustered nature of the original data.
new_iris_2 <- sim_from_dat(250, iris, minPerClust = 40)

# Instead, we could simulate within groups defined by the iris Species:
new_iris_3 <- sim_from_dat(250, iris, groups = iris$Species)


akcochrane/ACmisc documentation built on Nov. 24, 2024, 11:22 a.m.