sim_from_dat | R Documentation |
Simulate new data from a dataset's quantiles
sim_from_dat(
nSim,
datIn,
minPerClust = 100,
nTries = 100,
groups = NA,
simplify = T
)
nSim |
Number of simulated observations |
datIn |
Data to simulate from |
minPerClust |
[Optional] Minimum number of observations per cluster. If the data should be simulated from empirically-defined subsets, those subsets can be identified through k-means clustering. See Details. |
nTries |
Number of simulated datasets to generate. Only one (i.e., the one with the rank correlation closest to the original dataset's) will be returned. |
groups |
If desired, a vector can be supplied to split the data and simulate separately given this grouping variable. This may lead to slightly different eventual numbers of simulated values due to rounding errors in the proportions of the total data made up by each group. |
simplify |
If |
Given an existing dataset's rank correlation structure and the quantiles of the variables in that dataset, a new dataset is simulated from those quantiles while attempting to match the rank correlation of the original.
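The idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the package's implementation: it assumes a Gaussian copula drawn with mvtnorm::rmvnorm, and it assumes that "closest" rank correlation means the smallest mean absolute difference between Spearman correlation matrices. The helper name sketch_sim and its n_tries argument are hypothetical.

```r
library(mvtnorm)  # provides rmvnorm()

# Hypothetical helper, not part of the package.
sketch_sim <- function(n, dat, n_tries = 20) {
  dat <- dat[, sapply(dat, is.numeric), drop = FALSE]  # keep numeric columns only
  rho <- cor(dat, method = "spearman")                 # rank correlation to aim for

  one_draw <- function() {
    u <- pnorm(rmvnorm(n, sigma = rho))                # correlated uniforms on (0, 1)
    out <- as.data.frame(lapply(seq_along(dat), function(j) {
      # read simulated values off the empirical quantiles of column j
      quantile(dat[[j]], probs = u[, j], names = FALSE)
    }))
    names(out) <- names(dat)
    out
  }

  tries <- replicate(n_tries, one_draw(), simplify = FALSE)
  # assumed distance: mean absolute difference between rank correlation matrices
  dists <- sapply(tries, function(d) mean(abs(cor(d, method = "spearman") - rho)))
  tries[[which.min(dists)]]                            # keep only the closest candidate
}

set.seed(1)
sim <- sketch_sim(250, iris)
```

Because values are read from the empirical quantiles, every simulated column stays within the observed range of the corresponding original column.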
Sometimes it may be inappropriate to assume that rank correlations are
homogeneous across an entire dataset; subsets of the data may show different
patterns than the full dataset (e.g., Simpson's paradox).
This can be addressed in two ways: by specifying the groups explicitly (through
the groups argument) or by empirically estimating groups with k-means clustering
(or by combining both, with empirical clustering applied to each
explicitly-defined group).
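When groups are supplied, the number of simulated rows can differ slightly from nSim. A minimal illustration, assuming each group's share of nSim is obtained by rounding its proportion of the original rows (the function's actual allocation rule may differ):

```r
n_sim  <- 250
props  <- table(iris$Species) / nrow(iris)  # each species is 1/3 of the rows
n_each <- round(n_sim * props)              # 83 rows per species
sum(n_each)                                 # 249, one short of n_sim
```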
If minPerClust is less than the number of observations in datIn
(or less than the size of an explicitly-defined group),
then k-means clustering (default R kmeans) is used iteratively to
determine the maximum number of clusters for which the minimum cluster size
is at least minPerClust. If there are at least two clusters satisfying
this criterion, then the simulation from quantiles and rank correlations is
completed for each identified cluster separately, and the results are concatenated in the
returned data frame. In general, the smaller the value of minPerClust,
(1) the closer the simulated data will be to the original data (including
undesirable noise or other idiosyncrasies), and (2) the longer the run time
will be. Because correlations are unlikely to be reliable with small numbers
of observations, it would be unusual to set minPerClust below 20.
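The cluster-count search described above might look roughly like the following. This is a sketch under assumptions, not the function's actual code: the helper name sketch_max_k, the k_max cap, the nstart value, and stopping at the first k that fails the size criterion are all hypothetical choices.

```r
# Hypothetical helper: largest k whose smallest k-means cluster
# still contains at least min_per_clust observations.
sketch_max_k <- function(dat, min_per_clust, k_max = 10) {
  dat  <- dat[, sapply(dat, is.numeric), drop = FALSE]  # kmeans() needs numeric input
  best <- 1                                             # 1 means "do not split"
  for (k in 2:k_max) {
    cl <- kmeans(dat, centers = k, nstart = 10)
    if (min(cl$size) < min_per_clust) break             # assumed stopping rule
    best <- k
  }
  best
}

set.seed(1)
k <- sketch_max_k(iris, min_per_clust = 40)
```

A result of 1 corresponds to the case where fewer than two clusters satisfy the criterion, so the data would be simulated as a single block.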
Requires the mvtnorm package.
sim_dat accomplishes something similar, although that function resamples directly from the
original data while this one resamples from the quantiles of the original data.
# Note that non-numeric variables will be dropped.
new_iris_1 <- sim_from_dat(250, iris)
# For the iris dataset, this will simulate from separate empirically-defined
# clusters to better reflect the clustered nature of the original data.
new_iris_2 <- sim_from_dat(250, iris, minPerClust = 40)
# Alternatively, simulate separately within each iris Species:
new_iris_3 <- sim_from_dat(250, iris, groups = iris$Species)