sim_dat | R Documentation |
Simulates a new dataset, conforming to a reference dataset as closely as possible
Because the new dataset is built by resampling the existing data, only values that are actually possible (those present in the reference data) can appear in the simulation. This function attempts to minimize the error in correlational structure, median, and MAD.
sim_dat(nsim, x, cor_method = "spearman", nTries = 200)
nsim |
Number of cases to simulate |
x |
Input data frame or matrix. Must be numeric or logical |
cor_method |
Minimize error in 'spearman' or 'pearson' correlations? |
nTries |
Number of resamples to propose iteratively; the one that best matches the original is retained. |
The error that is minimized is defined as:
err <-
sum((orig_correl - cur_correl)^2)*mean(orig_mad) +
sum((orig_med - cur_med)^2)+
sum((orig_mad - cur_mad)^2)
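Where orig_correl and cur_correl are the original and current correlation matrices. The med and mad terms are the corresponding per-column medians and MADs (median absolute deviations).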
This is a somewhat arbitrary loss function, but it should do OK for matching new and
old datasets.
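The loss and the best-of-nTries search can be sketched in a few lines of R. This is only an illustration, not the package's actual implementation: sim_loss and best_resample are hypothetical names, and drawing candidates by resampling whole rows with replacement is an assumption about how the resampling works.

```r
# Hypothetical helper: the loss from the formula above, for two numeric
# data frames (or matrices) with the same columns.
sim_loss <- function(orig, cur, cor_method = "spearman") {
  orig_correl <- cor(orig, method = cor_method)
  cur_correl  <- cor(cur,  method = cor_method)
  orig_med <- apply(orig, 2, median)
  cur_med  <- apply(cur,  2, median)
  orig_mad <- apply(orig, 2, mad)
  cur_mad  <- apply(cur,  2, mad)
  sum((orig_correl - cur_correl)^2) * mean(orig_mad) +
    sum((orig_med - cur_med)^2) +
    sum((orig_mad - cur_mad)^2)
}

# Hypothetical best-of-nTries search.
# Assumption: each candidate is drawn by resampling whole rows with replacement.
best_resample <- function(nsim, x, cor_method = "spearman", nTries = 200) {
  best <- NULL
  best_err <- Inf
  for (i in seq_len(nTries)) {
    cand <- x[sample(nrow(x), nsim, replace = TRUE), , drop = FALSE]
    err <- sim_loss(x, cand, cor_method)
    # Skip candidates whose loss is undefined (e.g. a constant column)
    if (!is.na(err) && err < best_err) {
      best <- cand
      best_err <- err
    }
  }
  if (!is.null(best)) rownames(best) <- NULL
  list(data = best, err = best_err)
}
```

Calling best_resample(20, df_raw, nTries = 50)$err with a larger nTries can only lower or keep the retained error for a given sequence of candidates, at the cost of more computation.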
Next step: take a groupingVar argument and simulate from the subset of the data for each unique value of groupingVar, then concatenate the simulated subsets (while generating random identifiers for each group?). An anonymize=T argument should probably be added as well.
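One way the proposed grouping behaviour might look is sketched below. Nothing here exists in the package: sim_by_group, its anonymize handling, and the simulate_fn placeholder (which only resamples rows and stands in for sim_dat) are all assumptions made for illustration, as is the choice to simulate as many cases per group as the original group contains.

```r
# Hypothetical sketch of the proposed groupingVar / anonymize behaviour.
# x is assumed to be a data frame; groupingVar is the name of one column.
sim_by_group <- function(x, groupingVar, anonymize = TRUE,
                         simulate_fn = function(n, d) {
                           # Placeholder for sim_dat(n, d): plain row resampling
                           d[sample(nrow(d), n, replace = TRUE), , drop = FALSE]
                         }) {
  # Split the non-grouping columns by each unique value of groupingVar
  value_cols <- setdiff(names(x), groupingVar)
  groups <- split(x[, value_cols, drop = FALSE], x[[groupingVar]], drop = TRUE)

  # Keep the original labels, or replace them with randomly assigned ones
  ids <- names(groups)
  if (anonymize) ids <- paste0("group_", sample(length(groups)))

  # Simulate within each group, label the result, then concatenate
  pieces <- Map(function(d, id) {
    sim <- simulate_fn(nrow(d), d)
    sim[[groupingVar]] <- id
    sim
  }, groups, ids)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}
```

Passing simulate_fn = sim_dat would simulate each subset with the real function, assuming its first two arguments remain nsim and x.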
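# Reference data: 20 cases; z depends on x and y and is rounded to the nearest 0.25
df_raw <- data.frame(x = rnorm(20), y = rep(c(1, 2, 3, 4), 5))
df_raw$z <- round((df_raw$x + df_raw$y + rnorm(20, -3, 3)) * 4) / 4
# Simulate 20 new cases from the reference data
df_sim <- sim_dat(20, df_raw)
# Absolute difference between the original and simulated correlation matrices
abs(cor(df_raw) - cor(df_sim))
# Compare column means and standard deviations
apply(df_raw, 2, mean); apply(df_sim, 2, mean)
apply(df_raw, 2, sd); apply(df_sim, 2, sd)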