sim_dat: Simulate new data from a numeric and/or logical data frame or...

View source: R/df_sim.R

sim_dat    R Documentation

Simulate new data from a numeric and/or logical data frame or matrix

Description

Simulates a new dataset that conforms to a reference dataset as closely as possible. Because the new data are resampled from the existing data, only values that occur in the reference dataset can appear in the simulated one. The function attempts to minimize error in the correlational structure, the medians, and the MADs (median absolute deviations).

Usage

sim_dat(nsim, x, cor_method = "spearman", nTries = 200)

Arguments

nsim

Number of cases to simulate

x

Input data frame or matrix. All columns must be numeric or logical.

cor_method

Type of correlation in which to minimize error: 'spearman' (the default) or 'pearson'.

nTries

Number of candidate resamples to propose iteratively. The candidate that best matches the original data is retained.

Details

The error being minimized is defined as:

err <- sum((orig_correl - cur_correl)^2) * mean(orig_mad) + sum((orig_med - cur_med)^2) + sum((orig_mad - cur_mad)^2)

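Here orig_correl and cur_correl are the correlation matrices of the original and current (proposed) datasets, and orig_med, cur_med, orig_mad, and cur_mad are the corresponding medians and MADs. This is a somewhat arbitrary loss function, but it should do OK for matching new and old datasets.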
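To make the loss concrete, here is a minimal sketch in base R. It is not the package's source. The helper names sim_err and sim_best_of are hypothetical. It assumes medians and MADs are computed per column, that MAD is stats::mad() with its default scaling constant, and that each proposal is an independent resample of whole rows with replacement. The real sim_dat may propose candidates differently.

```r
# Hypothetical helper: loss between a reference dataset and a candidate,
# following the formula given in Details (per-column medians and MADs assumed).
sim_err <- function(orig, cur, cor_method = "spearman") {
  orig <- as.matrix(orig); storage.mode(orig) <- "double"
  cur  <- as.matrix(cur);  storage.mode(cur)  <- "double"

  orig_correl <- cor(orig, method = cor_method)
  cur_correl  <- cor(cur,  method = cor_method)
  orig_med <- apply(orig, 2, median)
  cur_med  <- apply(cur,  2, median)
  orig_mad <- apply(orig, 2, mad)   # assumption: stats::mad() defaults
  cur_mad  <- apply(cur,  2, mad)

  sum((orig_correl - cur_correl)^2) * mean(orig_mad) +
    sum((orig_med - cur_med)^2) +
    sum((orig_mad - cur_mad)^2)
}

# Hypothetical search: propose nTries row resamples and keep the one with
# the lowest loss. Row resampling is an assumption made for this sketch.
sim_best_of <- function(nsim, x, cor_method = "spearman", nTries = 200) {
  best <- NULL
  best_err <- Inf
  for (i in seq_len(nTries)) {
    idx  <- sample(nrow(x), nsim, replace = TRUE)
    cand <- x[idx, , drop = FALSE]
    # A constant column makes cor() return NA; such candidates are skipped.
    err <- suppressWarnings(sim_err(x, cand, cor_method))
    if (!is.na(err) && err < best_err) {
      best <- cand
      best_err <- err
    }
  }
  if (!is.null(best)) rownames(best) <- NULL
  list(data = best, err = best_err)
}

# Usage: the loss of a dataset against itself is zero.
df_demo <- data.frame(x = rnorm(20), y = rep(c(1, 2, 3, 4), 5))
sim_err(df_demo, df_demo)
res_demo <- sim_best_of(20, df_demo, nTries = 50)
res_demo$err
```

The correlation term is multiplied by mean(orig_mad), which puts it on a scale comparable to the squared median and MAD differences. Without that factor, variables with a large spread would let the location and scale terms dominate the loss.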

Next step: accept a groupingVar argument, simulate separately from the subset of the data for each unique value of groupingVar, and then concatenate the simulated subsets (while generating random identifiers for each group? There should probably be an anonymize=T argument as well).
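One way the planned grouping step could look is sketched below. It is purely illustrative and not part of the package. The wrapper sim_by_group and its sim_fun argument are hypothetical: sim_fun stands in for any simulator called as sim_fun(nsim, x), such as sim_dat. The "g001"-style identifiers are just one possible way to anonymize groups.

```r
# Hypothetical wrapper: simulate within each level of `groupingVar`, then
# stack the results. `sim_fun` stands in for a simulator such as sim_dat.
sim_by_group <- function(x, groupingVar, sim_fun, anonymize = TRUE) {
  value_cols <- setdiff(names(x), groupingVar)
  groups <- split(x[, value_cols, drop = FALSE], x[[groupingVar]], drop = TRUE)

  ids <- names(groups)
  if (anonymize) {
    # Replace the original labels with randomly assigned identifiers.
    ids <- sprintf("g%03d", sample(length(groups)))
  }

  sims <- Map(function(g, id) {
    out <- as.data.frame(sim_fun(nrow(g), g))  # same size as the original group
    out[[groupingVar]] <- id
    out
  }, groups, ids)

  out <- do.call(rbind, sims)
  rownames(out) <- NULL
  out
}

# Usage with a stand-in simulator (a plain row bootstrap); with ACmisc
# loaded, sim_dat could be passed as sim_fun instead.
boot_rows <- function(n, g) g[sample(nrow(g), n, replace = TRUE), , drop = FALSE]
d_demo <- data.frame(a = rnorm(12), b = rnorm(12),
                     grp = rep(c("u", "v", "w"), each = 4))
head(sim_by_group(d_demo, "grp", boot_rows))
```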

Examples

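# Reference data: a continuous variable, a 4-level variable, and a noisy
# combination of the two, rounded to the nearest 0.25
df_raw <- data.frame(x = rnorm(20), y = rep(c(1, 2, 3, 4), 5))
df_raw$z <- round((df_raw$x + df_raw$y + rnorm(20, -3, 3)) * 4) / 4

# Simulate 20 new cases from the reference data
df_sim <- sim_dat(20, df_raw)

# Compare the original and simulated data
abs(cor(df_raw) - cor(df_sim))                   # absolute differences in correlations
apply(df_raw, 2, mean) ; apply(df_sim, 2, mean)  # column means
apply(df_raw, 2, sd) ; apply(df_sim, 2, sd)      # column standard deviations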


akcochrane/ACmisc documentation built on Nov. 24, 2024, 11:22 a.m.