knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%" ) ggplot2::theme_set(ggplot2::theme_bw()) set.seed(8675309)
library(ggplot2) library(dplyr) library(tidyr) library(faux)
The sim_df()
function produces a data table with the same distributions and correlations as an existing data table. It simulates all numeric variables from a continuous normal distribution (for now).
For example, here is the relationship between speed and distance in the built-in dataset cars
.
cars %>% ggplot(aes(speed, dist)) + geom_point() + geom_smooth(method = "lm", formula = "y~x")
You can create a new sample with the same parameters and 500 rows with the code sim_df(cars, 500)
.
sim_df(cars, 500) %>% ggplot(aes(speed, dist)) + geom_point() + geom_smooth(method = "lm", formula = "y~x")
You can also optionally add between-subject variables. For example, here is the relationship between horsepower (hp
) and weight (wt
) for automatic (am = 0
) versus manual (am = 1
) transmission in the built-in dataset mtcars
.
mtcars %>% mutate(transmission = factor(am, labels = c("automatic", "manual"))) %>% ggplot(aes(hp, wt, color = transmission)) + geom_point() + geom_smooth(method = "lm", formula = "y~x")
And here is a new sample with 50 observations of each.
sim_df(mtcars, 50 , between = "am") %>% mutate(transmission = factor(am, labels = c("automatic", "manual"))) %>% ggplot(aes(hp, wt, color = transmission)) + geom_point() + geom_smooth(method = "lm", formula = "y~x")
Set empirical = TRUE
to return a data frame with exactly the same means, SDs, and correlations as the original dataset.
exact_mtcars <- sim_df(mtcars, 50, between = "am", empirical = TRUE)
For now, the function only creates new variables sampled from a continuous normal distribution. I hope to add in other sampling distributions in the future. So you'd need to do any rounding or truncating yourself.
sim_df(mtcars, 50, between = "am") %>% mutate(hp = round(hp), transmission = factor(am, labels = c("automatic", "manual"))) %>% ggplot(aes(hp, wt, color = transmission)) + geom_point() + geom_smooth(method = "lm", formula = "y~x")
As of faux 0.0.1.8, if you want to simulate missing data, set missing = TRUE
and sim_df
will simulate missing data with the same joint probabilities as your data. In the dataset below, in condition B1a, 30% of W1a values are missing and 60% of W1b values are missing. This is correlated so that there is a 100% chance that W1b is missing if W1a is. There is no missing data for condition B1b.
data <- sim_design(2, 2, n = 10, plot = FALSE) data$W1a[1:3] <- NA data$W1b[1:6] <- NA data
The simulated data will have the same pattern of missingness (sampled from the joint distribution, so it won't be exact).
simdat <- sim_df(data, between = "B1", n = 1000, missing = TRUE)
simdat %>% mutate(W1a = ifelse(is.na(W1a), "NA", "not NA"), W1b = ifelse(is.na(W1b), "NA", "not NA")) %>% count(B1, W1a, W1b) %>% group_by(B1) %>% mutate(n = round(n/sum(n), 2)) %>% knitr::kable()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.