knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(simpr)
library(dplyr)  # filter() is applied to simulation output below

simpr is designed with reproducibility in mind. If you set the same seed before running the same simulation, you get the same results.

set.seed(500)
run_1 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_1
set.seed(500)
run_2 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_2
identical(run_1, run_2)
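
The same behavior can be seen in base R, without simpr: resetting the seed resets the random number generator, so the same draws come back. A minimal sketch (the names draw_1, draw_2 and draw_3 are only illustrative):

```r
# Resetting the seed replays the same stream of random numbers.
set.seed(500)
draw_1 <- runif(6)

set.seed(500)
draw_2 <- runif(6)

identical(draw_1, draw_2)  # TRUE: same seed, same draws

# Without resetting the seed, the generator keeps advancing,
# so a further call returns different numbers.
draw_3 <- runif(6)
identical(draw_1, draw_3)
```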

What's more, generate() can take filtering criteria, so you can regenerate specific repetitions or conditions without recreating the entire simulation. This requires that the seed, specification, definition, and number of reps are identical to those of the simulation you are trying to reproduce.

set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>% 
  generate(3) %>% 
  filter(.sim_id == 2)

filter_after_generating
## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>% 
  generate(3, .sim_id == 2)

filter_while_generating
identical(filter_after_generating, filter_while_generating)

Although only one repetition was generated above, its data are identical to the data for that repetition in the full simulation.
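
One way selective regeneration of this kind could work is to derive a separate seed for each repetition from the master seed, so that any single repetition can be rebuilt on its own. The following base-R sketch illustrates that general idea only; it is an assumption for illustration, not a description of simpr's internals, and rep_seeds() and generate_reps() are hypothetical helpers.

```r
# Hypothetical helper: derive one seed per repetition from a master seed.
rep_seeds <- function(master_seed, n_reps) {
  set.seed(master_seed)
  sample.int(.Machine$integer.max, n_reps)
}

# Hypothetical helper: generate only the repetitions listed in `keep`.
generate_reps <- function(master_seed, n_reps, keep = seq_len(n_reps)) {
  seeds <- rep_seeds(master_seed, n_reps)
  lapply(keep, function(i) {
    set.seed(seeds[i])  # each repetition starts from its own seed
    runif(6)
  })
}

full   <- generate_reps(500, n_reps = 3)            # all three repetitions
only_2 <- generate_reps(500, n_reps = 3, keep = 2)  # just repetition 2

identical(full[[2]], only_2[[1]])  # TRUE: repetition 2 is reproduced exactly
```

Because each repetition depends only on its own seed, skipping the other repetitions does not change the numbers that repetition 2 receives.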

A common use case is regenerating the data behind an error. Here's an example of a simulation that produces errors in only one condition. We generate some data and fit a logistic regression, but some of the fits fail with errors.

set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3) %>% 
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>% 
  tidy_fits()

fit_tidy

One option is to filter directly to the problematic max == 10 condition and examine the generated data.

set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)

filter_max_10

Looking at the raw generated data, we can see that the outcome variable a is often larger than 1, which makes no sense for a logistic regression, whose outcome must lie between 0 and 1.
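
The failure can be reproduced outside simpr with a few lines of base R. This is a minimal sketch with a made-up outcome vector, not the data generated above; it only shows that glm() with a binomial family rejects outcome values above 1.

```r
set.seed(500)

# Made-up outcome with values above 1, similar in spirit to max == 10.
a <- c(0, 3, 7, 10, 2, 5, 1, 0, 9, 4)
b <- a + rnorm(10)

# Capture the error instead of stopping the script.
result <- tryCatch(
  glm(a ~ b, family = "binomial"),
  error = function(e) e
)

inherits(result, "error")  # TRUE: the fit fails
conditionMessage(result)   # the message explains the problem with the outcome
```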

More generally, we can filter down to only the values of .sim_id that produced errors and examine those:

fit_errors = filter(fit_tidy, !is.na(.fit_error))

set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                     b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, .sim_id %in% fit_errors$.sim_id)

fit_error_data

This approach is useful when we don't know which conditions are producing the errors. Sometimes simulation errors are not systematic: they stem from numerical issues caused by unlucky draws from the data-generating mechanism.
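
As an illustration of non-systematic errors, here is a base-R sketch in which a fit fails only when a small sample happens to be constant. It uses t.test() as a stand-in model rather than the logistic regression above, and the names draws, fit_error and failed_reps are hypothetical; the pattern of recording an error message per repetition and then selecting the failing repetitions mirrors the filter on .fit_error.

```r
set.seed(500)
n_reps <- 20

# Each repetition draws three values from {1, 2}; some draws will be constant.
draws <- lapply(seq_len(n_reps), function(i) {
  sample(1:2, size = 3, replace = TRUE)
})

# Record the error message for each repetition, or NA if the fit succeeded.
fit_error <- vapply(draws, function(x) {
  tryCatch({
    t.test(x)
    NA_character_
  }, error = function(e) conditionMessage(e))
}, character(1))

# Repetitions whose fit failed, analogous to filtering on !is.na(.fit_error).
failed_reps <- which(!is.na(fit_error))

# The failures coincide with the unlucky, constant draws.
constant_reps <- which(vapply(draws, function(x) length(unique(x)) == 1, logical(1)))
identical(failed_reps, constant_reps)

draws[failed_reps]  # inspect only the data behind the errors
```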



statisfactions/simpr documentation built on July 18, 2024, 6:44 a.m.