In statisfactions/simpr: Flexible 'Tidyverse'-Friendly Simulations

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(simpr)
library(dplyr)  # for filter(), used on simulation results below

simpr is designed with reproducibility in mind. If you set the same seed, you get the same results.

set.seed(500)
run_1 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_1

set.seed(500)
run_2 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_2

identical(run_1, run_2)

What's more, generate() can take filtering criteria, so you can regenerate specific repetitions or conditions without recreating the entire simulation. This requires that the seed, specification, definition, and number of reps are identical to those of the simulation you are trying to reproduce.

set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>% 
  generate(3) %>% 
  filter(.sim_id == 2)

filter_after_generating

## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>% 
  generate(3, .sim_id == 2)

filter_while_generating

identical(filter_after_generating, filter_while_generating)

Although only one repetition was generated above, its data are identical to the data for that repetition in the full simulation.
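One way to check this claim directly is to compare the generated data for .sim_id 2 from a full run against the single regenerated repetition. This is only a sketch: it assumes the generated datasets are stored in a list-column named sim, which is not stated above, and the names full_run, only_rep_2, full_data and regen_data are made up for illustration.

```r
library(simpr)

# Full simulation: all three repetitions
set.seed(500)
full_run = specify(a = ~ runif(6)) %>%
  generate(3)

# Same seed, specification and number of reps, but only .sim_id 2 is generated
set.seed(500)
only_rep_2 = specify(a = ~ runif(6)) %>%
  generate(3, .sim_id == 2)

# Assumption: the generated data live in a list-column called `sim`
full_data = full_run[["sim"]][[which(full_run[[".sim_id"]] == 2)]]
regen_data = only_rep_2[["sim"]][[1]]

all.equal(full_data, regen_data)
```

Comparing the datasets themselves, rather than the whole result objects, keeps the check focused on the simulated values.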

A common use case is regenerating the data for cases in which an error occurred. Here's an example of a simulation that generates errors in only one condition. We generate some data and fit a logistic regression, but notice that some of the fits return errors.

set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3) %>% 
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>% 
  tidy_fits()

fit_tidy

One option for regenerating is to filter directly to the problematic max == 10 condition to examine the generated data.

set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)

filter_max_10

Looking at the raw generated data, we can see our outcome variable is often larger than 1, which makes no sense for a logistic regression.
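Rather than reading through each dataset by eye, the same point can be confirmed with a quick summary. The sketch below regenerates the max == 10 condition and reports the largest value of a in each dataset. It assumes the generated data are stored in a list-column named sim; largest_a is a name introduced here for illustration.

```r
library(simpr)

set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10)) %>%
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)

# Assumption: each element of the `sim` list-column is one generated dataset.
# Find the largest outcome value in each regenerated dataset.
largest_a = sapply(filter_max_10[["sim"]], function(d) max(d$a))

largest_a
```

Any value above 1 here means the outcome is not a valid 0/1 response for a binomial glm().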

In general, we could also filter down to only values of .sim_id which generated errors to examine those:

fit_errors = filter(fit_tidy, !is.na(.fit_error))

set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                     b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, .sim_id %in% fit_errors$.sim_id)

fit_error_data

This approach is useful when we don't know which conditions are producing the errors. Sometimes simulation errors are not systematic but stem from numerical issues caused by unlucky draws from the data-generating mechanism.
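To judge whether errors are systematic or sporadic, it can help to count how many failed fits fall in each condition. The following is a minimal sketch using dplyr, which is an addition not used in the original chunks. It assumes the tidied output keeps the max column alongside .sim_id and .fit_error; errors_by_condition is a hypothetical name.

```r
library(simpr)
library(dplyr)

set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10)) %>%
  define(max = c(0, 1, 10)) %>%
  generate(3) %>%
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>%
  tidy_fits()

# Keep only failed fits, reduce to one row per simulation,
# then count the failures in each condition
errors_by_condition = fit_tidy %>%
  filter(!is.na(.fit_error)) %>%
  distinct(.sim_id, max) %>%
  count(max)

errors_by_condition
```

If every failure falls in a single condition, the problem is likely in the specification for that condition; failures scattered across conditions point toward unlucky draws.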

statisfactions/simpr documentation built on July 18, 2024, 6:44 a.m.
