Given the large amount of computation that simulation studies require, one of the main goals of `simChef` is to make it easy to parallelize your simulations. `simChef` uses the R package `future` to distribute simulation replicates across whatever available resources the user specifies. All you have to do to start running your simulations in parallel is set the `future` plan before calling `run_experiment()`:
```r
library(future)

n_workers <- availableCores() - 1
plan(multisession, workers = n_workers)
```
The `multisession` plan used here will run your simulation experiments on a local (i.e., where R is running) Linux, macOS, or Windows machine, in this case using all but one of the available cores. This is very convenient, but it's important to carefully consider two aspects of the distributed computation in order to effectively parallelize the simulations: what plan to use and how to use it.
While `simChef` works with any valid `future` plan, one may be better than another for your particular set of experiments. We recommend you carefully read the `future` documentation to learn more about the default plans, as well as alternative plans in packages like `future.callr` and `future.batchtools`.
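For instance, alternative backends can be selected in the same way as the built-in plans. The snippet below is a minimal sketch, assuming `future.callr` and `future.batchtools` are installed; the Slurm template file name is a placeholder, not something shipped with either package:

```r
library(future)

# Run each future in a fresh background R session via callr:
plan(future.callr::callr, workers = 4)

# Or, on an HPC cluster, submit work through a scheduler such as Slurm
# ("slurm.tmpl" is a hypothetical batchtools template file):
# plan(future.batchtools::batchtools_slurm, template = "slurm.tmpl")
```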
When a `future` plan has been set and the user calls `run_experiment()`, `simChef` will distribute computation across the resources specified in the plan. Consider n computational "tasks" to be distributed across p parallel workers. In `simChef`, tasks correspond to simulation replicates, each of which generates data from a single `DGP` and fits that data using a single `Method`, along with associated parameters (either defaults or those that have been varied in the `Experiment`).
Assuming each task takes approximately the same amount of time to complete regardless of the worker assigned to it, then with n = 100 and p = 4, each worker should complete around 25 of the tasks. In this ideal setting, the total time to complete the 100 tasks should be around 4 times lower than the time it would take a single worker to complete them all.
In more realistic scenarios, and especially in simulation experiments, which often include heterogeneous methods compared under diverse data-generating processes for a range of sample sizes, tasks can be much less uniform. Different groupings of tasks can have profound implications for the overall running time. Therefore, it's important to carefully decide how to arrange your simulation into separate experiments in order to take greatest advantage of the available parallelism.
`simChef` distributes the simulation's replicates evenly across the available `future` workers, which partially answers the question of how to use the plan. The remainder of the answer depends on your specific application, but here are a few tips:
- Avoid many (n >> p) very small tasks, as the overhead of distributing computation to workers may outweigh the benefits of parallelism.
- When task sizes vary widely, grouping similarly-sized tasks and creating a separate `Experiment` for each task group ends up decreasing the overall simulation running time, because workers with small tasks spend less time idly waiting for workers with large tasks to finish. Using the `clone_from` argument in `create_experiment()`, you can copy an existing experiment and modify it so that tasks have similar sizes, repeating this process for each group of similarly-sized tasks (see the sketch after this list).
- Use the `progressr` package to get updates as the experiment computation progresses.
- Set `options(simChef.debug = TRUE)` to get helpful debugging output as an `Experiment` works on its tasks, including info on memory usage. This may slow things down quite a bit, so don't use it when you run the full simulation.
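To make the second and third tips concrete, here is a minimal sketch. It assumes the `experiment` object, the `"dense_dgp"`/`"sparse_dgp"` names, and a loaded pipe operator from the full example later in this vignette; the cloned experiment names, the split by DGP, and the replicate count are illustrative assumptions:

```r
# Optional: report progress as replicates complete.
progressr::handlers(global = TRUE)

# Suppose "sparse_dgp" yields much larger tasks than "dense_dgp".
# Clone the experiment twice and keep one DGP in each copy, so that
# each experiment contains similarly-sized tasks.
dense_exper <- create_experiment(
  name = "exper-dense", clone_from = experiment
) %>%
  remove_dgp("sparse_dgp")

sparse_exper <- create_experiment(
  name = "exper-sparse", clone_from = experiment
) %>%
  remove_dgp("dense_dgp")

dense_results <- run_experiment(dense_exper, n_reps = 100)
sparse_results <- run_experiment(sparse_exper, n_reps = 100)
```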
In the future, we plan to give the user more control over how the computation is split across workers, with nested parallelism for cases where, e.g., `DGP`s can be split across a few nodes (e.g., using one of the plans in the package `future.batchtools`) and each node uses many cores to process the replicates in parallel (e.g., using the `future::multicore` plan).
If this is something you're interested in, please feel free to contribute to the discussion at https://github.com/Yu-Group/simChef/issues/54.
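For reference, `future` itself already supports declaring nested topologies via `plan(list(...))`. The following sketch shows only the `future` API, not current `simChef` behavior, and the Slurm template file name is a placeholder:

```r
library(future)
library(future.batchtools)

# Outer layer: one scheduler job per outer future (e.g., per DGP);
# inner layer: forked R processes on each node's cores.
plan(list(
  tweak(batchtools_slurm, template = "slurm.tmpl"),
  multicore
))
```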
Putting aside the caveats above for now, parallelization in `simChef` works without modification other than using `future` to set a parallel backend. In the example below, we choose the `multicore` backend (not available on Windows) to create forked R processes using all of the available cores.
This example shows how total replicates can quickly add up when varying across `DGP` or `Method` parameters. By varying across parameters of one of the DGPs, we in effect have 17 distinct data-generating processes in the experiment (1 for `dgp1` and 16 for the combinations of parameters passed to `dgp2`), though in actuality there are only two `DGP` objects. Similarly, we effectively have 4 distinct methods, though there are only 2 `Method` objects. With `n_reps = 2`, this results in a total of 2 x 17 x 4 = 136 rows in the results `tibble`.
```r
library(simChef)
library(future)
library(dplyr)

n_cores <- availableCores(methods = "system")
n_cores
plan(multicore, workers = n_cores)

dgp_fun1 <- function(n = 100, rho = 0.5, noise_level = 1) {
  cov_mat <- diag(nrow = 5)
  cov_mat[cov_mat == 0] <- rho
  X <- MASS::mvrnorm(n = n, mu = rep(0, 5), Sigma = cov_mat)
  y <- cbind(1, X) %*% c(-8, 3, -1, 0, 0, 0) + rnorm(n, sd = noise_level)
  return(list(X = X, y = y))
}

dgp_fun2 <- function(n = 100, d = 100, rho = 0.5, sparsity = 0.5,
                     noise_level = 1, nonzero_coeff = c(-3, -1, 1, 3)) {
  cov_mat <- diag(nrow = d)
  cov_mat[cov_mat == 0] <- rho
  X <- MASS::mvrnorm(n = n, mu = rep(0, d), Sigma = cov_mat)
  coeff_prob <- c(sparsity, rep((1 - sparsity) / 4, times = 4))
  coeff <- c(
    -8,  # intercept
    sample(
      c(0, nonzero_coeff), size = d, replace = TRUE, prob = coeff_prob
    )
  )
  y <- cbind(1, X) %*% coeff + rnorm(n, sd = noise_level)
  return(list(X = X, y = y))
}

dgp1 <- create_dgp(dgp_fun1, .name = "dense_dgp")
dgp2 <- create_dgp(dgp_fun2, .name = "sparse_dgp")

ols <- function(X, y) {
  fit <- lm(y ~ X) %>% broom::tidy()
  return(fit)
}

elnet <- function(X, y, alpha = 1) {
  fit <- glmnet::glmnet(
    x = X, y = y, family = "gaussian", alpha = alpha
  ) %>%
    broom::tidy()
  return(fit)
}

method1 <- create_method(ols, .name = "ols")
method2 <- create_method(elnet, .name = "elnet")

experiment <- create_experiment(
  name = "exper", future.packages = "dplyr"
) %>%
  add_dgp(dgp1) %>%
  add_dgp(dgp2) %>%
  add_method(method1) %>%
  add_method(method2) %>%
  add_vary_across(
    .dgp = "sparse_dgp",
    d = c(100, 1000),
    rho = c(0.2, 0.9),
    sparsity = c(0.5, 0.9),
    nonzero_coeff = list(c(-3, -1, 1, 3), c(-0.3, -0.1, 0.1, 0.3))
  ) %>%
  add_vary_across(
    .method = "elnet",
    alpha = c(0, 0.5, 1)
  )

results <- experiment$fit(n_reps = 2)
results
```
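When the experiment finishes, you can return to sequential evaluation with the standard `future` mechanism:

```r
# Revert to the default sequential backend.
plan(sequential)
```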
If, as the simulation grows, we find lower computational resource utilization than we'd like, we might consider breaking this experiment up into separate experiments, e.g., by DGP, method, or parameters like the sample size n and the number of covariates d, depending on which factors have the greatest impact on task duration.