knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Given the large amount of computation that simulation studies require, one of the main goals of simChef is to make it easy to parallelize your simulations. simChef uses the R package future to distribute simulation replicates across whatever available resources the user specifies. All you have to do to start running your simulations in parallel is set the future plan before calling run_experiment():

library(future)

n_workers <- availableCores() - 1
plan(multisession, workers = n_workers)

The multisession plan used here runs your simulation experiments on the local machine (i.e., the machine where R is running), whether Linux, macOS, or Windows, in this case using all but one of the available cores. This is very convenient, but to parallelize effectively it's important to carefully consider two aspects of the distributed computation: which plan to use and how to use it.

While simChef works with any valid future plan, one may be better than another for your particular set of experiments. We recommend you carefully read the future docs to learn more about the default plans, as well as alternative plans in packages like future.callr and future.batchtools.
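
For instance, switching plans is just a matter of changing the plan() call. Here is a sketch (the worker count and the SLURM setup are illustrative assumptions about your resources, not simChef requirements):

# run each future in a fresh background R session via callr
plan(future.callr::callr, workers = 4)

# submit each future as a job to an HPC scheduler such as SLURM
# (requires a batchtools SLURM template configured for your cluster)
plan(future.batchtools::batchtools_slurm)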

Simulation tasks

When a future plan has been set and the user calls run_experiment(), simChef distributes computation across the resources specified in the plan. Consider n computational "tasks" to be distributed across p parallel workers. In simChef, a task corresponds to a single simulation replicate, which generates data from a single DGP and fits a single Method to that data, using the associated parameters (either the defaults or values that have been varied in the Experiment).

Assuming each task takes approximately the same amount of time to complete regardless of which worker it is assigned to, then with n = 100 and p = 4 each worker should complete around 25 of the tasks. In this ideal setting, the total time to complete the 100 tasks should be roughly a quarter of the time a single worker would need to complete them all.
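
As a rough sketch of that arithmetic (the per-task time below is a made-up number, purely for illustration), the ideal wall time is just the serial time divided by the number of workers reported by the current future plan:

n_tasks <- 100
p_workers <- nbrOfWorkers()   # number of workers in the current future plan
task_time <- 3                # hypothetical seconds per replicate
serial_time <- n_tasks * task_time
ideal_parallel_time <- serial_time / p_workers
c(serial = serial_time, ideal_parallel = ideal_parallel_time)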

Dealing with task heterogeneity

In more realistic scenarios--and especially in simulation experiments, which often compare heterogeneous methods under diverse data-generating processes for a range of sample sizes--tasks can be much less uniform. Different groupings of tasks can have profound implications for the overall running time. Therefore, it's important to carefully decide how to arrange your simulation into separate experiments in order to take greatest advantage of the available parallelism.

simChef distributes the simulation's replicates evenly across the available future workers, which partially answers the how question. The remainder of the answer depends on your specific application; the example at the end of this vignette illustrates one way to think about splitting heterogeneous tasks into separate experiments.

On the roadmap: nested parallelism

In the future, we plan to give users more control over how computation is split across workers, including nested parallelism for cases where, e.g., DGPs are split across a few nodes (e.g., using one of the plans in the future.batchtools package) while each node uses many cores to process replicates in parallel (e.g., using the future::multicore plan).
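
To give a flavor of what this could look like, here is a sketch of how such a nested topology can already be declared with future itself; the SLURM plan and the ncpus resource are cluster-specific assumptions, and simChef does not yet schedule replicates around a nested plan:

plan(list(
  # outer level: one batch job per future, submitted to the cluster scheduler
  tweak(future.batchtools::batchtools_slurm, resources = list(ncpus = 16)),
  # inner level: forked R processes on each node
  multicore
))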

If this is something you're interested in, please feel free to contribute to the discussion at https://github.com/Yu-Group/simChef/issues/54.

Example

Putting aside the caveats above for now, parallelization in simChef requires no modification to your code other than using future to set a parallel backend. In the example below, we choose the multicore backend (not available on Windows), which forks R processes using all of the available cores.

This example shows how total replicates can quickly add up when varying across DGP or Method parameters. By varying across parameters of one of the DGPs, we in effect have 17 distinct data-generating processes in the experiment (1 for dgp1 and 16 for the combinations of parameters to dgp2), though in actuality there are only two DGP objects. Similarly, we effectively have 4 distinct methods, though there are only 2 Method objects. With n_reps = 2, this results in 2 x 17 x 4 = 136 rows in the results tibble.

library(simChef)
library(future)
library(dplyr)

n_cores <- availableCores(methods = "system")
n_cores

plan(multicore, workers = n_cores)

dgp_fun1 <- function(n=100, rho=0.5, noise_level=1) {
  # 5 correlated features with pairwise correlation rho
  cov_mat <- diag(nrow = 5)
  cov_mat[cov_mat == 0] <- rho
  X <- MASS::mvrnorm(n = n, mu = rep(0, 5), Sigma = cov_mat)
  # linear response: intercept -8, two nonzero coefficients, Gaussian noise
  y <- cbind(1, X) %*% c(-8, 3, -1, 0, 0, 0) + rnorm(n, sd = noise_level)
  return(list(X = X, y = y))
}

dgp_fun2 <- function(n=100, d=100, rho=0.5, sparsity=0.5, noise_level=1,
                     nonzero_coeff = c(-3, -1, 1, 3)) {
  # d correlated features with pairwise correlation rho
  cov_mat <- diag(nrow = d)
  cov_mat[cov_mat == 0] <- rho
  X <- MASS::mvrnorm(n = n, mu = rep(0, d), Sigma = cov_mat)
  # each coefficient is 0 with probability `sparsity`, otherwise drawn
  # uniformly from the values in `nonzero_coeff`
  coeff_prob <- c(sparsity, rep((1 - sparsity) / 4, times = 4))
  coeff <- c(
    -8, # intercept
    sample(
      c(0, nonzero_coeff), size = d, replace = TRUE,
      prob = coeff_prob
    )
  )
  y <- cbind(1, X) %*% coeff + rnorm(n, sd = noise_level)
  return(list(X = X, y = y))
}

dgp1 <- create_dgp(dgp_fun1, .name = "dense_dgp")
dgp2 <- create_dgp(dgp_fun2, .name = "sparse_dgp")

ols <- function(X, y) {
  fit <- lm(y ~ X) %>% broom::tidy()
  return(fit)
}

elnet <- function(X, y, alpha=1) {
  fit <- glmnet::glmnet(
    x = X, y = y, family = "gaussian", alpha = alpha
  ) %>% broom::tidy()
  return(fit)
}

method1 <- create_method(ols, .name = "ols")
method2 <- create_method(elnet, .name = "elnet")

experiment <- create_experiment(
  name = "exper", future.packages = "dplyr") %>%
  add_dgp(dgp1) %>%
  add_dgp(dgp2) %>%
  add_method(method1) %>%
  add_method(method2) %>%
  add_vary_across(
    .dgp = "sparse_dgp",
    d = c(100, 1000),
    rho = c(0.2, 0.9),
    sparsity = c(0.5, 0.9),
    nonzero_coeff = list(c(-3, -1, 1, 3), c(-0.3, -0.1, 0.1, 0.3))
  ) %>%
  add_vary_across(
    .method = "elnet", alpha = c(0, 0.5, 1)
  )

results <- experiment$fit(n_reps = 2)
results
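
As a quick sanity check on the arithmetic above, we can recompute the expected number of rows:

n_dgps <- 1 + 2 * 2 * 2 * 2   # dgp1 plus the 16 parameter combinations of dgp2
n_methods <- 1 + 3            # ols plus the 3 values of alpha for elnet
2 * n_dgps * n_methods        # n_reps * n_dgps * n_methods = 136
nrow(results)                 # should match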

If we find lower computational resource utilization than we'd like as the simulation grows, we might consider breaking this experiment up into separate experiments, e.g., by DGP, method, or parameters such as sample size n and number of covariates d, depending on which factors have the greatest impact on task duration.
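
For instance, one possible (purely illustrative) split reuses the objects defined above and assumes the high-dimensional sparse DGP dominates run time, so it gets an experiment of its own:

exper_dense <- create_experiment(
  name = "exper_dense", future.packages = "dplyr") %>%
  add_dgp(dgp1) %>%
  add_method(method1) %>%
  add_method(method2) %>%
  add_vary_across(.method = "elnet", alpha = c(0, 0.5, 1))

exper_sparse <- create_experiment(
  name = "exper_sparse", future.packages = "dplyr") %>%
  add_dgp(dgp2) %>%
  add_method(method1) %>%
  add_method(method2) %>%
  add_vary_across(
    .dgp = "sparse_dgp",
    d = c(100, 1000),
    rho = c(0.2, 0.9),
    sparsity = c(0.5, 0.9),
    nonzero_coeff = list(c(-3, -1, 1, 3), c(-0.3, -0.1, 0.1, 0.3))
  ) %>%
  add_vary_across(.method = "elnet", alpha = c(0, 0.5, 1))

results_dense <- exper_dense$fit(n_reps = 2)
results_sparse <- exper_sparse$fit(n_reps = 2)

The two experiments can then be run one after another, or even on different resources by changing the future plan in between.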


