```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # disable R chunk evaluation when using macOS in GH Actions
  # see https://github.com/tidyverse/ggplot2/issues/2252#issuecomment-1006713187
  # and https://github.com/lcolladotor/biocthis/issues/27 and
  # https://github.com/bodkan/slendr/commit/f0f5ae18452cc9df5d151874e41b0b8fd5f29aa2
  eval = Sys.getenv("RUNNER_OS") != "macOS"
)

library(simChef)
set.seed(12345)

# remove old cached results to start fresh
for (fname in list.files(file.path("results", "Example Experiment"),
                         pattern = ".rds", recursive = TRUE,
                         full.names = TRUE)) {
  file.remove(fname)
}
```
Welcome to `simChef`! Our goal is to empower data scientists to focus their attention toward scientific best practices by removing the administrative burdens of simulation design. Using `simChef`, practitioners can seamlessly and efficiently run simulation experiments using an intuitive *tidy* grammar. The core features of `simChef` include:

To highlight the ease of running simulation studies using `simChef`, we will begin by showcasing a basic example usage in Section \@ref(example-usage). Then, in Section \@ref(writing-your-own-simulation-experiment), we will provide additional details on how to create your own simulation experiment using `simChef`.
Simulation Example: Suppose, for concreteness, that we would like to evaluate the prediction accuracy of two methods (linear regression and random forests) under a linear data-generating process across different signal-to-noise ratios.
At its core, `simChef` breaks down a simulation experiment into four components:

- `DGP()`: the data-generating processes (DGPs) from which to generate data
- `Method()`: the methods (or models) to fit on the data in the experiment
- `Evaluator()`: the evaluation metrics used to evaluate the methods' performance
- `Visualizer()`: the visualization procedures used to visualize outputs from the method fits or evaluation results (can be tables, plots, or even R Markdown snippets to display)

Using these components, users can easily run a simulation experiment in six simple steps. Below, we summarize these steps and provide code for our running example.
Recall that in our simulation example, we would like to study two methods (linear regression and random forests) under a linear DGP. We code this DGP (`linear_dgp_fun`) and these methods (`linear_reg_fun` and `rf_fun`) via custom functions below.
```r
#' Linear Data-Generating Process
#'
#' @description Generate training and test data according to the classical
#'   linear regression model: y = X*beta + noise.
#'
#' @param n_train Number of training samples.
#' @param n_test Number of test samples.
#' @param p Number of features.
#' @param beta Coefficient vector in linear regression function.
#' @param noise_sd Standard deviation of additive noise term.
linear_dgp_fun <- function(n_train, n_test, p, beta, noise_sd) {
  n <- n_train + n_test
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- X %*% beta + rnorm(n, sd = noise_sd)
  data_list <- list(
    X_train = X[1:n_train, , drop = FALSE],
    y_train = y[1:n_train],
    X_test = X[(n_train + 1):n, , drop = FALSE],
    y_test = y[(n_train + 1):n]
  )
  return(data_list)
}
```
```r
#' Linear Regression Method
#'
#' @param X_train Training data matrix.
#' @param y_train Training response vector.
#' @param X_test Test data matrix.
#' @param y_test Test response vector.
linear_reg_fun <- function(X_train, y_train, X_test, y_test) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- lm(y ~ ., data = train_df)
  predictions <- predict(fit, data.frame(X_test))
  return(list(predictions = predictions, y_test = y_test))
}

#' Random Forest Method
#'
#' @param X_train Training data matrix.
#' @param y_train Training response vector.
#' @param X_test Test data matrix.
#' @param y_test Test response vector.
#' @param ... Additional arguments to pass to `ranger::ranger()`.
rf_fun <- function(X_train, y_train, X_test, y_test, ...) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- ranger::ranger(y ~ ., data = train_df, ...)
  predictions <- predict(fit, data.frame(X_test))$predictions
  return(list(predictions = predictions, y_test = y_test))
}
```
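Before handing these functions to `simChef`, it can be helpful to check them manually. Below is a minimal sanity check (not part of the original example); the parameter values mirror those used in Step 2.

```r
# quick manual check of the user-defined DGP and method functions
data_list <- linear_dgp_fun(
  n_train = 200, n_test = 200, p = 2, beta = c(1, 0), noise_sd = 1
)
out <- linear_reg_fun(
  data_list$X_train, data_list$y_train, data_list$X_test, data_list$y_test
)
str(out, max.level = 1)  # a list with `predictions` and `y_test`
```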
We can similarly write custom functions to evaluate the prediction accuracy of these methods and to visualize results. However, `simChef` also provides a built-in library of helper functions for common evaluation and visualization needs. We leverage this library for convenience here. In particular, for prediction tasks, we will use the functions `simChef::summarize_pred_err()` and `simChef::plot_pred_err()` to evaluate and plot the prediction error between the observed and predicted responses, respectively. For details, see `?simChef::summarize_pred_err` and `?simChef::plot_pred_err`.
Note: Most of the user-written code is encapsulated in Step 1. From here on, there is minimal coding on the user end as we turn to leverage the `simChef` grammar for experiments.
#### Step 2: Create `DGP()`, `Method()`, `Evaluator()`, and `Visualizer()` class objects {-}

Once we have specified the relevant DGP, method, evaluation, and visualization functions, the next step is to convert these functions into `DGP()`, `Method()`, `Evaluator()`, and `Visualizer()` class objects (`R6` classes) for `simChef`. To do so, we simply wrap the functions in `create_dgp()`, `create_method()`, `create_evaluator()`, or `create_visualizer()`, while specifying (a) an intelligible name for the object and (b) any input parameters to pass to the function. For example,
```r
## DGPs
linear_dgp <- create_dgp(
  .dgp_fun = linear_dgp_fun,
  .name = "Linear DGP",
  # additional named parameters to pass to .dgp_fun()
  n_train = 200, n_test = 200, p = 2, beta = c(1, 0), noise_sd = 1
)

## Methods
linear_reg <- create_method(
  .method_fun = linear_reg_fun,
  .name = "Linear Regression"
  # additional named parameters to pass to .method_fun()
)
rf <- create_method(
  .method_fun = rf_fun,
  .name = "Random Forest",
  # additional named parameters to pass to .method_fun()
  num.threads = 1
)

## Evaluators
pred_err <- create_evaluator(
  .eval_fun = summarize_pred_err,
  .name = "Prediction Accuracy",
  # additional named parameters to pass to .eval_fun()
  truth_col = "y_test",
  estimate_col = "predictions"
)

## Visualizers
pred_err_plot <- create_visualizer(
  .viz_fun = plot_pred_err,
  .name = "Prediction Accuracy Plot",
  # additional named parameters to pass to .viz_fun()
  eval_name = "Prediction Accuracy"
)
```
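Each of these objects can also be inspected on its own before assembling the experiment. As a brief illustrative check (a sketch, assuming the `DGP` class exposes a `generate()` method that uses the parameters supplied above):

```r
print(linear_dgp)
# generate one dataset from the DGP using its stored parameters
data_list <- linear_dgp$generate()
str(data_list, max.level = 1)
```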
Thus far, we have created the many individual components (i.e., the `DGP(s)`, `Method(s)`, `Evaluator(s)`, and `Visualizer(s)`) for our simulation experiment. We next assemble, or `add`, these components together to create a complete simulation experiment recipe via:
```r
experiment <- create_experiment(name = "Example Experiment") %>%
  add_dgp(linear_dgp) %>%
  add_method(linear_reg) %>%
  add_method(rf) %>%
  add_evaluator(pred_err) %>%
  add_visualizer(pred_err_plot)
```
We can also vary across one or more parameters in the `DGP(s)` and/or `Method(s)` by adding a `vary_across` component to the simulation experiment recipe. In our running simulation example, we would like to vary across the amount of noise (`noise_sd`) in the underlying DGP, which can be done as follows:
```r
experiment <- experiment %>%
  add_vary_across(.dgp = "Linear DGP", noise_sd = c(0.1, 0.5, 1, 2))
```
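Method parameters can be varied in the same way via `add_vary_across(.method = ...)` (see also the template at the end of this vignette). For illustration only (not applied in this example), one might vary the number of trees in the random forest, where `num.trees` is a `ranger::ranger()` argument passed through `rf_fun()`'s `...`:

```r
## not run here; shown only to illustrate varying a Method parameter
# experiment <- experiment %>%
#   add_vary_across(.method = "Random Forest", num.trees = c(50, 100, 500))
```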
Tip: To see a high-level summary of the simulation experiment recipe, it can be helpful to print the experiment.
```r
print(experiment)
```
A crucial component when running veridical simulations is documentation. We highly encourage practitioners to document:

- which `DGP(s)`, `Method(s)`, `Evaluator(s)`, and `Visualizer(s)` were used, and
- why these `DGP(s)`, `Method(s)`, `Evaluator(s)`, and `Visualizer(s)` were chosen.

This can and should be done before even running the simulation experiment. To facilitate this tedious but important process, we can easily initialize a documentation template to fill out using:
```r
init_docs(experiment)
```
```r
# copy pre-written .md files in vignettes/example-docs/ to the experiment's docs dir
file.copy(
  from = here::here(file.path("vignettes", "example-docs", experiment$name, "docs")),
  to = file.path(experiment$get_save_dir()),
  recursive = TRUE
)
file.remove(file.path(experiment$get_save_dir(), "experiment.rds"))
```
At this point, we have created and documented the simulation experiment recipe, but we have not generated any results from the experiment. That is, we have only given the simulation experiment instructions on what to do. To run the experiment over, say, 100 replicates:
```r
results <- run_experiment(experiment, n_reps = 100, save = TRUE)
```
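The returned `results` object collects the outputs of the experiment. As a rough sketch of how one might inspect it (assuming it is a named list of the fit, evaluation, and visualization results, as described in the next section):

```r
str(results, max.level = 1)
# e.g., results$fit_results, results$eval_results, results$viz_results
```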
Finally, to visualize and report all results from the simulation experiment, we can render the documentation. This will generate a single HTML document containing the code, the documentation (from Step 4), and the simulation results.
```r
render_docs(experiment)
```
```r
# create pkgdown/assets directory, which will be copied to the gh-pages branch
# during the pkgdown workflow (see .github/workflows/pkgdown.yaml)
assets_dir <- here::here("pkgdown/assets")
dir.create(assets_dir, recursive = TRUE)

# copy html output to pkgdown/assets directory for website
file.copy(
  from = file.path(experiment$get_save_dir(), paste0(experiment$name, ".html")),
  to = file.path(assets_dir, "example_experiment.html"),
  overwrite = TRUE
)
```
The rendered document corresponding to this simulation example can be found here, and voila! Simulation example complete!
In the next section, we will provide additional details necessary for you to create your own simulation experiment.
When it comes to writing your own simulation experiment using `simChef`, the main task is writing the necessary DGP, method, evaluation, and visualization functions for the experiment (i.e., Step 1 above). Once these functions are appropriately written and specified, `simChef` takes care of the rest, with Steps 2-6 requiring very little effort from the user.
Thus, focusing on Step 1, it is helpful to understand the inputs and outputs that are necessary for writing each user-defined function. To do so, let us walk through what happens when we run a `simChef` `Experiment` and highlight the necessary inputs and outputs to these functions as we go.
At the highest level, running an experiment (via `run_experiment()`) consists of three main steps (see Figure \@ref(fig:run-experiment-figure)):

1. Fit each `Method` on each `DGP` (and for each of the varying parameter configurations specified in `vary_across`) via `fit_experiment()`.
2. Evaluate the experiment through each `Evaluator` via `evaluate_experiment()`.
3. Visualize the experiment through each `Visualizer` via `visualize_experiment()`.

In other words, `run_experiment()` is simply a wrapper around these three functions: `fit_experiment()`, `evaluate_experiment()`, and `visualize_experiment()`, as sketched below.
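As a rough sketch, the three steps can also be called one at a time rather than through `run_experiment()`. The argument names below follow the descriptions in this section; consult each function's documentation for the exact signatures.

```r
fit_results <- fit_experiment(experiment, n_reps = 100)
eval_results <- evaluate_experiment(experiment, fit_results)
viz_results <- visualize_experiment(experiment, fit_results, eval_results)
```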
In what follows, we will discuss fitting, evaluating, and visualizing the experiment, each in turn.
When fitting an experiment, each `Method` is fit on each `DGP` (and for each specified `vary_across` parameter configuration). As such, the outputs of the user-defined DGP functions should match the input arguments of the user-defined `Method` functions. More specifically:
- Inputs to the DGP function: any number of named parameters (e.g., `dgp_param1`, `dgp_param2`, ...)
- Outputs of the DGP function: a named list of data objects (e.g., `dgp_out1`, `dgp_out2`, ...)
- Inputs to the Method function: the named data objects output from the DGP function (i.e., `dgp_out1`, `dgp_out2`, ...), plus any number of additional named parameters (e.g., `method_param1`, `method_param2`, ...)
- Outputs of the Method function: a named list of results (e.g., `method_out1`, `method_out2`, ...). [Note: this output list should contain all information necessary to evaluate and visualize the method's performance.]

Finally, `fit_experiment()` returns the fitted results across all experimental replicates in a `tibble`. This `tibble` contains:

- `.rep`: replicate ID
- `.dgp_name`: name of DGP
- `.method_name`: name of Method
- `vary_across` parameter columns (if applicable): value of the specified `vary_across` parameter
- columns corresponding to the named outputs of the `Method` (e.g., `method_out1`, `method_out2`, ...)

We summarize these inputs and outputs in Figure \@ref(fig:fit-experiment-figure).
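For the running example from the previous section, a quick way to see this structure (a sketch, using the `results` object from the earlier run and assuming it contains a `fit_results` component as described above) is:

```r
fit_results <- results$fit_results
dplyr::glimpse(fit_results)
# expected columns: .rep, .dgp_name, .method_name, noise_sd,
# predictions (list-column), y_test (list-column)
```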
Next, when evaluating an experiment, the output of `fit_experiment()` (i.e., `fit_results` from Figure \@ref(fig:fit-experiment-figure)) as well as a character vector of the `vary_across` parameter names (i.e., `vary_params`) are automatically passed to each `Evaluator` function.
Typically, an `Evaluator` takes the `fit_results` tibble as input, evaluates some metric(s) on each row or group of rows (e.g., summarizing across replicates), and returns the resulting `tibble` or `data.frame`.
We summarize the relevant inputs and outputs below and in Figure \@ref(fig:eval-experiment-figure).

- Inputs:
    - `fit_results` (optional): output of `fit_experiment()`
    - `vary_params` (optional): character vector of parameter names that are varied across in the experiment
    - any number of additional named parameters (e.g., `eval_param1`, `eval_param2`, ...)
- Output: a `tibble` or `data.frame` of evaluation results

The output of `evaluate_experiment()` is a named list of evaluation result tibbles, one component for each `Evaluator` in the experiment. Here, the list names correspond to the names given to each `Evaluator` in `create_evaluator(.name = ...)`, and the list elements are exactly the return value from the `Evaluator` function.
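For instance, a custom evaluator for the running example might look like the following. This is a hypothetical sketch (not part of the original example): it computes the root mean squared error (RMSE) per replicate, assuming that `predictions` and `y_test` are stored as list-columns in `fit_results` as described above.

```r
rmse_fun <- function(fit_results, vary_params = NULL) {
  fit_results %>%
    dplyr::rowwise() %>%
    # predictions and y_test are list-columns: one vector per replicate
    dplyr::mutate(rmse = sqrt(mean((y_test[[1]] - predictions[[1]])^2))) %>%
    dplyr::ungroup() %>%
    dplyr::select(
      dplyr::all_of(c(".rep", ".dgp_name", ".method_name", vary_params)), rmse
    )
}

rmse_eval <- create_evaluator(.eval_fun = rmse_fun, .name = "RMSE")
```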
In the last step, when visualizing an experiment, the outputs of `fit_experiment()` and `evaluate_experiment()` (`fit_results` and `eval_results`, respectively) as well as `vary_params` are automatically passed to each `Visualizer` function.
Typically, a `Visualizer` takes in the `fit_results` and/or `eval_results` and constructs a plot to visualize the performance of the various methods (perhaps across the varying parameter specified by `vary_params`).
We summarize the relevant inputs and outputs below and in Figure \@ref(fig:viz-experiment-figure).

- Inputs:
    - `fit_results` (optional): output of `fit_experiment()`
    - `eval_results` (optional): output of `evaluate_experiment()`
    - `vary_params` (optional): character vector of parameter names that are varied across in the experiment
    - any number of additional named parameters (e.g., `viz_param1`, `viz_param2`, ...)
- Output: a visualization (e.g., a `ggplot` or `plotly` object)

The output of `visualize_experiment()` is a named list of visualizations or plots, one component for each `Visualizer` in the experiment. Here, the list names correspond to the names given to each `Visualizer` in `create_visualizer(.name = ...)`, and the list elements are exactly the return value from the `Visualizer` function.
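Continuing the hypothetical RMSE evaluator sketched above, a matching custom visualizer might look like this (again a sketch, not part of the original example; it assumes an evaluator named "RMSE" and at least one `vary_across` parameter, here `noise_sd`):

```r
rmse_plot_fun <- function(eval_results, vary_params = NULL) {
  # grab the evaluation results produced by the "RMSE" evaluator
  rmse_df <- eval_results[["RMSE"]]
  ggplot2::ggplot(rmse_df) +
    ggplot2::aes(
      x = factor(.data[[vary_params[1]]]), y = rmse, fill = .method_name
    ) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = vary_params[1], y = "RMSE", fill = "Method")
}

rmse_plot <- create_visualizer(.viz_fun = rmse_plot_fun, .name = "RMSE Plot")
```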
As alluded to previously, once the `DGP`, `Method`, `Evaluator`, and `Visualizer` functions have been defined, the hard work has been done. For convenience, we provide below a template that can be used to run an experiment, assuming that `dgp_fun`, `method_fun`, `evaluator_fun`, and `visualizer_fun` have been appropriately defined.
```r
## Step 2: Create DGP, Method, Evaluator, Visualizer class objects
dgp <- create_dgp(
  .dgp_fun = dgp_fun,
  .name = "DGP Name",
  # additional named parameters to pass to .dgp_fun()
  dgp_param1 = XXX, dgp_param2 = XXX, ...
)
method <- create_method(
  .method_fun = method_fun,
  .name = "Method Name",
  # additional named parameters to pass to .method_fun()
  method_param1 = XXX, method_param2 = XXX, ...
)
evaluator <- create_evaluator(
  .eval_fun = evaluator_fun,
  .name = "Evaluator Name",
  # additional named parameters to pass to .eval_fun()
  eval_param1 = XXX, eval_param2 = XXX, ...
)
visualizer <- create_visualizer(
  .viz_fun = visualizer_fun,
  .name = "Visualizer Name",
  # additional named parameters to pass to .viz_fun()
  viz_param1 = XXX, viz_param2 = XXX, ...
)

## Step 3: Assemble experiment recipe
experiment <- create_experiment(name = "Experiment Name") %>%
  add_dgp(dgp) %>%
  add_method(method) %>%
  add_evaluator(evaluator) %>%
  add_visualizer(visualizer) %>%
  # optionally vary across DGP parameter(s)
  add_vary_across(.dgp = "DGP Name", dgp_param1 = list(XXX, XXX, ...)) %>%
  # optionally vary across Method parameter(s)
  add_vary_across(.method = "Method Name", method_param1 = list(XXX, XXX, ...))

## Step 4: Document experiment recipe
init_docs(experiment)

## Step 5: Run experiment
results <- run_experiment(experiment, n_reps = 100, save = TRUE)

## Step 6: Render experiment documentation and results
render_docs(experiment)
```
For more information on `simChef` and additional features, check out:

- an example using `simChef` to develop a new flexible approach for predictive biomarker discovery, with rendered documentation here

Additional `simChef` examples/case studies and built-in library functions are still to come. Stay tuned!