Monte Carlo Simulations made easy and tidy with tidyMC
In tidyMC: Monte Carlo Simulations Made Easy and Tidy

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width = 7, 
  fig.height = 5
)

Package overview

Monte Carlo simulations aim to study the properties of statistical inference techniques. At its core, a Monte Carlo simulation works through the application of the techniques to repeatedly drawn samples from a pre-specified data generating process. The tidyMC package aims to cover and simplify the whole workflow of running a Monte Carlo simulation in either an academic or professional setting. Thus, tidyMC aims to provide functions for the following tasks:

Running a Monte Carlo simulation for a user defined function over a parameter grid using future_mc()
Summarizing the results by (optionally) user defined summary functions using summary.mc()
Creating plots of the Monte Carlo simulation results, which can be modified by the user using plot.mc() and plot.summary.mc()
Creating a LaTeX table summarizing the results of the Monte Carlo simulation using tidy_mc_latex()

In the following subsections we will show how you can implement those tasks using the tidyMC package.

Installing tidyMC

Until now, the tidyMC package is not on CRAN, thus you need to download the development version from GitHub as follows:

# install.packages("devtools")
devtools::install_github("stefanlinner/tidyMC")

Afterwards you can load the package:

library(tidyMC)

Moreover, the following packages will be used in this vignette:

# install.packages("magrittr")
library(magrittr)
# install.packages("ggplot2")
library(ggplot2)
# install.packages("kableExtra")
library(kableExtra)

Run your first Monte Carlo Simulation with `future_mc()`

future_mc() allows you to run a Monte Carlo simulation for a user defined function and given parameters. The first argument of future_mc() is fun which has to be a function that handles the generation of data, the application of the method of interest and the evaluation of the result for a single repetition and parameter combination. future_mc() handles the generation of loops over the desired parameter grids and the repetition of the Monte Carlo experiment for each of the parameter constellations. Consider the following example for fun and note that it performs the required tasks of generating data, applying the method and evaluating the results:

# fun
ols_test <- 
  function(b0, b1, b2, n, sigma2, param_x1, param_x2, inc_x2){

    # generation of data
    x1 <- rnorm(n = n, mean = param_x1[1], sd = param_x1[2])
    x2 <- rnorm(n = n,  mean = param_x2[1], sd = param_x2[2])
    e <- rnorm(n, sd = sqrt(sigma2))
    y <- b0 + b1*x1 + b2*x2 + e

    if (inc_x2 == 0){
      x2 <- x2 * inc_x2
    }

    # application of method
    estim <- lm(y ~ x1 + x2)

    # evaluation of the result for a single repetition and parameter combination
    out <- list(b0 = estim$coefficients[1],
                b1 = estim$coefficients[2],
                b2 = estim$coefficients[3],
                sigma2 = var(estim$residuals))
    return(out)
  }

The second argument of future_mc() is repetitions which should be an integer specifying the number of Monte Carlo iterations. While the third argument param_list should be a list whose components are named after the parameters of fun which should vary for the different Monte Carlo Simulation and each component is a vector containing the desired grid values for the parameter. Consider the following example for param_list and note that its components are named accordingly to the parameters of ols_test: n and inc_x2, respectively:

# param_list
param_list_ols <- 
  list(n = c(100, 200, 300), inc_x2 = c(0,1))

future_mc() takes care of creating all possible parameter combinations of param_list and runs the Monte Carlo simulation over all of these for all possible combinations. If you don't want to run a Monte Carlo simulation for every possible parameter combination you can alternatively define param_table with a data.frame or data.table containing the parameter combinations you are interested in.The ... argument can be used to specify further arguments of fun which are not contained in param_list. Those arguments will be held fixed for all parameter combinations. In our OLS example those arguments are b0, b1, b2, sigma2, param_x1, and param_x2.

Moreover, there are four formal requirements that fun and thus ols_test have to fulfill:

The arguments of fun which are present in param_list have to be scalar values. Note that the arguments of ols_test which are contained in param_list_ols: n and inc_x2 are scalar values. The remaining arguments of ols_test are allowed to take non-scalar values.
Every variable used inside fun has either to be defined inside fun or given as an argument through the ... argument.
The value returned by fun has to be a named list. In our example the names of the returned list are b0, b1, b2, and sigma2.
The names of the returned values and those of the arguments contained in param_list need to be different. Moreover, they cannot be params, repetitions or setup as these names are already occupied. Note that b0, b1, b2, and sigma2 are the names of the returned values as well as names of arguments of ols_test. However, none of those arguments is contained in param_list_ols. If we would add either of those variables to param_list_ols we would need to change the name of the returned value for the respective variable.

We recommend to even further restrict the return value of fun to be a named list of scalars. This allows you to use all comfort functions of the tidyMC package. As you can see, we did that for ols_test.

The argument parallelisation_plan allows the user to set a parallelisation plan. While the argument parallelisation_options allows the user to fine tune functions, such as furrr::future_map() by furrr::furrr_options(). The argument seed for furrr::furrr_options() can be specified in parallelisation_options following the formal requirements of its respective documentation. Moreover, the user can also decide not to run the Monte Carlo in parallel by setting parallel = FALSE. To construct parallelisation_plan the user needs to provide a list named after the arguments of future::plan. The main argument strategy needs to provide the specific type of parallelisation the user would like to use and the number of cores which are used in the function. Some of the options for strategy are: multisession, multicore and cluster. We strongly recommend the user to read the documentation of the future package for a more detailed instruction of how to set up the different strategies.

As a default (check = TRUE) future_mc() runs a quick check by running a single test-iteration for each parameter combination in order to check for possible errors in fun. If a error occurs the user not only receives the error message but also the parameter combinations for which the error occurred:

set.seed(101)
first_mc_ols <- future_mc(
  fun = ols_test, 
  repetitions = 1000, 
  param_list = param_list_ols, 
  b0 = 1, 
  b1 = 4, 
  b2 = 5, 
  sigma2 = -2,
  param_x1 = c(0,5),
  param_x2 = c(0,6),
  check = TRUE
)

The attentive reader might already have noticed that we specified sigma2 = -2 which doesn't make sense, as the variance of the error term cannot be negative. This results in a failed check for all parameter combinations, as this parameter is held fixed for any combination. Once we correct that mistake, we can run our first Monte Carlo simulation:

set.seed(101)
first_mc_ols <- future_mc(
  fun = ols_test, 
  repetitions = 1000, 
  param_list = param_list_ols, 
  b0 = 1, 
  b1 = 4, 
  b2 = 5, 
  sigma2 = 2, # correctly specify sigma2
  param_x1 = c(0,5),
  param_x2 = c(0,6),
  check = TRUE
)

future_mc returns a list of type mc and length r length(first_mc_ols) consisting of a tibble (first_mc_ols$output) containing the return value of fun for each iteration and parameter combination. In our case first_mc_ols$output contains a column for each output b0, b1, b2, and sigma2, as well as a column for each parameter in param_list_ols and a column containing the nice_names of the parameter combinations. Overall the first_mc_ols$output consists of r nrow(first_mc_ols$output) rows, i.e., for each parameter combination 1.000 rows:

first_mc_ols$output

If ols_test would not return a named list of scalars, but a named list of non-scalars, then first_mc_ols$output would not contain a column for each output, but a single column containing the named list of non-scalars for each iteration and parameter combination.

Moreover, first_mc_ols returns much other information about the Monte Carlo simulation that can be printed in a dense representation:

first_mc_ols

Summarize your results with `summary.mc()`

If fun returns a named list of scalars the user can use summary.mc() to summarize all Monte Carlo results. The first argument of the function is an object of class mc returned by future_mc(). The next argument sum_funs determines which summarizing functions will be used on the simulation results. The functions can be provided for any combination of: parameter combinations resulting from param_list, and the outputs of fun. Every specified function can only take one argument, which is the vector (with length repetitions) for every output. We will present all customization options of sum_funs in a stepwise manner.

The first option of summarizing the results is given by just providing the mc object to summary.mc. In this case, mean() will be applied to all numeric values and summary() to all non-numeric data types. When the summarizing functions return one numeric value (like mean()) the results are twofold:

First, a single scalar result of the function evaluated using the complete output vector is returned in the first element.
Second, a vector with length repetitions of numeric results from the stepwise calculation of the function's result across the output's vector. We call this resulting vector as the "path" of the summarizing function.

Additionally, to save computation time the parameter which_path is available to the user who wants to specify for which outputs the "path" should be calculated. The user needs to provide a character vector with the output names'. Moreover, the options "all" (the default) and "none" are also available.

For the OLS example since all outputs of ols_test are numeric, the returned object will be a named nested list composed of four elements named after the nice_names returned by future_mc(). Each of this elements are itself lists containing the summarized outputs, i.e. b0, b1, b2, and sigma2. Lastly each of these are composed by the "path" and scalar result of mean().

# Default
summary_default <- summary(first_mc_ols)
summary_default
str(summary_default[[1]])

This nested list structure should give an idea of the reach and flexibility the sum_funs argument is allowed to have, since the user can specify a function for each element in this list.

For an intermediate level of customization, the user can provide a combination of summarizing functions for every output, which will be used for all parameter combination. In the case of the OLS example, if we want to apply mean() on all estimated coefficients, but we want to use var() on the MC results of sigma2, the sum_funs should have the following structure:

# summarizing output for each parameter combination with one combination
sum_funs_ols <- list(b0 = mean, b1 = mean , b2 = mean, sigma2 = var)

Moreover, the user can specify any function provided it takes the output vector as only available argument.

Lastly for the last level of customization, a nested list named after the nice_names where every element follows the structure of the last example (components named after the outputs and each component is a function) can be specified. We present an example for this:

quantile_sum <- function(x) quantile(x, probs = 0.75)

# summarizing output differently for different parameter combinations
sum_funs2 <-
  list(
    list(b0 = quantile_sum, b1 = min, b2 = min, sigma2 = mean),
    list(b0 = mean, b1 = quantile_sum, b2 = mean, sigma2 = mean),
    list(b0 = median, b1 = median, b2 = median, sigma2 = mean),
    list(b0 = max, b1 = max, b2 = max, sigma2 = mean),
    list(b0 = min, b1 = min, b2 = min, sigma2 = quantile_sum),
    list(b0 = mean, b1 = mean, b2 = quantile_sum, sigma2 = quantile_sum)
  )
names(sum_funs2) <- first_mc_ols$nice_names
summary_out_param_spec <- summary(first_mc_ols, sum_funs = sum_funs2)

We would like to reiterate that the provided summary functions are not restricted regarding the complexity of their return value. However, the path of the summarized output over all simulation repetitions is only returned if the provided summary functions return a single numeric value. Thus, the following comfort functions plot.summary.mc() and tidy_mc_latex() will only work in this specific case.

Plot your Monte Carlo Simulation with `plot.mc()` and `plot.summary.mc()`

If fun returns a named list of scalars the user can use plot.mc() to generate a list of objects of class gg and ggplot2 for all Monte Carlo results. The first argument of the function is an object of class mc returned by future_mc(). Using the argument plot the user can indicate whether the generated plots should be printed immediately or only returned as a list. The list will contain one plot for each output of fun comparing the results of the different simulation setups. In general, plot.mc() generates density plots for numeric outputs and bar plots for non-numeric outputs. In our example a plot for b0, b1, b2, and sigma2 will be returned in a list of length four and as b0, b1, b2, and sigma2 are all numeric outputs plot.mc() will return density plots for each of those:

mc_ols_plot <- plot(first_mc_ols, plot = FALSE)
names(mc_ols_plot)

As the single list elements are of class gg and ggplot2, we can easily customize and extend the single plots using familiar ggplot2 commands:

mc_ols_plot$b1 + 
  ggplot2::geom_vline(xintercept = 4, col = "red") + 
  ggplot2::theme_minimal()

When creating the plots the user can also subset the setups which he/she would like to see in the plots using the first_mc_ols$nice_names in the function argument which_setup, or a named list in parameter_comb. The single components of the list have to be named after the parameters specified in param_list and contain vectors specifying the values of the parameters to filter by. In the ols example we can filter by the parameters n and inc_x2:

# subsetting by nice_names
mc_ols_plot_subset1 <- 
  plot(first_mc_ols, plot = FALSE, which_setup = first_mc_ols$nice_names[4:6])
#subsetting by parameter values
mc_ols_plot_subset2 <- 
  plot(first_mc_ols, plot = FALSE, parameter_comb = list(inc_x2 = 1))

mc_ols_plot_subset1$sigma2

Thus, if the user wants distinct plots for every parameter combination, one needs to subset the plot for any single setup in first_mc_ols$nice_names.

Finally, you can also plot the simulation results for several parameter combination in one single plot by specifying the argument join with the respective first_mc_ols$nice_names:

mc_ols_plot_joint <- plot(first_mc_ols, plot = FALSE, 
                          join = first_mc_ols$nice_names)
mc_ols_plot_joint$b2

Please be aware that the only one of the three arguments which_setup, parameter_comb, and join can be specified at the same time.

If the provided summary functions in summary.mc() return a single numeric value and thus a path of the summarized output over all simulation repetitions is returned, the user can use plot.summary.mc() to plot those paths. The first argument of the function is an object of class summary.mc returned by summary.mc(). Just as plot.mc(), plot.summary.mc() returns a list of objects of class gg and ggplot2. The list will contain one line plot for each output of fun displaying the paths of the results of the different simulation setups. The arguments plot, which_setup, parameter_comb, and join can be specified the same way as for plot.mc:

sum_mc_plot <- plot(summary_default, plot = FALSE)
sum_mc_plot$b1 + 
  ggplot2::geom_vline(xintercept = 100, col = "red") +
  ggplot2::theme(axis.text.x = element_text(angle = 45, 
                                            hjust = 0.1, 
                                            vjust = 0.2))

sum_mc_plot_subset1 <- 
  plot(summary_default, 
       plot = FALSE, 
       which_setup = first_mc_ols$nice_names[4:6])

sum_mc_plot_subset2 <- 
  plot(summary_default, 
       plot = FALSE, 
       parameter_comb = list(inc_x2 = 1))

sum_mc_plot_subset2$b1

sum_mc_plot_joint <- 
  plot(summary_default, plot = FALSE, join = first_mc_ols$nice_names[4:6])

sum_mc_plot_joint$b1

Create a 'LaTeX' table of your results with `tidy_mc_latex()`

Using tidy_mc_latex the user can present the results from future_mc directly into a LaTeX document using all the benefits from the kableExtra package. The first and main argument x needed by tidy_mc_latex is a summary.mc object obtained from summary.mc(). To present the results in a comprehensive manner the function requires that all summarized outputs in summary.mc be scalar numeric results for all parameter combinations. In case, the summarizing function returns more than one argument then this will be presented in the table as an NA value. The second argument of the function is repetitions_set which allows the user to see the certain values of the "path" of the summarized results of fun. To illustrate this we use the MC results for the OLS example:

tidy_mc_latex(
  x = summary(first_mc_ols),
  repetitions_set = c(10, 1000)
) %>% 
  print()

The resulting table is composed of two panels, which corresponds to the length of repetitions_set. In them the columns correspond to the results of the summarizing functions for b0, b1, b2, and sigma2, and the rows correspond to specific combinations of the parameters provided in parameter_list. The footnote in the table shows the number of repetitions and the total parameter combinations provided to future_mc.

Moreover, the next three arguments in tidy_mc_latex are comfort options to select which results of summary.mc depending on the parameter combinations will be presented in the table. On one hand, The argument which_setup allows the user to make use of the nice_names of the parameter combinations in the returned object by future_mc() to subset the rows in the table. On the other hand, the argument parameter_comb is used to directly filter the parameters by their values. This argument requires a named list, containing vector or scalar values of all parameters to be filtered from. The user must only provide one of this arguments at a time. We show how to make use of both parameters to subset the rows of table for $n = 100$ and $inc_{x2}=1$:

tidy_mc_latex(
  x = summary(first_mc_ols),
  repetitions_set = c(10, 1000),
  which_setup = first_mc_ols$nice_names[1]) %>% 
  print()

tidy_mc_latex(
  x = summary(first_mc_ols),
  repetitions_set = c(10, 1000),
  parameter_comb = list(n = 100, inc_x2 = 1)) %>% 
  print()

The user can also subset the outputs of the original function (columns in the table) using the parameter which_out. This is done using a character vector with the names of the outputs, e.g. to only show the columns for b0 and sigma2 the user needs to set which_out = c("b0", "sigma2").

Lastly, by providing a named list to the argument kable_options, the user can change all arguments of the underlying function kable::kbl(). The names of the list have to be equal to the names of the arguments of the function and the contents of every element also has to fulfill its requirements. We provide an example of how this list should be constructed, but for optimal usage we strongly recommend the user to see the documentation of the kable::kbl() function.

To allow for further customization the returned object by tidy_mc_latex is of class knitr_kable, therefore the user can utilize most functions from the kableExtra package in the standard tidyverse manner. For example:

tidy_mc_latex(summary(first_mc_ols), 
              repetitions_set = c(10, 1000),
              kable_options = list(
                col.names = c("Number of observations",
                              "$x_2$ included or not",
                              "$\\beta_0$", "$\\beta_1$",
                              "$\\beta_2$", "$s^2$"), 
                caption = "Ommited variable bias MC results"
              )
) %>%
  kableExtra::kable_styling(latex_options = "HOLD_position") %>% 
  print()