knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 )
Monte Carlo simulations aim to study the properties of statistical inference techniques. At its core, a Monte Carlo simulation works through the application of the techniques to repeatedly drawn samples from a pre-specified data generating process. The tidyMC
package aims to cover and simplify the whole workflow of running a Monte Carlo simulation in either an academic or professional setting. Thus, tidyMC
aims to provide functions for the following tasks:
future_mc()
summary.mc()
plot.mc()
and plot.summary.mc()
LaTeX
table summarizing the results of the Monte Carlo simulation using tidy_mc_latex()
In the following subsections we will show how you can implement those tasks using the tidyMC
package.
Until now, the tidyMC
package is not on CRAN, thus you need to download the development version from GitHub as follows:
# install.packages("devtools") devtools::install_github("stefanlinner/tidyMC")
Afterwards you can load the package:
library(tidyMC)
Moreover, the following packages will be used in this vignette:
# install.packages("magrittr") library(magrittr) # install.packages("ggplot2") library(ggplot2) # install.packages("kableExtra") library(kableExtra)
future_mc()
future_mc()
allows you to run a Monte Carlo simulation for a user defined function and given parameters. The first argument of future_mc()
is fun
which has to be a function that handles the generation of data, the application of the method of interest and the evaluation of the result for a single repetition and parameter combination. future_mc()
handles the generation of loops over the desired parameter grids and the repetition of the Monte Carlo experiment for each of the parameter constellations. Consider the following example for fun
and note that it performs the required tasks of generating data, applying the method and evaluating the results:
# fun ols_test <- function(b0, b1, b2, n, sigma2, param_x1, param_x2, inc_x2){ # generation of data x1 <- rnorm(n = n, mean = param_x1[1], sd = param_x1[2]) x2 <- rnorm(n = n, mean = param_x2[1], sd = param_x2[2]) e <- rnorm(n, sd = sqrt(sigma2)) y <- b0 + b1*x1 + b2*x2 + e if (inc_x2 == 0){ x2 <- x2 * inc_x2 } # application of method estim <- lm(y ~ x1 + x2) # evaluation of the result for a single repetition and parameter combination out <- list(b0 = estim$coefficients[1], b1 = estim$coefficients[2], b2 = estim$coefficients[3], sigma2 = var(estim$residuals)) return(out) }
The second argument of future_mc()
is repetitions
which should be an integer specifying the number of Monte Carlo iterations. While the third argument param_list
should be a list whose components are named after the parameters of fun
which should vary for the different Monte Carlo Simulation and each component is a vector containing the desired grid values for the parameter. Consider the following example for param_list
and note that its components are named accordingly to the parameters of ols_test
: n
and inc_x2
, respectively:
# param_list param_list_ols <- list(n = c(100, 200, 300), inc_x2 = c(0,1))
future_mc()
takes care of creating all possible parameter combinations of param_list
and runs the Monte Carlo simulation over all of these for all possible combinations. If you don't want to run a Monte Carlo simulation for every possible parameter combination you can alternatively define param_table
with a data.frame
or data.table
containing the parameter combinations you are interested in.The ...
argument can be used to specify further arguments of fun
which are not contained in param_list
. Those arguments will be held fixed for all parameter combinations. In our OLS example those arguments are b0
, b1
, b2
, sigma2
, param_x1
, and param_x2
.
Moreover, there are four formal requirements that fun
and thus ols_test
have to fulfill:
fun
which are present in param_list
have to be scalar values. Note that the arguments of ols_test
which are contained in param_list_ols
: n
and inc_x2
are scalar values. The remaining arguments of ols_test
are allowed to take non-scalar values.fun
has either to be defined inside fun
or given as an argument through the ...
argument.fun
has to be a named list. In our example the names of the returned list are b0
, b1
, b2
, and sigma2
.param_list
need to be different. Moreover, they cannot be params
, repetitions
or setup
as these names are already occupied. Note that b0
, b1
, b2
, and sigma2
are the names of the returned values as well as names of arguments of ols_test
. However, none of those arguments is contained in param_list_ols
. If we would add either of those variables to param_list_ols
we would need to change the name of the returned value for the respective variable.We recommend to even further restrict the return value of fun
to be a named list of scalars. This allows you to use all comfort functions of the tidyMC
package. As you can see, we did that for ols_test
.
The argument parallelisation_plan
allows the user to set a parallelisation plan. While the argument parallelisation_options
allows the user to fine tune functions, such as furrr::future_map()
by furrr::furrr_options()
. The argument seed
for furrr::furrr_options()
can be specified in parallelisation_options
following the formal requirements of its respective documentation. Moreover, the user can also decide not to run the Monte Carlo in parallel by setting parallel = FALSE
. To construct parallelisation_plan
the user needs to provide a list named after the arguments of future::plan
. The main argument strategy
needs to provide the specific type of parallelisation the user would like to use and the number of cores which are used in the function. Some of the options for strategy are: multisession
, multicore
and cluster
. We strongly recommend the user to read the documentation of the future
package for a more detailed instruction of how to set up the different strategies.
As a default (check = TRUE
) future_mc()
runs a quick check by running a single test-iteration for each parameter combination in order to check for possible errors in fun
. If a error occurs the user not only receives the error message but also the parameter combinations for which the error occurred:
set.seed(101) first_mc_ols <- future_mc( fun = ols_test, repetitions = 1000, param_list = param_list_ols, b0 = 1, b1 = 4, b2 = 5, sigma2 = -2, param_x1 = c(0,5), param_x2 = c(0,6), check = TRUE )
The attentive reader might already have noticed that we specified sigma2 = -2
which doesn't make sense, as the variance of the error term cannot be negative. This results in a failed check for all parameter combinations, as this parameter is held fixed for any combination. Once we correct that mistake, we can run our first Monte Carlo simulation:
set.seed(101) first_mc_ols <- future_mc( fun = ols_test, repetitions = 1000, param_list = param_list_ols, b0 = 1, b1 = 4, b2 = 5, sigma2 = 2, # correctly specify sigma2 param_x1 = c(0,5), param_x2 = c(0,6), check = TRUE )
future_mc
returns a list of type mc
and length r length(first_mc_ols)
consisting of a tibble (first_mc_ols$output
) containing the return value of fun
for each iteration and parameter combination. In our case first_mc_ols$output
contains a column for each output b0
, b1
, b2
, and sigma2
, as well as a column for each parameter in param_list_ols
and a column containing the nice_names
of the parameter combinations. Overall the first_mc_ols$output
consists of r nrow(first_mc_ols$output)
rows, i.e., for each parameter combination 1.000 rows:
first_mc_ols$output
If ols_test
would not return a named list of scalars, but a named list of non-scalars, then first_mc_ols$output
would not contain a column for each output, but a single column containing the named list of non-scalars for each iteration and parameter combination.
Moreover, first_mc_ols
returns much other information about the Monte Carlo simulation that can be printed in a dense representation:
first_mc_ols
summary.mc()
If fun
returns a named list of scalars the user can use summary.mc()
to summarize all Monte Carlo results. The first argument of the function is an object of class mc
returned by future_mc()
. The next argument sum_funs
determines which summarizing functions will be used on the simulation results. The functions can be provided for any combination of: parameter combinations resulting from param_list
, and the outputs of fun
. Every specified function can only take one argument, which is the vector (with length repetitions
) for every output. We will present all customization options of sum_funs
in a stepwise manner.
The first option of summarizing the results is given by just providing the mc
object to summary.mc
. In this case, mean()
will be applied to all numeric values and summary()
to all non-numeric data types. When the summarizing functions return one numeric value (like mean()
) the results are twofold:
First, a single scalar result of the function evaluated using the complete output vector is returned in the first element.
Second, a vector with length repetitions
of numeric results from the stepwise calculation of the function's result across the output's vector. We call this resulting vector as the "path" of the summarizing function.
Additionally, to save computation time the parameter which_path
is available to the user who wants to specify for which outputs the "path" should be calculated. The user needs to provide a character vector with the output names'. Moreover, the options "all"
(the default) and "none"
are also available.
For the OLS example since all outputs of ols_test
are numeric, the returned object will be a named nested list composed of four elements named after the nice_names
returned by future_mc()
. Each of this elements are itself lists containing the summarized outputs, i.e. b0
, b1
, b2
, and sigma2
. Lastly each of these are composed by the "path" and scalar result of mean()
.
# Default summary_default <- summary(first_mc_ols) summary_default str(summary_default[[1]])
This nested list structure should give an idea of the reach and flexibility the sum_funs
argument is allowed to have, since the user can specify a function for each element in this list.
For an intermediate level of customization, the user can provide a combination of summarizing functions for every output, which will be used for all parameter combination. In the case of the OLS example, if we want to apply mean()
on all estimated coefficients, but we want to use var()
on the MC results of sigma2
, the sum_funs
should have the following structure:
# summarizing output for each parameter combination with one combination sum_funs_ols <- list(b0 = mean, b1 = mean , b2 = mean, sigma2 = var)
Moreover, the user can specify any function provided it takes the output vector as only available argument.
Lastly for the last level of customization, a nested list named after the nice_names
where every element follows the structure of the last example (components named after the outputs and each component is a function) can be specified. We present an example for this:
quantile_sum <- function(x) quantile(x, probs = 0.75) # summarizing output differently for different parameter combinations sum_funs2 <- list( list(b0 = quantile_sum, b1 = min, b2 = min, sigma2 = mean), list(b0 = mean, b1 = quantile_sum, b2 = mean, sigma2 = mean), list(b0 = median, b1 = median, b2 = median, sigma2 = mean), list(b0 = max, b1 = max, b2 = max, sigma2 = mean), list(b0 = min, b1 = min, b2 = min, sigma2 = quantile_sum), list(b0 = mean, b1 = mean, b2 = quantile_sum, sigma2 = quantile_sum) ) names(sum_funs2) <- first_mc_ols$nice_names summary_out_param_spec <- summary(first_mc_ols, sum_funs = sum_funs2)
We would like to reiterate that the provided summary functions are not restricted regarding the complexity of their return value. However, the path of the summarized output over all simulation repetitions is only returned if the provided summary functions return a single numeric value. Thus, the following comfort functions plot.summary.mc()
and tidy_mc_latex()
will only work in this specific case.
plot.mc()
and plot.summary.mc()
If fun
returns a named list of scalars the user can use plot.mc()
to generate a list of objects of class gg
and ggplot2
for all Monte Carlo results. The first argument of the function is an object of class mc
returned by future_mc()
. Using the argument plot
the user can indicate whether the generated plots should be printed immediately or only returned as a list. The list will contain one plot for each output of fun
comparing the results of the different simulation setups. In general, plot.mc()
generates density plots for numeric outputs and bar plots for non-numeric outputs. In our example a plot for b0
, b1
, b2
, and sigma2
will be returned in a list of length four and as b0
, b1
, b2
, and sigma2
are all numeric outputs plot.mc()
will return density plots for each of those:
mc_ols_plot <- plot(first_mc_ols, plot = FALSE) names(mc_ols_plot)
As the single list elements are of class gg
and ggplot2
, we can easily customize and extend the single plots using familiar ggplot2
commands:
mc_ols_plot$b1 + ggplot2::geom_vline(xintercept = 4, col = "red") + ggplot2::theme_minimal()
When creating the plots the user can also subset the setups which he/she would like to see in the plots using the first_mc_ols$nice_names
in the function argument which_setup
, or a named list in parameter_comb
. The single components of the list have to be named after the parameters specified in param_list
and contain vectors specifying the values of the parameters to filter by. In the ols example we can filter by the parameters n
and inc_x2
:
# subsetting by nice_names mc_ols_plot_subset1 <- plot(first_mc_ols, plot = FALSE, which_setup = first_mc_ols$nice_names[4:6]) #subsetting by parameter values mc_ols_plot_subset2 <- plot(first_mc_ols, plot = FALSE, parameter_comb = list(inc_x2 = 1)) mc_ols_plot_subset1$sigma2
Thus, if the user wants distinct plots for every parameter combination, one needs to subset the plot for any single setup in first_mc_ols$nice_names
.
Finally, you can also plot the simulation results for several parameter combination in one single plot by specifying the argument join
with the respective first_mc_ols$nice_names
:
mc_ols_plot_joint <- plot(first_mc_ols, plot = FALSE, join = first_mc_ols$nice_names) mc_ols_plot_joint$b2
Please be aware that the only one of the three arguments which_setup
, parameter_comb
, and join
can be specified at the same time.
If the provided summary functions in summary.mc()
return a single numeric value and thus a path of the summarized output over all simulation repetitions is returned, the user can use plot.summary.mc()
to plot those paths. The first argument of the function is an object of class summary.mc
returned by summary.mc()
. Just as plot.mc()
, plot.summary.mc()
returns a list of objects of class gg
and ggplot2
. The list will contain one line plot for each output of fun
displaying the paths of the results of the different simulation setups. The arguments plot
, which_setup
, parameter_comb
, and join
can be specified the same way as for plot.mc
:
sum_mc_plot <- plot(summary_default, plot = FALSE) sum_mc_plot$b1 + ggplot2::geom_vline(xintercept = 100, col = "red") + ggplot2::theme(axis.text.x = element_text(angle = 45, hjust = 0.1, vjust = 0.2)) sum_mc_plot_subset1 <- plot(summary_default, plot = FALSE, which_setup = first_mc_ols$nice_names[4:6]) sum_mc_plot_subset2 <- plot(summary_default, plot = FALSE, parameter_comb = list(inc_x2 = 1)) sum_mc_plot_subset2$b1 sum_mc_plot_joint <- plot(summary_default, plot = FALSE, join = first_mc_ols$nice_names[4:6]) sum_mc_plot_joint$b1
tidy_mc_latex()
Using tidy_mc_latex
the user can present the results from future_mc
directly into a LaTeX
document using all the benefits from the kableExtra
package. The first and main argument x
needed by tidy_mc_latex
is a summary.mc
object obtained from summary.mc()
. To present the results in a comprehensive manner the function requires that all summarized outputs in summary.mc
be scalar numeric results for all parameter combinations. In case, the summarizing function returns more than one argument then this will be presented in the table as an NA
value. The second argument of the function is repetitions_set
which allows the user to see the certain values of the "path" of the summarized results of fun
. To illustrate this we use the MC results for the OLS example:
tidy_mc_latex( x = summary(first_mc_ols), repetitions_set = c(10, 1000) ) %>% print()
The resulting table is composed of two panels, which corresponds to the length of repetitions_set
. In them the columns correspond to the results of the summarizing functions for b0
, b1
, b2
, and sigma2
, and the rows correspond to specific combinations of the parameters provided in parameter_list
. The footnote in the table shows the number of repetitions and the total parameter combinations provided to future_mc
.
Moreover, the next three arguments in tidy_mc_latex
are comfort options to select which results of summary.mc
depending on the parameter combinations will be presented in the table. On one hand, The argument which_setup
allows the user to make use of the nice_names
of the parameter combinations in the returned object by future_mc()
to subset the rows in the table. On the other hand, the argument parameter_comb
is used to directly filter the parameters by their values. This argument requires a named list, containing vector or scalar values of all parameters to be filtered from. The user must only provide one of this arguments at a time. We show how to make use of both parameters to subset the rows of table for $n = 100$ and $inc_{x2}=1$:
tidy_mc_latex( x = summary(first_mc_ols), repetitions_set = c(10, 1000), which_setup = first_mc_ols$nice_names[1]) %>% print() tidy_mc_latex( x = summary(first_mc_ols), repetitions_set = c(10, 1000), parameter_comb = list(n = 100, inc_x2 = 1)) %>% print()
The user can also subset the outputs of the original function (columns in the table) using the parameter which_out
. This is done using a character vector with the names of the outputs, e.g. to only show the columns for b0
and sigma2
the user needs to set which_out = c("b0", "sigma2")
.
Lastly, by providing a named list to the argument kable_options
, the user can change all arguments of the underlying function kable::kbl()
. The names of the list have to be equal to the names of the arguments of the function and the contents of every element also has to fulfill its requirements. We provide an example of how this list should be constructed, but for optimal usage we strongly recommend the user to see the documentation of the kable::kbl()
function.
To allow for further customization the returned object by tidy_mc_latex
is of class knitr_kable
, therefore the user can utilize most functions from the kableExtra
package in the standard tidyverse manner. For example:
tidy_mc_latex(summary(first_mc_ols), repetitions_set = c(10, 1000), kable_options = list( col.names = c("Number of observations", "$x_2$ included or not", "$\\beta_0$", "$\\beta_1$", "$\\beta_2$", "$s^2$"), caption = "Ommited variable bias MC results" ) ) %>% kableExtra::kable_styling(latex_options = "HOLD_position") %>% print()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.