View source: R/runSimulation.R
runSimulation (R Documentation)
This function runs a Monte Carlo simulation study given a set of predefined simulation functions,
design conditions, and number of replications. Results can be saved as temporary files in case of
interruptions and may be restored by re-running runSimulation, provided that the respective temp
file can be found in the working directory. runSimulation supports parallel and cluster computing
(with the parallel and future packages; see also runArraySimulation for submitting array jobs to
HPC clusters), global and local debugging, and error handling (including fail-safe stopping when
functions fail too often, even across nodes), and it provides optional bootstrap estimates of the
sampling variability as well as automatic tracking of error and warning messages with their
associated .Random.seed states.
For convenience, all functions available in the R workspace are exported across all nodes
so that they are more easily accessible (however, other R objects are not, and therefore
must be passed to the fixed_objects input to become available across nodes).
runSimulation(
  design,
  replications,
  generate,
  analyse,
  summarise,
  fixed_objects = NULL,
  packages = NULL,
  filename = NULL,
  debug = "none",
  load_seed = NULL,
  save = any(replications > 2),
  store_results = TRUE,
  save_results = FALSE,
  parallel = FALSE,
  ncores = parallelly::availableCores(omit = 1L),
  cl = NULL,
  notification = "none",
  beep = FALSE,
  sound = 1,
  CI = 0.95,
  seed = NULL,
  boot_method = "none",
  boot_draws = 1000L,
  max_errors = 50L,
  resume = TRUE,
  save_details = list(),
  control = list(),
  progress = TRUE,
  verbose = TRUE
)
## S3 method for class 'SimDesign'
summary(object, ...)
## S3 method for class 'SimDesign'
print(x, list2char = TRUE, ...)
design: a tibble or data.frame containing the simulation conditions to be studied, where each row represents a unique condition and each column a simulation factor; typically constructed with createDesign()

replications: number of independent replications to perform per condition (i.e., each row in design)

generate: user-defined data and parameter generating function (or named list of functions). See Generate for details

analyse: user-defined analysis function (or named list of functions) that acts on the data generated from generate. See Analyse for details

summarise: optional (but strongly recommended) user-defined summary function that post-processes the results from analyse across all replications (see Summarise). Note that unlike the Generate and Analyse steps, the Summarise portion is not as important to perfectly organize, as the results can be summarised later on using the built-in reSummarise() function. Omitting this function will return a tibble containing the design conditions alongside the raw results from analyse

fixed_objects: (optional) an object (usually a named list) containing additional user-defined objects that should remain fixed across conditions; objects other than functions must be passed here to become available across nodes (see Details)

packages: a character vector of external packages to be used during the simulation; the declared packages are loaded via library() so that their non-standard functions are available across nodes

filename: (optional) the name of the file to which the final simulation results should be saved (via saveRDS()). Note that if the same file name already exists in the working directory at the time of saving then a new file name will be generated instead and a warning will be thrown. This helps to avoid accidentally overwriting existing files. Default is NULL

debug: a string indicating where to initiate a browser() call for debugging purposes. Default is 'none' (no debugging). If set to 'generate', 'analyse', or 'summarise' then the debugger is initiated within the corresponding user-defined function. For debugging specific rows in the design object, append the row number after a dash (e.g., 'analyse-2' debugs the analyse function only for the second design row). Finally, users may place browser() calls directly within their defined functions to jump to more specific locations

load_seed: used to replicate an exact simulation state, which is primarily useful for debugging purposes. Input can be a character object indicating which stored seed file to load (when seeds were previously saved to the hard-drive) or a .Random.seed state directly (e.g., a column extracted via SimExtract(res, 'error_seeds')). Default is NULL

save: logical; save the temporary simulation state to the hard-drive? This is useful for simulations which require an extended amount of time, though for shorter simulations it can be disabled to slightly improve computational efficiency. When enabled, a temporary file is written to the working directory as the simulation progresses. To recover your simulation at the last known location (having patched the issues in the previous execution code) simply re-run the code you used to initially define the simulation and the external file will automatically be detected and read in. Default is any(replications > 2), i.e., enabled whenever more than two replications are requested

store_results: logical; store the complete tables of simulation results in the returned object? Default is TRUE. To extract these results pass the returned object to either SimResults() or SimExtract()

save_results: logical; save the results returned from analyse to the hard-drive? Default is FALSE. WARNING: saving results to your hard-drive can fill up space very quickly for larger simulations. Be sure to test this option using a smaller number of replications before the full Monte Carlo simulation is performed. See also save_details for controlling how and where the files are stored

parallel: logical; use parallel processing from the parallel package to distribute the replications across cores? Alternatively, the character string 'future' can be supplied, in which case the future framework is used and the computing plan is declared via future::plan(). Default is FALSE

ncores: number of cores to be used in parallel execution (ignored when the future framework is used, where resources are instead declared via plan()). Default is parallelly::availableCores(omit = 1L), i.e., one less than the number of available cores

cl: cluster object defined by parallel::makeCluster(), allowing user-defined clusters such as several computers linked over SSH (see the examples). Default is NULL

notification: an optional character vector input that can be used to send Pushbullet notifications from a configured computer. This reports information such as the total execution time, the condition completed, and error/warning messages recorded. This argument assumes that users have already A) registered for a Pushbullet account, B) installed the application on their mobile device and computer, and C) created the associated JSON configuration file required by the RPushbullet package (loaded, for example, via library(RPushbullet) to read in the default JSON file; see the examples). Possible inputs include 'none' (the default) and 'complete' (notify when the full simulation has finished)

beep: logical; play a sound (via the beepr package) when the simulation completes? Default is FALSE

sound: sound argument passed to beepr::beep() when beep = TRUE. Default is 1

CI: bootstrap confidence interval level (default is 95%)

seed: a vector or list of integers to be used for reproducibility. The length of the vector must be equal to the number of rows in design. Default is NULL

boot_method: method for computing non-parametric bootstrap confidence intervals for the respective meta-statistics computed by the Summarise function. Default is 'none' (no bootstrapping); see the examples for use of the 'basic' method

boot_draws: number of non-parametric bootstrap draws to sample when boot_method is used. Default is 1000

max_errors: the simulation will terminate a given condition when more than this number of consecutive errors are thrown, causing the simulation to continue to the next unique design condition. Default is 50

resume: logical; if a temporary simulation state file is detected in the working directory should the simulation resume from where it previously left off? Default is TRUE

save_details: a list pertaining to information regarding how and where files should be saved when the save or save_results flags are in use. Default is an empty list()

control: a list of extra flags for controlling less commonly used features. Default is an empty list()

progress: logical; display a progress bar for each simulation condition? When using the future framework, progress is instead reported through the progressr package (see the examples). Default is TRUE

verbose: logical; print messages to the R console? Default is TRUE

object: SimDesign object returned from runSimulation()

...: additional arguments

x: SimDesign object returned from runSimulation()

list2char: logical; for the print method, convert list-type columns to character form for cleaner printing? Default is TRUE
For an in-depth tutorial of the package please refer to Chalmers and Adkins (2020;
doi:10.20982/tqmp.16.4.p248).
For an earlier didactic presentation of the package refer to Sigal and Chalmers
(2016; doi:10.1080/10691898.2016.1246953). Finally, see the associated
wiki on GitHub (https://github.com/philchalmers/SimDesign/wiki)
for tutorial material, examples, and applications of SimDesign to real-world
simulation experiments, as well as the various vignette files associated with the package.
The strategy for organizing the Monte Carlo simulation work-flow is to:

1. Define a suitable Design object (a tibble or data.frame) containing fixed conditional
information about the Monte Carlo simulation. Each row of this design object pertains
to a unique simulation condition to study, while each column codes the simulation factor under
investigation (e.g., sample size, distribution type, etc.). This is often expedited by using the
createDesign function, and if necessary the argument subset can be used to remove redundant or
non-applicable rows (a short sketch of this step appears after the list below).

2. Define the three step functions to generate the data (Generate; see also
https://CRAN.R-project.org/view=Distributions for a list of distributions in R),
analyse the generated data by computing the respective parameter estimates, detection rates,
etc. (Analyse), and finally summarise the results across the total number of replications
(Summarise).

3. Pass the design object and the three defined R functions to runSimulation,
and declare the number of replications to perform with the replications input.
This function will return a suitable tibble object with the complete simulation results and
execution details.

4. Analyze the output from runSimulation, possibly using ANOVA techniques (SimAnova) and
generating suitable plots and tables.
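To illustrate the Design step in isolation, here is a small sketch (the factor names are purely
illustrative) that also uses the subset argument mentioned above to drop a non-applicable row;
the expression form of subset is assumed to follow createDesign's non-standard evaluation:

library(SimDesign)

# fully-crossed design; subset drops a cell that should not be studied
Design <- createDesign(sample_size = c(30, 90),
                       effect_size = c(0, 0.5),
                       subset = !(sample_size == 30 & effect_size == 0.5))
Design   # each remaining row is one unique simulation condition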
Expressing the above more succinctly, the functions to be called have the following form, with the exact functional arguments listed:
Design <- createDesign(...)
Generate <- function(condition, fixed_objects) {...}
Analyse <- function(condition, dat, fixed_objects) {...}
Summarise <- function(condition, results, fixed_objects) {...}
res <- runSimulation(design=Design, replications, generate=Generate,
                     analyse=Analyse, summarise=Summarise)
The condition object above represents a single row from the design object, indicating
a unique Monte Carlo simulation condition. The condition object also contains two
additional elements to help track the simulation's state: an ID variable, indicating
the respective row number in the design object, and a REPLICATION element
indicating the replication iteration number (an integer value between 1 and replications).
This setup allows users to easily locate the r-th replication (e.g., REPLICATION == 500)
within the j-th row of the simulation design (e.g., ID == 2). The
REPLICATION input is also useful when temporarily saving files to the hard-drive
while calling external command line utilities (see examples on the wiki).
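As a rough sketch of how these tracking elements might be used, a Generate definition could
build unique temporary file names for such external calls (the design column N and the
file-name pattern below are purely illustrative):

Generate <- function(condition, fixed_objects) {
    # unique file name for this design row and replication (illustrative pattern)
    tmpfile <- sprintf('data_row%i_rep%i.csv', condition$ID, condition$REPLICATION)
    dat <- rnorm(condition$N)   # assumes the design contains a column named N
    # write.csv(dat, tmpfile)   # e.g., hand the file off to a command-line utility
    dat
}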
For a template-based version of the work-flow, which is often useful when initially
defining a simulation, use the SimFunctions
function. This
function will write a template simulation
to one/two files so that modifying the required functions and objects can begin immediately.
This means that users can focus on their Monte Carlo simulation details right away rather
than worrying about the repetitive administrative code-work required to organize the simulation's
execution flow.
Finally, examples, presentation files, and tutorials can be found on the package wiki located at https://github.com/philchalmers/SimDesign/wiki.
a tibble from the dplyr package (also of class 'SimDesign')
with the original design conditions in the left-most columns,
simulation results in the middle columns, and additional information in the right-most columns
(see below).

The right-most column information for each condition is:

REPLICATIONS: the number of Monte Carlo replications performed

SIM_TIME: how long (in seconds) it took to complete all the Monte Carlo replications for each
respective design condition

RAM_USED: the amount of RAM that was in use at the time of completing each simulation condition

COMPLETED: the date on which the given simulation condition completed

SEED: the integer values in the seed argument (if applicable; not relevant when the
L'Ecuyer-CMRG method is used)

ERRORS and WARNINGS: counts of the error and warning messages that were caught
(if no errors/warnings were observed these columns are omitted)

Note that to extract the specific error and warning messages see SimExtract. Finally,
if boot_method was a valid input other than 'none' then the final right-most columns will contain
labels of the form BOOT_ followed by the name of the associated meta-statistic defined in
summarise() and the bootstrapped confidence interval location for that meta-statistic.
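As a brief sketch, these right-most columns and any stored messages can be inspected directly
from the returned object (res below is assumed to be a completed runSimulation() result):

res$SIM_TIME                         # seconds required per design condition
res$REPLICATIONS                     # replications completed per condition
SimExtract(res, what = 'errors')     # tabled error messages (if any were caught)
SimExtract(res, what = 'warnings')   # tabled warning messages (if any were caught)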
To conserve RAM, temporary objects (such as data generated across conditions and replications)
are discarded; however, these can be saved to the hard-disk by passing the appropriate flags.
For longer simulations it is recommended to use the save_results flag to write the
analysis results to the hard-drive.
The store_seeds or save_seeds options can be invoked to save R's .Random.seed
state and thereby allow complete reproducibility of each replication within each condition.
These individual .Random.seed terms can then be read in with the load_seed
input to reproduce the exact simulation state at any given replication.
Most often, though, saving the complete list of seeds is unnecessary because problematic seeds are
automatically stored in the final simulation object to allow easier replication
of potentially problematic errors (these can be extracted
using SimExtract(res, 'error_seeds') and passed to the load_seed argument). Finally,
providing a vector of seeds via the seed argument is also possible to ensure
that each simulation condition is macro reproducible under the single/multi-core method selected.
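For instance, a hedged sketch of this debugging workflow (assuming at least one error was
recorded in a previous run stored in res):

# .Random.seed states associated with caught errors (one column per error)
error_seeds <- SimExtract(res, what = 'error_seeds')

# reproduce the exact replication that triggered the first error; the design row
# associated with that error may need to be selected manually, and pairing
# load_seed with debug allows stepping through the failing code
runSimulation(design = Design, replications = 1, load_seed = error_seeds[, 1],
              generate = Generate, analyse = Analyse, summarise = Summarise,
              debug = 'analyse')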
Finally, when the Monte Carlo simulation is complete
it is recommended to write the results to a hard-drive for safekeeping, particularly with the
filename argument provided (for reasons that are more obvious in the parallel computation
descriptions below). Using the filename argument is safer than using, for instance,
saveRDS directly because files will never accidentally be overwritten;
instead, a new file name is created when a conflict arises. This type of implementation safety
is prevalent in many locations in the package to help avoid unrecoverable (yet surprisingly
common) mistakes during the process of designing and executing Monte Carlo simulations.
In the event of a computer crash, power outage, etc., if save = TRUE was used (the default)
then the original code used to execute runSimulation() need only be re-run to resume
the simulation. The saved temp file will be read into the function automatically, and the
simulation will continue from the condition where it left off before the simulation
state was terminated. If users wish to remove this temporary
simulation state entirely so as to start anew, simply run SimClean(temp = TRUE)
in the R console to remove any previously saved temporary objects.
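A minimal sketch of this save/resume behaviour, using the same object names as the examples below:

# temporary state is written to the working directory while the simulation runs
res <- runSimulation(design = Design, replications = 1000, save = TRUE,
                     filename = 'mysim',
                     generate = Generate, analyse = Analyse, summarise = Summarise)

# if the session is interrupted, re-running the identical call detects the
# temporary file automatically and resumes at the last completed condition

# discard the temporary state entirely to start anew
SimClean(temp = TRUE)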
When running simulations in parallel (either with parallel = TRUE
or when using the future approach with a plan() other than sequential),
R objects defined in the global environment will generally not be visible across nodes.
Hence, you may see errors such as Error: object 'something' not found if you try to use
an object that is defined in the workspace but is not passed to runSimulation.
To avoid this type of error, simply pass additional objects to the
fixed_objects input (usually it is convenient to supply a named list of these objects).
Fortunately, however, custom functions defined in the global environment are exported across
nodes automatically. This makes it convenient when writing code because custom functions will
always be available across nodes if they are visible in the R workspace. As well, note the
packages input, which declares packages that must be loaded via library() in order to make
specific non-standard R functions available across nodes.
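Below is a short sketch of these two inputs in practice (the object names and the MASS package
are purely illustrative):

# non-function objects must be passed via fixed_objects to reach the workers
fo <- list(mu = 10, sigma = 5)

Generate <- function(condition, fixed_objects) {
    # Attach(fixed_objects) could alternatively expose mu and sigma directly
    rnorm(condition$N, mean = fixed_objects$mu, sd = fixed_objects$sigma)
}

res <- runSimulation(design = Design, replications = 1000, parallel = TRUE,
                     fixed_objects = fo, packages = 'MASS',   # MASS is illustrative
                     generate = Generate, analyse = Analyse, summarise = Summarise)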
Phil Chalmers rphilip.chalmers@gmail.com
Chalmers, R. P., & Adkins, M. C. (2020). Writing Effective and Reliable Monte Carlo Simulations
with the SimDesign Package. The Quantitative Methods for Psychology, 16(4), 248-280.
doi:10.20982/tqmp.16.4.p248

Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte
Carlo simulation. Journal of Statistics Education, 24(3), 136-156.
doi:10.1080/10691898.2016.1246953
SimFunctions, createDesign, Generate, Analyse, Summarise, SimExtract, reSummarise, SimClean,
SimAnova, SimResults, SimCollect, Attach, AnalyseIf, SimShiny, manageWarnings, runArraySimulation
#-------------------------------------------------------------------------------
# Example 1: Sampling distribution of mean
# This example demonstrates some of the simpler uses of SimDesign,
# particularly for classroom settings. The only factor varied in this simulation
# is sample size.
# skeleton functions to be saved and edited
SimFunctions()
#### Step 1 --- Define your conditions under study and create design data.frame
Design <- createDesign(N = c(10, 20, 30))
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions
# help(Generate)
Generate <- function(condition, fixed_objects) {
dat <- with(condition, rnorm(N, 10, 5)) # normal data with mean 10 and SD 5
dat
}
# help(Analyse)
Analyse <- function(condition, dat, fixed_objects) {
ret <- c(mean=mean(dat)) # mean of the sample data vector
ret
}
# help(Summarise)
Summarise <- function(condition, results, fixed_objects) {
# mean and SD summary of the sample means
ret <- c(mu=mean(results$mean), SE=sd(results$mean))
ret
}
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design
# run the simulation in testing mode (replications = 2)
Final <- runSimulation(design=Design, replications=2,
generate=Generate, analyse=Analyse, summarise=Summarise)
Final
SimResults(Final)
# reproduce exact simulation
Final_rep <- runSimulation(design=Design, replications=2, seed=Final$SEED,
generate=Generate, analyse=Analyse, summarise=Summarise)
Final_rep
SimResults(Final_rep)
## Not run:
# run with more standard number of replications
Final <- runSimulation(design=Design, replications=1000,
generate=Generate, analyse=Analyse, summarise=Summarise)
Final
SimResults(Final)
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Extras
# compare SE estimates to the true SEs from the formula sigma/sqrt(N)
5 / sqrt(Design$N)
# To store the results from the analyse function either
# a) omit a definition of summarise() to return all results,
# b) use store_results = TRUE (default) to store results internally and later
# extract with SimResults(), or
# c) pass save_results = TRUE to runSimulation() and read the results in with SimResults()
#
# Note that method c) should be adopted for larger simulations, particularly
# if RAM storage could be an issue and error/warning message information is important.
# a) approach
res <- runSimulation(design=Design, replications=100,
generate=Generate, analyse=Analyse)
res
# b) approach (store_results = TRUE by default)
res <- runSimulation(design=Design, replications=100,
generate=Generate, analyse=Analyse, summarise=Summarise)
res
SimResults(res)
# c) approach
Final <- runSimulation(design=Design, replications=100, save_results=TRUE,
generate=Generate, analyse=Analyse, summarise=Summarise)
# read-in all conditions (can be memory heavy)
res <- SimResults(Final)
res
head(res[[1]]$results)
# just first condition
res <- SimResults(Final, which=1)
head(res$results)
dplyr::tibble(res$condition, res$results)
# obtain empirical bootstrapped CIs for the meta-statistics during the initial run
res <- runSimulation(design=Design, replications=1000, boot_method = 'basic',
generate=Generate, analyse=Analyse, summarise=Summarise)
res
# alternative: bootstrapped CIs computed after the simulation was completed, using
# the saved results via reSummarise() (necessarily requires save_results = TRUE).
# Default directory saved to:
dirname <- paste0('SimDesign-results_', unname(Sys.info()['nodename']), "/")
res <- reSummarise(summarise=Summarise, dir=dirname, boot_method = 'basic')
res
# remove the saved results from the hard-drive if you no longer want them
SimClean(results = TRUE)
## End(Not run)
#-------------------------------------------------------------------------------
# Example 2: t-test and Welch test when varying sample size, group sizes, and SDs
# skeleton functions to be saved and edited
SimFunctions()
## Not run:
# in real-world simulations it's often better/easier to save
# these functions directly to your hard-drive with
SimFunctions('my-simulation')
## End(Not run)
#### Step 1 --- Define your conditions under study and create design data.frame
Design <- createDesign(sample_size = c(30, 60, 90, 120),
group_size_ratio = c(1, 4, 8),
standard_deviation_ratio = c(.5, 1, 2))
Design
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions
Generate <- function(condition, fixed_objects) {
N <- condition$sample_size # could use Attach() to make objects available
grs <- condition$group_size_ratio
sd <- condition$standard_deviation_ratio
if(grs < 1){
N2 <- N / (1/grs + 1)
N1 <- N - N2
} else {
N1 <- N / (grs + 1)
N2 <- N - N1
}
group1 <- rnorm(N1)
group2 <- rnorm(N2, sd=sd)
dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
dat
}
Analyse <- function(condition, dat, fixed_objects) {
welch <- t.test(DV ~ group, dat)$p.value
independent <- t.test(DV ~ group, dat, var.equal=TRUE)$p.value
# In this function the p-values from the two t-tests are returned;
# be sure to name each element for future reference
ret <- nc(welch, independent)
ret
}
Summarise <- function(condition, results, fixed_objects) {
#find results of interest here (e.g., alpha < .1, .05, .01)
ret <- EDR(results, alpha = .05)
ret
}
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design
# first, test to see if it works
res <- runSimulation(design=Design, replications=2,
generate=Generate, analyse=Analyse, summarise=Summarise)
res
## Not run:
# complete run with 1000 replications per condition
res <- runSimulation(design=Design, replications=1000, parallel=TRUE,
generate=Generate, analyse=Analyse, summarise=Summarise)
res
View(res)
## save final results to a file upon completion, and play a beep when done
runSimulation(design=Design, replications=1000, parallel=TRUE, filename = 'mysim',
generate=Generate, analyse=Analyse, summarise=Summarise, beep=TRUE)
## same as above, but send a notification via Pushbullet upon completion
library(RPushbullet) # read-in default JSON file
runSimulation(design=Design, replications=1000, parallel=TRUE, filename = 'mysim',
generate=Generate, analyse=Analyse, summarise=Summarise,
notification = 'complete')
## Submit as RStudio job (requires job package and active RStudio session)
job::job({
res <- runSimulation(design=Design, replications=100,
generate=Generate, analyse=Analyse, summarise=Summarise)
}, title='t-test simulation')
res # object res returned to console when completed
## Debug the generate function. See ?browser for help on debugging
## Type help to see available commands (e.g., n, c, where, ...),
## ls() to see what has been defined, and type Q to quit the debugger
runSimulation(design=Design, replications=1000,
generate=Generate, analyse=Analyse, summarise=Summarise,
parallel=TRUE, debug='generate')
## Alternatively, place a browser() within the desired function line to
## jump to a specific location
Summarise <- function(condition, results, fixed_objects) {
#find results of interest here (e.g., alpha < .1, .05, .01)
browser()
ret <- EDR(results, alpha = .05)
ret
}
## The following debugs the analyse function for the
## second row of the Design input
runSimulation(design=Design, replications=1000,
generate=Generate, analyse=Analyse, summarise=Summarise,
parallel=TRUE, debug='analyse-2')
####################################
## EXTRA: To run the simulation on a user-defined cluster, use the following setup (not run)
## Network linked via ssh (a two-way ssh key-paired connection must be
## possible between the master and worker nodes)
##
## Define IP addresses, including primary IP
primary <- '192.168.2.20'
IPs <- list(
list(host=primary, user='phil', ncore=8),
list(host='192.168.2.17', user='phil', ncore=8)
)
spec <- lapply(IPs, function(IP)
rep(list(list(host=IP$host, user=IP$user)), IP$ncore))
spec <- unlist(spec, recursive=FALSE)
cl <- parallel::makeCluster(type='PSOCK', master=primary, spec=spec)
res <- runSimulation(design=Design, replications=1000, parallel = TRUE,
generate=Generate, analyse=Analyse, summarise=Summarise, cl=cl)
## Using parallel='future' to allow the future framework to be used instead
library(future) # future structure to be used internally
plan(multisession) # specify different plan (default is sequential)
res <- runSimulation(design=Design, replications=100, parallel='future',
generate=Generate, analyse=Analyse, summarise=Summarise)
head(res)
# The progressr package is used for progress reporting with futures. To redefine
# use progressr::handlers() (see below)
library(progressr)
with_progress(res <- runSimulation(design=Design, replications=100, parallel='future',
generate=Generate, analyse=Analyse, summarise=Summarise))
head(res)
# re-define progressr's bar (below requires cli)
handlers(handler_pbcol(
adjust = 1.0,
complete = function(s) cli::bg_red(cli::col_black(s)),
incomplete = function(s) cli::bg_cyan(cli::col_black(s))
))
with_progress(res <- runSimulation(design=Design, replications=100, parallel='future',
generate=Generate, analyse=Analyse, summarise=Summarise))
# reset future computing plan when complete (good practice)
plan(sequential)
####################################
###### Post-analysis: Analyze the results via functions like lm() or SimAnova(), and create
###### tables (dplyr) or plots (ggplot2) to help visualize the results.
###### This is where you get to be a data analyst!
library(dplyr)
res %>% summarise(mean(welch), mean(independent))
res %>% group_by(standard_deviation_ratio, group_size_ratio) %>%
summarise(mean(welch), mean(independent))
# quick ANOVA analysis method with all two-way interactions
SimAnova( ~ (sample_size + group_size_ratio + standard_deviation_ratio)^2, res,
rates = TRUE)
# or more specific ANOVAs
SimAnova(independent ~ (group_size_ratio + standard_deviation_ratio)^2,
res, rates = TRUE)
# make some plots
library(ggplot2)
library(tidyr)
dd <- res %>%
select(group_size_ratio, standard_deviation_ratio, welch, independent) %>%
pivot_longer(cols=c('welch', 'independent'), names_to = 'stats')
dd
ggplot(dd, aes(factor(group_size_ratio), value)) + geom_boxplot() +
geom_abline(intercept=0.05, slope=0, col = 'red') +
geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
facet_wrap(~stats)
ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
geom_boxplot() + geom_abline(intercept=0.05, slope=0, col = 'red') +
geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
facet_grid(stats~standard_deviation_ratio) +
theme(legend.position = 'none')
## End(Not run)