runArraySimulation: Run a Monte Carlo simulation using array job submissions per...

runArraySimulation    R Documentation

Run a Monte Carlo simulation using array job submissions per condition

Description

This function has the same purpose as runSimulation; however, rather than evaluating every row in a design object (potentially with parallel computing architecture), it evaluates a single, independent row condition per call. This is mainly useful when distributing jobs to HPC clusters where a job array number is available (e.g., via SLURM) and where the simulation results must be saved to independent files as they complete. expandDesign is useful for distributing replications across different jobs, while gen_seeds is required to ensure high-quality random number generation across the array submissions. See the associated vignette for a brief tutorial of this setup.
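The bookkeeping behind splitting replications across jobs can be sketched in base R. This is only an illustration of the arithmetic, not the expandDesign implementation: the data frame stands in for a design object, and the assumption that repeated rows sit next to each other is made purely for this sketch.

```r
# Stand-in for a design with three conditions (plain data frame, an assumption)
Design <- data.frame(N = c(10, 20, 30))

times      <- 5    # number of array jobs per condition
total_reps <- 50   # replications wanted per condition overall

# Repeat each row 'times' times; adjacent ordering is assumed for illustration
expanded <- Design[rep(seq_len(nrow(Design)), each = times), , drop = FALSE]
rownames(expanded) <- NULL

# Each array job then runs only a fraction of the replications
reps_per_job <- total_reps / times

n_jobs <- nrow(expanded)   # one array job per row of the expanded design
```

With these values the array would cover `n_jobs` rows, each running `reps_per_job` replications, so the total number of replications per condition is unchanged.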

Usage

runArraySimulation(
  design,
  ...,
  replications,
  iseed,
  filename,
  dirname = NULL,
  arrayID = getArrayID(),
  filename_suffix = paste0("-", arrayID),
  addArrayInfo = TRUE,
  save_details = list(),
  control = list()
)

Arguments

design

design object containing simulation conditions on a per row basis. This function is designed to submit each row as an independent job on an HPC cluster. See runSimulation for further details

...

additional arguments to be passed to runSimulation

replications

number of independent replications to perform per condition (i.e., each row in design). See runSimulation for further details

iseed

initial seed to be passed to gen_seeds's argument of the same name, along with the supplied arrayID

filename

file name to save simulation files to (does not need to specify extension). However, the array ID will be appended to each filename (see filename_suffix). For example, if filename = 'mysim' then files stored will be 'mysim-1.rds', 'mysim-2.rds', and so on for each row in design

dirname

directory to save the files associated with filename to. If omitted the files will be stored in the same working directory where the script was submitted

arrayID

array identifier from the scheduler. Must be a number between 1 and nrow(design). If not specified then getArrayID will be called automatically, which assumes the environment variables are available according to the SLURM scheduler

filename_suffix

suffix to add to the filename; by default, '-' followed by the arrayID

addArrayInfo

logical; should the array ID and original design row number be added to the SimExtract(..., what='results') output?

save_details

optional list of extra file saving details. See runSimulation

control

control list passed to runSimulation. In addition to the original control elements, two further arguments are available: max_time and max_RAM, both of which are specified as character vectors with one element.

max_time specifies the maximum time allowed for a single simulation condition to execute. This is primarily useful when the HPC cluster will time out after some known elapsed time. Following the SBATCH specifications, acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". For example, max_time = "60" indicates a maximum time of 60 minutes, max_time = "03:00:00" a maximum time of 3 hours, max_time = "4-12" a maximum of 4 days and 12 hours, and max_time = "2-02:30:00" a maximum of 2 days, 2 hours and 30 minutes. In general, this input should be set to roughly 80-90% of the true termination time so that any evaluations completed before the cluster job is terminated can be saved. Default applies no time limit

Similarly, max_RAM controls the (approximate) maximum size that the simulation storage objects can grow to before RAM becomes an issue. This can be specified in megabytes (MB), gigabytes (GB), or terabytes (TB). For example, max_RAM = "4GB" indicates that if the simulation storage objects are larger than 4GB then the workflow will terminate early, returning only the successful results up to that point. This is useful for larger HPC cluster jobs with RAM constraints that could otherwise terminate abruptly. As a rule of thumb this should be set to around 90% of the RAM available. Default applies no memory limit
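To make the max_time and max_RAM formats concrete, the following base R sketch converts such strings into seconds and bytes. Both helpers (time_to_seconds and ram_to_bytes) are hypothetical and are not part of SimDesign; in particular, the use of binary units (1 GB = 1024^3 bytes) is an assumption, and the package may parse these inputs differently.

```r
# Hypothetical helper: SBATCH-style time string -> seconds
time_to_seconds <- function(x) {
    stopifnot(is.character(x), length(x) == 1L)
    days <- 0
    if (grepl("-", x, fixed = TRUE)) {
        # "days-hours[:minutes[:seconds]]"
        parts <- strsplit(x, "-", fixed = TRUE)[[1]]
        days  <- as.numeric(parts[1])
        hms   <- as.numeric(strsplit(parts[2], ":", fixed = TRUE)[[1]])
        stopifnot(length(hms) <= 3L)
        hms <- c(hms, rep(0, 3 - length(hms)))  # pad missing minutes/seconds
    } else {
        f <- as.numeric(strsplit(x, ":", fixed = TRUE)[[1]])
        stopifnot(length(f) %in% 1:3)
        hms <- switch(length(f),
                      c(0, f, 0),   # "minutes"
                      c(0, f),      # "minutes:seconds"
                      f)            # "hours:minutes:seconds"
    }
    days * 86400 + hms[1] * 3600 + hms[2] * 60 + hms[3]
}

# Hypothetical helper: "4GB"-style string -> bytes (binary units assumed)
ram_to_bytes <- function(x) {
    stopifnot(is.character(x), length(x) == 1L)
    ux <- toupper(x)
    m <- regmatches(ux, regexec("^([0-9.]+) *(MB|GB|TB)$", ux))[[1]]
    if (length(m) != 3L) stop("expected a value such as '4GB'")
    power <- c(MB = 2, GB = 3, TB = 4)[[m[3]]]
    as.numeric(m[2]) * 1024^power
}

# Example: a time budget set to 90% of a 3-hour cluster allocation
budget_seconds <- 0.9 * time_to_seconds("03:00:00")
```

A budget computed this way is only a starting point; how much slack is needed depends on how long a single replication takes.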

Details

Due to the nature of how the replications are split it is important that the L'Ecuyer-CMRG (2002) method of random seeds is used across all array ID submissions (cf. runSimulation's parallel approach, which uses this method to distribute random seeds within each isolated condition rather than between all conditions). As such, this function requires the seeds to be generated using gen_seeds with the iseed and arrayID inputs to ensure that each job is analyzing a high-quality set of random numbers via L'Ecuyer-CMRG's (2002) method.
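The idea of giving each array job its own L'Ecuyer-CMRG stream can be illustrated with base R's parallel package. This is a sketch of the general technique only; it is not necessarily how gen_seeds derives its seeds, and the helper draw_from is hypothetical.

```r
library(parallel)

# Switch to the L'Ecuyer-CMRG generator and fix the initial seed once
RNGkind("L'Ecuyer-CMRG")
set.seed(554184288)

# Derive one independent stream per (emulated) array job
n_jobs  <- 3
stream  <- .Random.seed
streams <- vector("list", n_jobs)
for (i in seq_len(n_jobs)) {
    streams[[i]] <- stream
    stream <- nextRNGStream(stream)  # jump ahead to the next stream
}

# Hypothetical helper: draw n uniforms starting from a given stream state
draw_from <- function(seed, n = 3) {
    assign(".Random.seed", seed, envir = globalenv())
    runif(n)
}

draws <- lapply(streams, draw_from)
```

Because every job starts from a distinct stream derived from the same initial seed, the draws differ between jobs yet remain reproducible whenever a job is rerun with the same stream.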

Additionally, for timed simulations on HPC clusters it is recommended to pass a max_time value through the control list to avoid discarding conditions that require more than the time specified in the shell script. The max_time value should be less than the maximum time allocated on the HPC cluster (e.g., approximately 90% of this time or less, though this depends on how long each replication takes). For simulations with missing replication information, a new set of jobs should be submitted at a later time to collect the missing replications.
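When some jobs stop early, the number of replications still owed per condition is a simple tally. The sketch below uses made-up counts in a plain data frame; how the completed replication counts are actually read from the saved files is not shown here, and the column names are hypothetical.

```r
# Hypothetical record of replications completed by each array job
completed <- data.frame(
    condition = rep(1:3, each = 5),
    reps_done = c(10, 10, 10, 10, 10,
                  10, 10,  7, 10, 10,
                  10,  4, 10, 10, 10)
)

target <- 50  # replications wanted per condition

# Sum completed replications per condition and compute the shortfall
done      <- tapply(completed$reps_done, completed$condition, sum)
remaining <- pmax(target - done, 0)

# Conditions that still need a follow-up submission
needs_rerun <- names(remaining)[remaining > 0]
```

The conditions listed in `needs_rerun` are the ones for which a later job submission would need to collect the outstanding replications.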

Author(s)

Phil Chalmers rphilip.chalmers@gmail.com

References

Chalmers, R. P., & Adkins, M. C. (2020). Writing Effective and Reliable Monte Carlo Simulations with the SimDesign Package. The Quantitative Methods for Psychology, 16(4), 248-280. doi:10.20982/tqmp.16.4.p248

Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136-156. doi:10.1080/10691898.2016.1246953

See Also

runSimulation, expandDesign, gen_seeds, aggregate_simulations, getArrayID

Examples


library(SimDesign)

Design <- createDesign(N = c(10, 20, 30))

Generate <- function(condition, fixed_objects = NULL) {
    dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL){
    colMeans(results)
}

## Not run: 

# define initial seed (do this only once to keep it constant!)
# iseed <- gen_seeds()
iseed <- 554184288

### On cluster submission, the active array ID is obtained via getArrayID(),
###   and therefore should be used in real SLURM submissions
arrayID <- getArrayID(type = 'slurm')

# However, for the following example array ID is set to first row only
arrayID <- 1L

# run the simulation (results not caught on job submission, only files saved)
res <- runArraySimulation(design=Design, replications=50,
                      generate=Generate, analyse=Analyse,
                      summarise=Summarise, arrayID=arrayID,
                      iseed=iseed, filename='mysim') # saved as 'mysim-1.rds'
res
SimResults(res) # condition and replication count stored

dir()
SimClean('mysim-1.rds')

########################
# Same submission job as above, however split the replications over multiple
# evaluations and combine when complete
Design5 <- expandDesign(Design, 5)
Design5

# iseed <- gen_seeds()
iseed <- 554184288

# arrayID <- getArrayID(type = 'slurm')
arrayID <- 14L

# run the simulation (replications reduced per row, but same in total)
runArraySimulation(design=Design5, replications=10,
                   generate=Generate, analyse=Analyse,
                   summarise=Summarise, iseed=iseed,
                   filename='mylongsim', arrayID=arrayID)

res <- readRDS('mylongsim-14.rds')
res
SimResults(res) # condition and replication count stored

SimClean('mylongsim-14.rds')


###
# Emulate the arrayID distribution, storing all results in a 'sim/' folder
dir.create('sim/')

# Emulate distribution to nrow(Design5) = 15 independent job arrays
##  (just used for presentation purposes on local computer)
sapply(1:nrow(Design5), \(arrayID)
     runArraySimulation(design=Design5, replications=10,
          generate=Generate, analyse=Analyse,
          summarise=Summarise, iseed=iseed, arrayID=arrayID,
          filename='condition', dirname='sim', # files: "sim/condition-#.rds"
          control = list(max_time="04:00:00", max_RAM="4GB"))) |> invisible()

#  If necessary, conditions above will manually terminate before
#  4 hours and 4GB of RAM are used, returning any
#  successfully completed results before the HPC session times
#  out (provided .slurm script specified more than 4 hours)

# list saved files
dir('sim/')

setwd('sim')
condition14 <- readRDS('condition-14.rds')
condition14
SimResults(condition14)

# aggregate simulation results into single file
final <- aggregate_simulations(files=dir())
final

SimResults(final) |> View()

setwd('..')
SimClean(dirs='sim/')


## End(Not run)


philchalmers/SimDesign documentation built on April 29, 2024, 11:43 p.m.