sperrorest runs in parallel by default as of v2.0.0.
Most users are not familiar with parallelization and have no time or motivation to wrap their heads around it; instead, they simply accept waiting "a bit" longer until the process finishes.
While this is no problem for "quick" cross-validation (CV) settings with few repetitions and fast-converging models, in some cases sequential processing may take up to several months. For example, a spatial cross-validation using a Generalized Linear Mixed Model (GLMM) with both random effects and a spatial autocorrelation structure on around 1000 observations takes roughly this long when executed sequentially. Most of the fitting time is devoted to integrating the spatial autocorrelation structure.
sperrorest comes with four different parallelization modes and also offers sequential execution.
Unless specified otherwise, all cores of the machine are used. Limiting the number of cores makes sense when you want to keep doing other work on your machine while a cross-validation is running, so that your system stays responsive. It can also save time: suppose you are working on a server with 48 available cores and want to run a 100-repetition CV. Since most models take roughly the same time to fit, it would be smart to use 34 cores rather than 48, because:

1. With 34 cores you need 3 worker rounds to process all repetitions (34 done after the first round, 68 after the second, finishing in the third). With 48 cores you also need 3 rounds, but in the last round most cores would do nothing but wait for the others to finish.
2. The parallelization overhead, mainly caused by splitting the jobs across the workers and combining their results, is higher for 48 cores than for 34.

Hence, 34 cores will finish 100 repetitions faster than 48 cores. Of course, with 50 cores only 2 worker rounds would be needed to process everything, which would again speed things up.
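The round arithmetic above can be checked directly in base R:

```r
# Each worker round processes at most `cores` repetitions in parallel,
# so the number of rounds needed is ceiling(reps / cores).
reps <- 100
sapply(c(34, 48, 50), function(cores) ceiling(reps / cores))
#> [1] 3 3 2
```

Both 34 and 48 cores need 3 rounds, so the smaller worker pool wins on overhead; 50 cores drops the count to 2 rounds.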
All modes except "apply" (including the sequential one) run on the parallel API of the future package. It offers a unified, cross-platform API combining the existing parallel approaches of R into one package. Besides the variety of parallel options to choose from ("multicore", "multisession", "cluster", etc.), it also provides a sequential option. Every option is initiated in the same way:
```r
library(future)
library(doFuture)     # provides registerDoFuture()

registerDoFuture()
plan("sequential")    # sequential
plan("multicore")     # parallel (Unix only)
plan("multisession")  # parallel
plan("multiprocess")  # parallel
plan("cluster")       # parallel
```
Every option has its advantages and disadvantages. Check the
future package vignettes for more information.
Unless specified otherwise, the default parallel mode uses foreach with the "cluster" option of the future package. The doFuture package takes care that foreach works with the parallel initialization of the future package. This option is the default because it works cross-platform and provides progress output in the console. Unfortunately, on Windows this output cannot be shown in the console and needs to be written to a file instead (by default in the current working directory). Another downside is that the global environment needs to be copied to every worker before processing starts. Workers are started sequentially, so starting more than 10 workers may take some seconds.
The "apply" mode is also cross-platform but uses different functions on Unix and non-Unix systems for the actual processing. On Unix, it uses the pbmcapply package, which combines the pbapply package (providing progress bars for apply-type functions) with parallel processing to speed up computation. On Windows, pbapply is used, which in the end calls parApply() to set up a cluster-like parallelization including a progress bar.
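The Unix/non-Unix branching described above can be made explicit with a small sketch (the backend names are taken from the description above, not queried from sperrorest itself):

```r
# Which backend the "apply" mode ends up using on this machine:
if (.Platform$OS.type == "unix") {
  backend <- "pbmcapply"            # Unix: forked workers with a progress bar
} else {
  backend <- "pbapply + parApply()" # Windows: cluster-like, with a progress bar
}
backend
```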
The "future" mode relies entirely on the future package, with future_lapply() as the workhorse. It can be used with any future plan, specified via par_option. It is the fastest mode but provides no progress output.
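As a sketch of selecting a specific plan for this mode via par_option (the "multisession" value is one of the plan names listed earlier; the reduced formula and repetition count are only for illustration):

```r
library(sperrorest)

data(ecuador)
fo <- slides ~ dem + slope + hcurv + vcurv

# "future" mode with an explicitly chosen future plan
sperrorest(data = ecuador, formula = fo,
           model_fun = glm, model_args = list(family = "binomial"),
           pred_args = list(type = "response"),
           smp_fun = partition_cv,
           smp_args = list(repetition = 1:10, nfold = 5),
           par_args = list(par_mode = "future",
                           par_option = "multisession"))
```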
The sequential mode executes sperrorest() sequentially. It also runs on the future API using doFuture, which provides the possibility of sequential execution via plan("sequential").
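A minimal sequential call might look as follows (assuming "sequential" is the corresponding par_mode value, by analogy with the mode names used in the benchmark below; reduced formula and repetitions for brevity):

```r
library(sperrorest)

data(ecuador)
fo <- slides ~ dem + slope + hcurv + vcurv

# sequential execution: no workers are spawned
sperrorest(data = ecuador, formula = fo,
           model_fun = glm, model_args = list(family = "binomial"),
           pred_args = list(type = "response"),
           smp_fun = partition_cv,
           smp_args = list(repetition = 1:10, nfold = 5),
           par_args = list(par_mode = "sequential"))
```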
Note that the only argument which needs to be changed between the following benchmark runs is par_mode: par_mode = "foreach", par_mode = "apply" and par_mode = "future" were used, each with its default settings. par_mode = "foreach" runs on the "cluster" option of the future package, and par_mode = "apply" ends up using pbmcapply, since the test was run on a Unix system.
```r
data(ecuador)
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

sperrorest(data = ecuador, formula = fo,
           model_fun = glm, model_args = list(family = "binomial"),
           pred_args = list(type = "response"),
           smp_fun = partition_cv,
           smp_args = list(repetition = 1:100, nfold = 5),
           par_args = list(par_mode = "foreach", par_units = 20),
           benchmark = TRUE, progress = FALSE,
           importance = TRUE, imp_permutations = 100)
```
|               | foreach | apply | future |
|---------------|---------|-------|--------|
| runtime (min) | 52.33   | 51.67 | 49.54  |