```r
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(sl3)
library(delayed)
library(SuperLearner)
library(future)
library(ggplot2)
library(data.table)
library(stringr)
library(scales)
```
This document consists of some simple benchmarks for various choices of Super Learner implementation, wrapper functions, and parallelization schemes. Its purpose is two-fold: to compare the performance of `sl3` (using both legacy wrappers and native learners) against the legacy `SuperLearner` implementation, and to compare the available parallelization schemes.
```r
uname <- system("uname -a", intern = TRUE)
os <- sub(" .*", "", uname)
if (os == "Darwin") {
  cpu_model <- system("sysctl -n machdep.cpu.brand_string", intern = TRUE)
  cpus_physical <- as.numeric(system("sysctl -n hw.physicalcpu", intern = TRUE))
  cpus_logical <- as.numeric(system("sysctl -n hw.logicalcpu", intern = TRUE))
  cpu_clock <- system("sysctl -n hw.cpufrequency_max", intern = TRUE)
  memory <- system("sysctl -n hw.memsize", intern = TRUE)
} else if (os == "Linux") {
  cpu_model <- system("lscpu | grep 'Model name'", intern = TRUE)
  cpu_model <- gsub("Model name:[[:blank:]]*", "", cpu_model)
  cpus_logical <- system("lscpu | grep '^CPU(s)'", intern = TRUE)
  cpus_logical <- as.numeric(gsub("^.*:[[:blank:]]*", "", cpus_logical))
  tpc <- system("lscpu | grep '^Thread(s) per core'", intern = TRUE)
  tpc <- as.numeric(gsub("^.*:[[:blank:]]*", "", tpc))
  cpus_physical <- cpus_logical / tpc
  cpu_clock <- as.numeric(gsub("GHz", "", gsub("^.*@", "", cpu_model))) * 10^9
  memory <- system("cat /proc/meminfo | grep '^MemTotal'", intern = TRUE)
  memory <- as.numeric(gsub("kB", "", gsub("^.*:", "", memory))) * 2^10
} else {
  stop("unsupported OS")
}
```
This document was produced on a system with the following specifications: `r cpu_model`, with `r as.numeric(cpus_physical)` physical cores (`r as.numeric(cpus_logical)` logical cores) at `r as.numeric(cpu_clock)/10^9` GHz and `r round(as.numeric(memory)/2^30, 1)` GB of memory.

We benchmark on an upsampled version of the `cpp_imputed` dataset included with `sl3`:

```r
n <- 1e4
data(cpp_imputed)
cpp_big <- cpp_imputed[sample(nrow(cpp_imputed), n, replace = TRUE), ]
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"
task <- sl3_Task$new(cpp_big, covariates = covars, outcome = outcome,
                     outcome_type = "continuous")
```
The resulting task has `r nrow(task$X)` observations of `r ncol(task$X)` covariates.
## SuperLearner

The legacy `SuperLearner` package serves as a suitable baseline. We can fit it sequentially (no parallelization):
```r
time_SuperLearner_sequential <- system.time({
  SuperLearner(task$Y, as.data.frame(task$X), newX = NULL,
               family = gaussian(),
               SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
               method = "method.NNLS", id = NULL, verbose = FALSE,
               control = list(), cvControl = list(), obsWeights = NULL,
               env = parent.frame())
})
```
We can also fit it using multicore parallelization, via the `mcSuperLearner` function:
```r
options(mc.cores = cpus_physical)
time_SuperLearner_multicore <- system.time({
  mcSuperLearner(task$Y, as.data.frame(task$X), newX = NULL,
                 family = gaussian(),
                 SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
                 method = "method.NNLS", id = NULL, verbose = FALSE,
                 control = list(), cvControl = list(), obsWeights = NULL,
                 env = parent.frame())
})
```
The `SuperLearner` package supports a number of other parallelization schemes, although these weren't tested here.
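For example, one such untested scheme is cluster-based ("snow") parallelization via `snowSuperLearner`, which can distribute work across machines as well as cores. A minimal sketch (not benchmarked here; it reuses the task and library defined above, and the worker count of 2 is an arbitrary illustration):

```r
library(parallel)
library(SuperLearner)

# create a PSOCK cluster and load SuperLearner on each worker
cl <- makeCluster(2, type = "PSOCK")
clusterEvalQ(cl, library(SuperLearner))

# fit across the cluster; arguments mirror the sequential call above
fit <- snowSuperLearner(cluster = cl, Y = task$Y,
                        X = as.data.frame(task$X),
                        family = gaussian(),
                        SL.library = c("SL.glmnet", "SL.randomForest",
                                       "SL.speedglm"))

stopCluster(cl)
```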
## sl3 with Legacy SuperLearner Wrappers

To maximize comparability with the legacy implementation, we can use `sl3` with the `SuperLearner` wrappers, so that the actual computation used to train the learners is identical:
```r
sl_glmnet <- Lrnr_pkg_SuperLearner$new("SL.glmnet")
sl_random_forest <- Lrnr_pkg_SuperLearner$new("SL.randomForest")
sl_speedglm <- Lrnr_pkg_SuperLearner$new("SL.speedglm")
nnls_lrnr <- Lrnr_nnls$new()
sl3_legacy <- Lrnr_sl$new(list(sl_random_forest, sl_glmnet, sl_speedglm),
                          nnls_lrnr)
```
## sl3 with Native Learners

We can also use native `sl3` learners, which have been rewritten to be performant at large sample sizes:
```r
lrnr_glmnet <- Lrnr_glmnet$new()
random_forest <- Lrnr_randomForest$new()
glm_fast <- Lrnr_glm_fast$new()
nnls_lrnr <- Lrnr_nnls$new()
sl3_native <- Lrnr_sl$new(list(random_forest, lrnr_glmnet, glm_fast),
                          nnls_lrnr)
```
## sl3 Parallelization Options

`sl3` uses the `delayed` package to parallelize training tasks. `delayed`, in turn, uses the `future` package to support a range of parallel back-ends. We test several of these, for both the legacy wrappers and the native learners.
First, sequential evaluation (no parallelization):
```r
plan(sequential)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})
```
Next, multicore parallelization:
```r
plan(multicore, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
```
We also test multicore parallelization with hyper-threading -- we use a number of workers equal to the number of logical, not physical, cores:
```r
plan(multicore, workers = cpus_logical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
```
Finally, we test parallelization using multisession:
```r
plan(multisession, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
```
```r
results <- rbind(time_sl3_legacy_sequential,
                 time_sl3_legacy_multicore,
                 time_sl3_legacy_multicore_ht,
                 time_sl3_legacy_multisession,
                 time_sl3_native_sequential,
                 time_sl3_native_multicore,
                 time_sl3_native_multicore_ht,
                 time_sl3_native_multisession,
                 time_SuperLearner_sequential,
                 time_SuperLearner_multicore)
test <- rownames(results)
results <- as.data.table(results)
invisible(results[, test := gsub("time_", "", test)])
invisible(results[, native := str_detect(test, "native")])
invisible(results[, parallel := !str_detect(test, "sequential")])
results <- results[order(results$elapsed)]
invisible(results[, test := factor(test, levels = test)])

breaks <- 2^(-20:20)
ggplot(results, aes(y = test, x = elapsed, color = factor(native),
                    shape = factor(parallel))) +
  geom_point(size = 3) +
  xlab("Time (seconds) -- log scale") +
  ylab("Test") +
  theme_bw() +
  scale_shape_discrete("Parallel Computation") +
  scale_color_discrete("Native Learners") +
  scale_x_continuous(trans = log2_trans(), breaks = breaks) +
  annotation_logticks(base = 2, sides = "b")

save(results, file = "benchmark_results.rdata")
```
We can see that using the native learners results in about a 4x speedup relative to the legacy wrappers. This can be at least partially explained by the fact that the legacy `SL.randomForest` wrapper uses `randomForest.formula` for continuous data, which resorts to using the `model.matrix` function, known to be slow on large datasets. Improvements to the legacy wrappers would probably reduce or eliminate this difference.
We can also see that multicore parallelization for the legacy `SuperLearner` function results in another 4x speedup on this system. Relative to that, the `sl3_legacy_multicore` test results in almost an additional 2x speedup. This can be explained by the use of `delayed` parallelization. While `mcSuperLearner` parallelizes simply across the $V$ cross-validation folds, `delayed` allows `sl3` to parallelize across all training tasks that comprise the Super Learner, a total of $(V+1) \times n_{learners}$ training tasks, where $n_{learners}$ is the number of learners in the library (here 4), and $(V+1)$ is one more than the number of cross-validation folds, accounting for the re-fit to the full data typically implemented in the Super Learner algorithm. We don't see a substantial difference between the three parallelization schemes for `sl3`.
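To make the task-count comparison concrete, a quick back-of-the-envelope calculation (assuming $V = 10$ folds, a common default, and the 4 learners counted above):

```r
V <- 10          # assumed number of cross-validation folds
n_learners <- 4  # learners in the library, as counted above

# mcSuperLearner parallelizes one task per fold
mc_tasks <- V

# delayed parallelizes every learner/fold fit, plus the full-data re-fits
delayed_tasks <- (V + 1) * n_learners

mc_tasks       # 10
delayed_tasks  # 44
```

With more parallelizable tasks than workers, `delayed` can keep all cores busy even when individual learners finish at different times.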
These effects appear multiplicative, resulting in the fastest implementation, `sl3_native_multicore_ht` (`sl3` with native learners and hyper-threaded multicore parallelization), being about 32x faster than the slowest, `SuperLearner_sequential` (legacy `SuperLearner` without parallelization). This is a dramatic improvement in the time required to run this Super Learner.
```r
sessionInfo()
```