```r
knitr::opts_chunk$set(fig.width = 7, fig.height = 4.5, dpi = 300,
                      fig.cap = "", fig.align = "center")
showtext::showtext.opts(dpi = 300)
library(sl3)
library(methods)
```
--
Here's a shortened URL: https://goo.gl/fAEhqJ
???
`sl3` is an R package which provides a modern implementation of the Super Learner algorithm [@vdl2007super], a method for performing stacked regressions [@breiman1996stacked] combined with covariate screening and cross-validation.

---
class: inverse, center, middle

# `sl3` Design Principles

---

# `sl3` Architecture

All of the classes defined in `sl3` are based on the R6 framework, which brings a newer object-oriented paradigm to the R language.
* `sl3_Task`: defines the ML problem (task). Keeps track of the data as well as the variables. Created by `make_sl3_Task()`.

--
* `Lrnr_base`: base class for defining ML algorithms. Saves fits on particular `sl3_Task`s. Different learning algorithms are defined in classes that inherit from this class.

--
* `Pipeline`: defines a sequential pipeline of learners, in which the fit of one learner is used by the next.

--
* `Stack`: stacks several ML learners and trains them simultaneously on the same data, so that their predictions can be either combined or compared.

???
A `Stack` allows an included `Pipeline` to be subjected to cross-validation in the same way as any other learner (seriously awesome feature).

Because all learners inherit from `Lrnr_base`, they have many features in common and can be used interchangeably. Learners share three main methods: `train`, `predict`, and `chain`.

A `Pipeline` allows covariate screening and model fitting to be subjected to the same cross-validation process, which is necessary for Super Learning.

--
`sl3` is designed using basic OOP principles and the R6 framework, in which the methods and fields of a class object are accessed using the `$` operator.

---
class: inverse, center, middle
# Installing `sl3`

For now, installing `sl3` from the `master` branch is the only option:

```r
devtools::install_github("jeremyrcoyle/sl3")
```
```r
set.seed(49753)
library(data.table)
library(dplyr)
library(origami)
library(SuperLearner)
```
--
To start using `sl3`, let's load the package:

```r
library(sl3)
```
We use data from the Collaborative Perinatal Project (CPP) to illustrate the features of `sl3` as well as its proper usage. For convenience, the data is included with the `sl3` R package.
```r
# load example data set
data(cpp_imputed)

# here are the covariates we are interested in and, of course, the outcome
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn")
outcome <- "haz"
```
???
# `sl3_Task` I

An `sl3_Task` is the core structure that holds the data set. The `sl3_Task` specifies the covariates, the outcome, and the `outcome_type`. These specifications must be respected by all learners that work with the task.
```r
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = outcome, outcome_type = "continuous")
```
--
* `make_sl3_Task()` created a new `sl3_Task`.
* We specified the underlying data (`cpp_imputed`), as well as the covariates and outcome.
* We also specified an `outcome_type` of `"continuous"`; other options include `"categorical"`, `"binomial"`, and `"quasibinomial"`.
# `sl3_Task` Options

`make_sl3_Task()` has many options providing support for a wide range of ML problems. For example:

* `id`: clustered / repeated-measures data
* `weights`: survey data
* `offset`: TMLE

---

# `sl3_Task` II

Let's take a look at the `task` that we set up:
```r
task
```
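As an illustrative sketch of the options above (the data set `my_data` and its `cluster_id` and `obs_weights` columns are hypothetical, not part of `cpp_imputed`), these options are passed directly to `make_sl3_Task()`:

```r
# hypothetical columns: "cluster_id" identifies repeated measures on the
# same unit, and "obs_weights" holds survey/observation weights
task_weighted <- make_sl3_Task(
  data = my_data, covariates = covars, outcome = outcome,
  id = "cluster_id",
  weights = "obs_weights"
)
```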
`Lrnr_base` is the base class for defining ML algorithms. Learner fits are saved on particular `sl3_Task`s.

--

Different learning algorithms are defined in classes that inherit from `Lrnr_base`.

--

For example, the `Lrnr_glm` class inherits from `Lrnr_base` and defines a learner that fits GLMs.

--

Learners are instantiated with the `make_learner()` function:

```r
# make learner object
lrnr_glm <- make_learner(Lrnr_glm)
```
All learners inherit from `Lrnr_base`, so they have many features in common and can be used interchangeably.

--

Learners share three main methods: `train`, `predict`, and `chain`.

--

The first, `train`, takes an `sl3_Task` object and returns a `learner_fit`, which has the same class as the learner that was trained:

```r
# fit learner to task data
lrnr_glm_fit <- lrnr_glm$train(task)

# verify that the learner is fit
lrnr_glm_fit$is_trained
```
We can generate predictions with the `predict()` method:

```r
preds <- lrnr_glm_fit$predict()
head(preds)
```
--
We can also pass a task to `predict()` to generate predictions for that task:

```r
preds <- lrnr_glm_fit$predict(task)
head(preds)
```
Use `sl3_list_properties()` to get a list of all properties supported by at least one learner.

--

Use `sl3_list_learners()` to find learners supporting any given set of properties:

```r
sl3_list_learners(c("binomial", "offset"))
```
Learners can be instantiated without providing any additional parameters; we have tried to provide sensible defaults for each learner. You can modify a learner's behavior by instantiating it with different parameters.

--

`sl3` learners support some common parameters (where applicable):
* `covariates`: subsets the covariates before fitting. Allows learners to be fit to the same task with different covariate subsets.
* `outcome_type`: overrides the task's `outcome_type`. Allows learners to be fit to the same task with different outcome types.
* `...`: arbitrary parameters can be passed directly to the learner method. See the documentation for each learner.
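As a sketch of these parameters in use (the particular covariate subset and the `alpha` value are illustrative choices, not recommendations):

```r
# fit a GLM using only a subset of the task's covariates
lrnr_glm_sub <- make_learner(Lrnr_glm, covariates = c("apgar1", "mage"))

# pass an arbitrary parameter through `...` (here, glmnet's elastic-net
# mixing parameter; alpha = 1 gives the lasso)
lrnr_lasso <- make_learner(Lrnr_glmnet, alpha = 1)
```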
# The `SuperLearner` Package

We can create an `sl3` learner that uses the `SL.glmnet` wrapper from `SuperLearner`:

```r
lrnr_sl_glmnet <- make_learner(Lrnr_pkg_SuperLearner, "SL.glmnet")
```
???

Learners wrapped from `SuperLearner` will not be as efficient as their native `sl3` counterparts. If your favorite learner is missing from `sl3`, please consider adding it by following the "Defining New Learners" vignette.

---

`sl3` supports univariate and multivariate time series. Using the `bsds` example data set, we can make forecasts of arbitrary size using one of the time-series learners:
```r
data(bsds)
task <- sl3_Task$new(bsds, covariates = c("cnt"), outcome = "cnt")

# self-exciting threshold autoregressive model
tsDyn_learner <- Lrnr_tsDyn$new(learner = "setar", m = 1, model = "TAR", n.ahead = 5)
fit_1 <- tsDyn_learner$train(task)
fit_1$predict(task)
```
--
`sl3` also supports several different options for cross-validation with time-series data, as well as ensemble forecasting. Examples can be found in the "examples" directory on GitHub.
--
---
class: inverse, center, middle

# `sl3` Pipelines
Let's look at one example of chaining via pre-screening of covariates:
Below, we generate a screener object based on the SuperLearner
function
screen.corP
and fit it to our task.
Inspecting the fit, we see that it selected a subset of covariates:
```r
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screen_fit <- screen_cor$train(task)
print(screen_fit)
```
We use the `chain()` method to return a new task:

```r
screened_task <- screen_fit$chain()
print(screened_task)
```
We can then fit `lrnr_glm` on the screened task:

```r
screened_glm_fit <- lrnr_glm$train(screened_task)
screened_preds <- screened_glm_fit$predict()
head(screened_preds)
```
The `Pipeline` class automates this process. A `Pipeline` is a learner like any other; it shares the same interface.

--

We construct a `Pipeline` with `make_learner()`, and use `train` and `predict` just as we did before:

```r
sg_pipeline <- make_learner(Pipeline, screen_cor, lrnr_glm)
sg_pipeline_fit <- sg_pipeline$train(task)
sg_pipeline_preds <- sg_pipeline_fit$predict()
head(sg_pipeline_preds)
```
--
This pipeline fits `glm` on the chained task produced by the screening learner:

```r
dt <- delayed_learner_train(sg_pipeline, task)
plot(dt, color = FALSE, height = "300px")
```
The `chain()` method of each learner specifies how this chaining works.

---

Like `Pipeline`s, `Stack`s combine multiple learners. `Stack`s train learners simultaneously, so that their predictions can be either combined or compared.

--
A `Stack` is just a special learner, and so has the same interface as all other learners. Below, we make a `stack` of two learners: a simple `glm` learner and our previous pipeline:

```r
stack <- make_learner(Stack, lrnr_glm, sg_pipeline)
stack_fit <- stack$train(task)
stack_preds <- stack_fit$predict()
head(stack_preds)
```
???
We could have included any arbitrary set of learners and pipelines, the latter of which are themselves just learners.
We can see that the `predict` method now returns a matrix, with a column for each learner included in the stack.
```r
dt <- delayed_learner_train(stack, task)
plot(dt, color = FALSE, height = "500px")
```
Almost forgot! Cross-validation is necessary in order to honestly evaluate our models and avoid over-fitting. We provide facilities for easily doing this, based on the `origami` package.
--
The `Lrnr_cv` learner wraps another learner and performs training and prediction in a cross-validated fashion, using separate training and validation splits as defined by `task$folds`.

--
Below, we define a new `Lrnr_cv` object based on the previously defined `stack`, train it, and generate predictions on the validation sets:

```r
cv_stack <- Lrnr_cv$new(stack)
cv_fit <- cv_stack$train(task)
cv_preds <- cv_fit$predict()
```
We can use the `Lrnr_cv` fit's `cv_risk` method to estimate cross-validated risk values:

```r
risks <- cv_fit$cv_risk(loss_squared_error)
print(risks)
```
--
```r
dt <- delayed_learner_train(cv_stack, task)
plot(dt, color = FALSE, height = "500px")
```
---
class: inverse, center, middle

# Super Learner

---

We can combine `Pipeline`s, `Stack`s, and `Lrnr_cv` to easily define a Super Learner.

--
```r
metalearner <- make_learner(Lrnr_nnls)
cv_task <- cv_fit$chain()
ml_fit <- metalearner$train(cv_task)
```
--
Here we used `Lrnr_nnls` for the meta-learning step, which fits a non-negative least squares meta-learner.

--
If all of the learners in a `Pipeline` are already fit (i.e., `learner$is_trained` is `TRUE`), the result will also be a fit:

```r
sl_pipeline <- make_learner(Pipeline, stack_fit, ml_fit)
sl_preds <- sl_pipeline$predict()
head(sl_preds)
```
# `Lrnr_sl`

This entire procedure is automated by the `Lrnr_sl` learner:

--

```r
sl <- Lrnr_sl$new(learners = stack, metalearner = metalearner)
sl_fit <- sl$train(task)
lrnr_sl_preds <- sl_fit$predict()
head(lrnr_sl_preds)
```
--
???
---
class: inverse, center, middle
# `delayed`

--

`sl3` uses the `delayed` package, which parallelizes across the many training and prediction tasks involved in fitting a Super Learner, in a way that takes into account their inter-dependent nature.

--
```r
lrnr_rf <- make_learner(Lrnr_randomForest)
lrnr_glmnet <- make_learner(Lrnr_glmnet)
sl <- Lrnr_sl$new(learners = list(lrnr_glm, lrnr_rf, lrnr_glmnet),
                  metalearner = metalearner)
```
--
```r
delayed_sl_fit <- delayed_learner_train(sl, task)
plot(delayed_sl_fit, color = TRUE, height = "500px")
```
`delayed` then allows us to parallelize the procedure across these tasks using the `future` package.

N.B., this feature is currently experimental and hasn't yet been thoroughly tested on a range of parallel back-ends.
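A minimal sketch of what this looks like, assuming the `future` package is installed (the worker count is an arbitrary illustrative choice):

```r
library(future)

# declare a parallel back-end; delayed can then dispatch independent
# training tasks to the worker processes
plan(multisession, workers = 2)

sl_fit <- sl$train(task)
```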
--
???
Fitting a Super Learner is composed of many different training and prediction steps, as the procedure requires that the learners in the stack and the meta-learner be fit on cross-validation folds and on the full data.
For more information on specifying `future` `plan`s for parallelization, see the documentation of the `future` package.
class: center, middle
We have a great team: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Oleg Sofrygin.
Slides created via the R package xaringan.
Powered by remark.js, knitr, and R Markdown.