```r
knitr::opts_chunk$set(fig.width = 7, fig.height = 4.5, dpi = 300,
                      fig.cap = "", fig.align = "center")
showtext::showtext.opts(dpi = 300)
library(sl3)
library(methods)
```
--
Here's a shortened URL: https://goo.gl/fAEhqJ
???
`sl3` is an R package which provides a modern implementation of the Super Learner algorithm [@vdl2007super], a method for performing stacked regressions [@breiman1996stacked] combined with covariate screening and cross-validation.

---
class: inverse, center, middle

# `sl3` Design Principles

---

# `sl3` Architecture

All of the classes defined in `sl3` are based on the R6 framework, which brings a newer object-oriented paradigm to the R language.
* `sl3_Task`: defines the ML problem (task). Keeps track of the data as well as the variables. Created by `make_sl3_Task()`.

--
* `Lrnr_base`: base class for defining ML algorithms. Saves fits on particular `sl3_Task`s. Different learning algorithms are defined in classes that inherit from this class.

--
* `Pipeline`: defines a sequential pipeline of learners, in which the fit of one learner is used by the next.

--
* `Stack`: stacks several ML learners and trains them simultaneously on the same data, so that their predictions can be either combined or compared.

???
A `Stack` allows an included `Pipeline` to be subjected to cross-validation in the same way as any other learner (seriously awesome feature).

Because all learners inherit from `Lrnr_base`, they have many features in common and can be used interchangeably. Learners share three main methods: `train`, `predict`, and `chain`.

A `Pipeline` allows covariate screening and model fitting to be subjected to the same cross-validation process, which is necessary for Super Learning.

--
`sl3` is designed using basic OOP principles and the R6 framework, in which the methods and fields of a class object are accessed using the `$` operator.

---
class: inverse, center, middle
# Installing `sl3`

For now, installing `sl3` from the `master` branch is the only option:

```r
devtools::install_github("jeremyrcoyle/sl3")
```
```r
set.seed(49753)
library(data.table)
library(dplyr)
library(origami)
library(SuperLearner)
```
--
To start using `sl3`, let's load the package:

```r
library(sl3)
```
We use data from the Collaborative Perinatal Project (CPP) to illustrate the features of `sl3` as well as its proper usage. For convenience, the data is included with the `sl3` R package.
```r
# load example data set
data(cpp_imputed)

# here are the covariates we are interested in and, of course, the outcome
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn")
outcome <- "haz"
```
???
# `sl3_Task` I

An `sl3_Task` is the core structure that holds the data set. The `sl3_Task` specifies the covariates, the outcome, and the `outcome_type`. These specifications must be respected by all learners that work with the task.
```r
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = outcome, outcome_type = "continuous")
```
--
* `make_sl3_Task()` created a new `sl3_Task`.
* We specified the underlying data (`cpp_imputed`), as well as the covariates and outcome.
* We also specified an `outcome_type` of `"continuous"`; other options include `"categorical"`, `"binomial"`, and `"quasibinomial"`.
# `sl3_Task` Options

`make_sl3_Task()` has many options providing support for a wide range of ML problems. For example:

* `id`: clustered / repeated-measures data
* `weights`: survey data
* `offset`: TMLE

---

# `sl3_Task` II

Let's take a look at the `task` that we set up:
```r
task
```
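As an illustrative sketch of the options above (the data set `my_data` and its `cluster_id` and `obs_weights` columns are hypothetical, not part of `cpp_imputed`), these options are passed directly to `make_sl3_Task()`:

```r
# hypothetical columns: "cluster_id" identifies repeated measures on the
# same unit, and "obs_weights" holds survey/observation weights
task_weighted <- make_sl3_Task(
  data = my_data, covariates = covars, outcome = outcome,
  id = "cluster_id",
  weights = "obs_weights"
)
```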
`Lrnr_base` is the base class for defining ML algorithms. Learner fits are saved on particular `sl3_Task`s.

--

Different learning algorithms are defined in classes that inherit from `Lrnr_base`.

--

For example, the `Lrnr_glm` class inherits from `Lrnr_base` and defines a learner that fits GLMs.

--

Learners are instantiated with the `make_learner()` function:

```r
# make learner object
lrnr_glm <- make_learner(Lrnr_glm)
```
All learners inherit from `Lrnr_base`, so they have many features in common and can be used interchangeably.

--

Learners share three main methods: `train`, `predict`, and `chain`.

--

The first, `train`, takes an `sl3_Task` object and returns a `learner_fit`, which has the same class as the learner that was trained:

```r
# fit learner to task data
lrnr_glm_fit <- lrnr_glm$train(task)

# verify that the learner is fit
lrnr_glm_fit$is_trained
```
We can generate predictions with the `predict()` method:

```r
preds <- lrnr_glm_fit$predict()
head(preds)
```
--
We can also pass a task to `predict()` to generate predictions for that task:

```r
preds <- lrnr_glm_fit$predict(task)
head(preds)
```
Use `sl3_list_properties()` to get a list of all properties supported by at least one learner.

--

Use `sl3_list_learners()` to find learners supporting any given set of properties:

```r
sl3_list_learners(c("binomial", "offset"))
```
Learners can be instantiated without providing any additional parameters; we have tried to provide sensible defaults for each learner. You can modify a learner's behavior by instantiating it with different parameters.

--

`sl3` learners support some common parameters (where applicable):
* `covariates`: subsets the covariates before fitting. Allows learners to be fit to the same task with different covariate subsets.
* `outcome_type`: overrides the task's `outcome_type`. Allows learners to be fit to the same task with different outcome types.
* `...`: arbitrary parameters can be passed directly to the learner method. See the documentation for each learner.
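As a sketch of these parameters in use (the particular covariate subset and the `alpha` value are illustrative choices, not recommendations):

```r
# fit a GLM using only a subset of the task's covariates
lrnr_glm_sub <- make_learner(Lrnr_glm, covariates = c("apgar1", "mage"))

# pass an arbitrary parameter through `...` (here, glmnet's elastic-net
# mixing parameter; alpha = 1 gives the lasso)
lrnr_lasso <- make_learner(Lrnr_glmnet, alpha = 1)
```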
# The `SuperLearner` Package

We can create an `sl3` learner that uses the `SL.glmnet` wrapper from `SuperLearner`:

```r
lrnr_sl_glmnet <- make_learner(Lrnr_pkg_SuperLearner, "SL.glmnet")
```
???

Learners wrapped from `SuperLearner` will not be as efficient as their native `sl3` counterparts. If your favorite learner is missing from `sl3`, please consider adding it by following the "Defining New Learners" vignette.

---

`sl3` supports univariate and multivariate time series. Using the `bsds` example data set, we can make forecasts of arbitrary size using one of the time-series learners:
```r
data(bsds)
task <- sl3_Task$new(bsds, covariates = c("cnt"), outcome = "cnt")

# self-exciting threshold autoregressive model
tsDyn_learner <- Lrnr_tsDyn$new(learner = "setar", m = 1, model = "TAR", n.ahead = 5)
fit_1 <- tsDyn_learner$train(task)
fit_1$predict(task)
```
--
`sl3` also supports several different options for cross-validation with time-series data, as well as ensemble forecasting. Examples can be found in the "examples" directory on GitHub.
--
---
class: inverse, center, middle

# `sl3` Pipelines
Let's look at one example of chaining via pre-screening of covariates:
Below, we generate a screener object based on the SuperLearner
function
screen.corP
and fit it to our task.
Inspecting the fit, we see that it selected a subset of covariates:
```r
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screen_fit <- screen_cor$train(task)
print(screen_fit)
```
We use the `chain()` method to return a new task:

```r
screened_task <- screen_fit$chain()
print(screened_task)
```
We can then fit `lrnr_glm` on the screened task:

```r
screened_glm_fit <- lrnr_glm$train(screened_task)
screened_preds <- screened_glm_fit$predict()
head(screened_preds)
```
The `Pipeline` class automates this process. A `Pipeline` is a learner like any other; it shares the same interface.

--

We construct a `Pipeline` with `make_learner()`, and use `train` and `predict` just as we did before:

```r
sg_pipeline <- make_learner(Pipeline, screen_cor, lrnr_glm)
sg_pipeline_fit <- sg_pipeline$train(task)
sg_pipeline_preds <- sg_pipeline_fit$predict()
head(sg_pipeline_preds)
```
--
This pipeline fits `glm` on the chained task produced by the screening learner:

```r
dt <- delayed_learner_train(sg_pipeline, task)
plot(dt, color = FALSE, height = "300px")
```
The `chain()` method of each learner specifies how this chaining works.

---

Like `Pipeline`s, `Stack`s combine multiple learners. `Stack`s train learners simultaneously, so that their predictions can be either combined or compared.

--
A `Stack` is just a special learner, and so has the same interface as all other learners. Below, we make a `stack` of two learners: a simple `glm` learner and our previous pipeline:

```r
stack <- make_learner(Stack, lrnr_glm, sg_pipeline)
stack_fit <- stack$train(task)
stack_preds <- stack_fit$predict()
head(stack_preds)
```
???
We could have included any arbitrary set of learners and pipelines, the latter of which are themselves just learners.
We can see that the `predict` method now returns a matrix, with a column for each learner included in the stack.
```r
dt <- delayed_learner_train(stack, task)
plot(dt, color = FALSE, height = "500px")
```
Almost forgot! Cross-validation is necessary in order to honestly evaluate our models and avoid over-fitting. We provide facilities for easily doing this, based on the `origami` package.
--
The `Lrnr_cv` learner wraps another learner and performs training and prediction in a cross-validated fashion, using separate training and validation splits as defined by `task$folds`.

--
Below, we define a new `Lrnr_cv` object based on the previously defined `stack`, train it, and generate predictions on the validation sets:

```r
cv_stack <- Lrnr_cv$new(stack)
cv_fit <- cv_stack$train(task)
cv_preds <- cv_fit$predict()
```
We can use the `Lrnr_cv` fit's `cv_risk` method to estimate cross-validated risk values:

```r
risks <- cv_fit$cv_risk(loss_squared_error)
print(risks)
```
--
```r
dt <- delayed_learner_train(cv_stack, task)
plot(dt, color = FALSE, height = "500px")
```
---
class: inverse, center, middle

# Super Learner

---

We can combine `Pipeline`s, `Stack`s, and `Lrnr_cv` to easily define a Super Learner.

--
```r
metalearner <- make_learner(Lrnr_nnls)
cv_task <- cv_fit$chain()
ml_fit <- metalearner$train(cv_task)
```
--
Here we used `Lrnr_nnls` for the meta-learning step, which fits a non-negative least squares meta-learner.

--
If all of the learners in a `Pipeline` are already fit (i.e., `learner$is_trained` is `TRUE`), the result will also be a fit:

```r
sl_pipeline <- make_learner(Pipeline, stack_fit, ml_fit)
sl_preds <- sl_pipeline$predict()
head(sl_preds)
```
# `Lrnr_sl`

This entire procedure is automated by the `Lrnr_sl` learner:

--

```r
sl <- Lrnr_sl$new(learners = stack, metalearner = metalearner)
sl_fit <- sl$train(task)
lrnr_sl_preds <- sl_fit$predict()
head(lrnr_sl_preds)
```
--
???
---
class: inverse, center, middle
# `delayed`

--

`sl3` uses the `delayed` package, which parallelizes across the many training and prediction tasks involved in fitting a Super Learner, in a way that takes into account their inter-dependent nature.

--
```r
lrnr_rf <- make_learner(Lrnr_randomForest)
lrnr_glmnet <- make_learner(Lrnr_glmnet)
sl <- Lrnr_sl$new(learners = list(lrnr_glm, lrnr_rf, lrnr_glmnet),
                  metalearner = metalearner)
```
--
```r
delayed_sl_fit <- delayed_learner_train(sl, task)
plot(delayed_sl_fit, color = TRUE, height = "500px")
```
`delayed` then allows us to parallelize the procedure across these tasks using the `future` package.

N.B., this feature is currently experimental and hasn't yet been thoroughly tested on a range of parallel back-ends.
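A minimal sketch of what this looks like, assuming the `future` package is installed (the worker count is an arbitrary illustrative choice):

```r
library(future)

# declare a parallel back-end; delayed can then dispatch independent
# training tasks to the worker processes
plan(multisession, workers = 2)

sl_fit <- sl$train(task)
```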
--
???
Fitting a Super Learner is composed of many different training and prediction steps, as the procedure requires that the learners in the stack and the meta-learner be fit on cross-validation folds and on the full data.
For more information on specifying `future` `plan`s for parallelization, see the documentation of the `future` package.
class: center, middle
We have a great team: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Oleg Sofrygin.
Slides created via the R package xaringan.
Powered by remark.js, knitr, and R Markdown.