In jeremyrcoyle/sl3: Pipelines for Machine Learning and Super Learning

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

R/`sl3`: Super Machine Learning with Pipelines

A flexible implementation of the Super Learner ensemble machine learning system

Authors: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin

What's `sl3`?

sl3 is an implementation of the Super Learner ensemble machine learning algorithm of @vdl2007super. The Super Learner algorithm performs ensemble learning in one of two fashions:

The discrete Super Learner can be used to select the best prediction algorithm from among a supplied library of machine learning algorithms ("learners" in the sl3 nomenclature) -- that is, the discrete Super Learner is the single learning algorithm that minimizes the cross-validated risk.
The ensemble Super Learner can be used to assign weights to a set of specified learning algorithms (from a user-supplied library of such algorithms) so as to create a combination of these learners that minimizes the cross-validated risk. This notion of weighted combinations has also been referred to as stacked regression [@breiman1996stacked] and stacked generalization [@wolpert1992stacked].

Looking for long-form documentation or a walkthrough of the sl3 package? Don't worry! Just browse the chapter in our book.

Installation

Install the most recent version from the master branch on GitHub via remotes:

remotes::install_github("tlverse/sl3")

Past stable releases may be located via the releases page on GitHub and may be installed by including the appropriate major version tag. For example,

remotes::install_github("tlverse/sl3@v1.3.7")

To contribute, check out the devel branch and consider submitting a pull request.

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

Examples

sl3 makes the process of applying screening algorithms, learning algorithms, combining both types of algorithms into a stacked regression model, and cross-validating this whole process essentially trivial. The best way to understand this is to see the sl3 package in action:

set.seed(49753)
library(tidyverse)
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)

# load example data set
data(cpp)
cpp <- cpp %>%
  dplyr::filter(!is.na(haz)) %>%
  mutate_all(~ replace(., is.na(.), 0))

# use covariates of intest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz"
)

# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")

# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
head(preds)

Parallelization with `future`s

While it's straightforward to fit a stack of learners (as above), it's easy to take advantage of sl3's built-in parallelization support too. To do this, you can simply choose a plan() from the future ecosystem.

# let's load the future package and set 4 cores for parallelization
library(future)
plan(multicore, workers = 4L)

# now, let's re-train our Stack in parallel
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()

Controlling the number of CV folds

In the above examples, we fit stacks of learners, but didn't create a Super Learner ensemble, which uses cross-validation (CV) to build the ensemble model. For the sake of computational expedience, we may be interested in lowering the number of CV folds (from 10). Let's take a look at how to do both below.

# first, let's instantiate some more learners and create a Super Learner
mean_learner <- Lrnr_mean$new()
rf_learner <- Lrnr_ranger$new()
sl <- Lrnr_sl$new(mean_learner, glm_learner, rf_learner)

# CV folds are controlled in the sl3_Task object; we can lower the number of
# folds simply by specifying this in creating the Task
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz",
  folds = 5L
)

# now, let's fit the Super Learner with just 5-fold CV, then get predictions
sl_fit <- sl$train(task)
sl_preds <- sl_fit$predict()

The folds argument to sl3_Task supports both integers (for V-fold CV) and all of the CV schemes supported in the origami package. To see the full list, query ?fold_funs from within R or take a look at origami's online documentation.

Learner Properties

Properties supported by sl3 learners are presented in the following table:

library(sl3)
library(kableExtra)

all_properties <- sl3_list_properties()

get_all_learners <- function() {
  # search for objects named like sl3 learners
  learner_names <- apropos("^Lrnr_")
  learners <- mget(learner_names, inherits = TRUE)

  # verify that learner inherits from Lrnr_base (and is therefore an actual
  # sl3 Learner)
  is_learner_real <- sapply(learners, `[[`, "inherit") == "Lrnr_base"
  return(learners[which(is_learner_real)])
}
all_learners <- get_all_learners()

# get a list where each element is a list of properties of a particular learner
get_learner_class_properties <- function(learner_class) {
  return(learner_class$private_fields$.properties)
}
learner_property_lst <- lapply(all_learners, get_learner_class_properties)

# get a list where each element is a list of logicals indicating whether a
# learner has a property or not
get_learner_class_properties_ownership <- function(properties) {
  return(sl3_list_properties() %in% unlist(properties))
}
learner_property_ownership <- lapply(learner_property_lst,
                                     get_learner_class_properties_ownership)

get_learner_name <- function(learner_class) {
  return(learner_class$classname)
}
learner_names <- lapply(all_learners, get_learner_name)

# generate matrix whose columns are properties and rows are learner names
final_mat <- matrix(unlist(learner_property_ownership),
                    nrow = length(learner_property_ownership),
                    ncol = length(sl3_list_properties()), byrow = TRUE)
dimnames(final_mat) <- list(unlist(learner_names), sl3_list_properties())
final_mat.df <- as.data.frame(final_mat)

char_final_mat.df <- ifelse(final_mat.df == TRUE, "√", "x")
kable(char_final_mat.df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed",
                                      "responsive")) %>%
  scroll_box(width = "100%", height = "200px", fixed_thead = T)

Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.

Citation

After using the sl3 R package, please cite the following:

    @software{coyle2021sl3-rpkg,
      author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
        Phillips, Rachael V and Sofrygin, Oleg},
      title = {{sl3}: Modern Pipelines for Machine Learning and {Super
        Learning}},
      year = {2021},
      howpublished = {\url{https://github.com/tlverse/sl3}},
      note = {{R} package version 1.4.2},
      url = {https://doi.org/10.5281/zenodo.1342293},
      doi = {10.5281/zenodo.1342293}
    }

License

The contents of this repository are distributed under the GPL-3 license. See file LICENSE for details.

References

jeremyrcoyle/sl3 documentation built on Nov. 18, 2024, 4:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jeremyrcoyle/sl3
Pipelines for Machine Learning and Super Learning

In jeremyrcoyle/sl3: Pipelines for Machine Learning and Super Learning

R/`sl3`: Super Machine Learning with Pipelines

What's `sl3`?

Installation

Issues

Examples

Parallelization with `future`s

Controlling the number of CV folds

Learner Properties

Contributions

Citation

License

References

R Package Documentation

Browse R Packages

We want your feedback!

jeremyrcoyle/sl3 Pipelines for Machine Learning and Super Learning

In jeremyrcoyle/sl3: Pipelines for Machine Learning and Super Learning

R/sl3: Super Machine Learning with Pipelines

What's sl3?

Installation

Issues

Examples

Parallelization with futures

Controlling the number of CV folds

Learner Properties

Contributions

Citation

License

References

R Package Documentation

Browse R Packages

We want your feedback!

jeremyrcoyle/sl3
Pipelines for Machine Learning and Super Learning

R/`sl3`: Super Machine Learning with Pipelines

What's `sl3`?

Parallelization with `future`s