title: 'mlr3: A modern object-oriented machine learning framework in R' tags: - R - machine learning - classification - regression authors: - name: Michel Lang orcid: 0000-0001-9754-0393 affiliation: "1, 2" - name: Martin Binder affiliation: 2 - name: Jakob Richter orcid: 0000-0003-4481-5554 affiliation: 1 - name: Patrick Schratz orcid: 0000-0003-0748-6624 affiliation: 2 - name: Florian Pfisterer orcid: 0000-0001-8867-762X affiliation: 2 - name: Stefan Coors orcid: 0000-0002-7465-2146 affiliation: 2 - name: Quay Au orcid: 0000-0002-5252-8902 affiliation: 2 - name: Giuseppe Casalicchio orcid: 0000-0001-5324-5966 affiliation: 2 - name: Lars Kotthoff orcid: 0000-0003-4635-6873 affiliation: 3 - name: Bernd Bischl orcid: 0000-0001-6002-6980 affiliation: 2 affiliations: - name: TU Dortmund University index: 1 - name: LMU Munich index: 2 - name: University of Wyoming index: 3 date: 1 December 2019 bibliography: paper.bib
The R
[@R] package mlr3
and its associated ecosystem of extension packages implements a powerful, object-oriented and extensible framework for machine learning (ML) in R
.
It provides a unified interface to many learning algorithms available on CRAN, augmenting them with model-agnostic general-purpose functionality that is needed in every ML project, for example train-test-evaluation, resampling, preprocessing, hyperparameter tuning, nested resampling, and visualization of results from ML experiments.
The package is a complete reimplementation of the mlr
[@mlr] package that leverages many years of experience and learned best practices to provide a state-of-the-art system that is powerful, flexible, extensible, and maintainable.
We target both practitioners who want to quickly apply ML algorithms to their problems and researchers who want to implement, benchmark, and compare their new methods in a structured environment.
mlr3
is suitable for short scripts that test an idea, for complex multi-stage experiments with advanced functionality that use a broad range of ML functionality, as a foundation to implement new ML (meta-)algorithms (for example AutoML systems), and everything in between.
Functional correctness is ensured through extensive unit and integration tests.
Several other general-purpose ML toolboxes exist for different programing languages.
The most widely used ones are scikit-learn [@sk-learn] for Python , Weka [@weka] for Java, and mlj [@mlj] for Julia.
The most important toolboxes for R
are mlr, caret [@caret] and tidymodels [@tidymodels].
R
The predecessor package mlr
was first released to CRAN in 2013, with the core design and architecture dating back much further.
As with most software, more code was added over time to integrate more ML algorithms, more approaches for feature selection or hyperparameter tuning, more methods to analyze trained models, and many other things.
With each addition, the code base became larger and more difficult to test and maintain, in particular as changes in the dozens of packages that we integrated with mlr
would break our code and prevent releases.
Installing the package with all dependencies and a complete build with all tests would take hours -- we had arrived at a point where adding any new functionality became a major undertaking.
Further, some of the architectural and design decisions made it essentially impossible to support new cross-cutting functionality, for example ML pipelines, or using new R
packages for better performance.
mlr3
takes these lessons learned to heart and now follows these design principles:
mlr3
package provides only the basic building blocks of ML: tasks, a few learners, resampling methods, and performance measures.
Everything else can be installed and loaded separately through additional packages in the mlr3
ecosystem, for example support for other kinds of data, methods for tuning hyperparameters, or integrations for additional ML packages.R
packages, especially data.table
for fast and efficient computations on rectangular data.R6
for a clean object-oriented design, object state changes, and reference semantics.checkmate
[@checkmate].
Return types are documented and automatic type casting for "simplification" is avoided.In addition, we simplified the API considerably by unifying container and result classes. Many result objects are now tabular by mixing data.table
's list-column feature with R6
objects, which also allows for easy and efficient selection and "split-apply-combine" type operations.
In addition to the main mlr3
package, mlr3learners
provides integrations to a careful selection of the most important ML algorithms and packages in R
.
Complex ML workflows (using directed acyclic graphs) that can incorporate preprocessing, (stacking) ensembles, alternative-branch execution, and much more can be built with the mlr3pipelines
package.
Funtionality for hyperparameter tuning and nested resampling of learners and complex pipelines is provided by the mlr3tuning
package.
mlr3filters
integrates many feature filtering techniques and mlr3db
allows direct use of databases as data sources for out-of-memory data.
We are planning and working on many more packages; for example for Bayesian optimization, Hyperband, probabilistic regression, survival analysis, and spatial and temporal data.
A complete list of existing and planned extension packages can be found on the mlr3 wiki.
mlr3
and its ecosystem are documented in numerous manual pages and a comprehensive book (work in progress).
All packages are licensed under GNU Lesser General Public License (LGPL-3).
This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibilities for its content.
This work was partly supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876, A3.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.