```r
library(mlrng)
knitr::opts_chunk$set(
  datatable.print.class = TRUE,
  R.options = list(mlrng.debug = FALSE)
)
set.seed(123)
```
The package provides R6 classes for the building blocks of machine learning: tasks, learners, resamplings and performance measures.

All objects are stored in dictionaries (a.k.a. hash maps): `mlr.tasks` comes with some predefined toy tasks, `mlr.learners` with learners, `mlr.resamplings` with different resampling methods, and `mlr.measures` with some popular performance measures.
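The dictionary idea can be sketched in base R with a hashed environment. This is only an illustration of the concept, not mlrng's `Dictionary` implementation (which additionally offers methods such as `$get()` and `$summary()`):

```r
# A minimal dictionary sketch using a hashed environment (base R).
# Entries are stored and retrieved by a string key, like a hash map.
dict = new.env(hash = TRUE, parent = emptyenv())

# add two entries, keyed by id
assign("iris", list(id = "iris", type = "classif"), envir = dict)
assign("bh",   list(id = "bh",   type = "regr"),    envir = dict)

# retrieve by key and list all keys
task = get("iris", envir = dict)
print(task$type)   # "classif"
print(ls(dict))    # all stored keys
```

Lookup by key is constant-time on average, which is why dictionaries scale well to many registered objects.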
We use the `iris` data set to create a multiclass classification task:

```r
iris.task = TaskClassif$new(data = iris, target = "Species")
```
Task objects come with some handy self-explanatory getters:
```r
# id
iris.task$id
# dimension
c(iris.task$nrow, iris.task$ncol)
# name of the target column
iris.task$target
# names of the feature columns
iris.task$features
# formula describing the task
iris.task$formula
# column types
iris.task$col.types
# number of classes
iris.task$nclasses
# class levels
iris.task$classes
# positive class (not feasible for multiclass classification)
iris.task$positive
# missing values per column
iris.task$missing.values
# peek into the data
iris.task$head()
# complete data
iris.task$data
```
In `mlrng`, tasks can rely on different data backends to hold tabular data. By default, data is stored in an SQLite database, where it does not occupy any memory unless fetched from the database. This allows you to work with hundreds of tasks simultaneously or to conveniently learn on subsets of "big data". If not configured differently, R's temp directory is used to store the database. Alternatively, you can opt to hold the data in memory (for a small performance boost) or connect to a full-fledged DBMS like PostgreSQL or MariaDB.
This is covered in detail in the help (FIXME).
The backend treats the data as immutable. While some operations work without touching or querying the data (e.g. filtering rows or selecting columns), other (preprocessing) operations create an in-memory copy of the data (or of the subset) on the fly.
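The cheap filtering operations can be mimicked in base R by storing only row and column indices and materializing the data on demand. This is an illustrative sketch only; mlrng's backend works against a database, not a `data.frame`, and `make_view` is a hypothetical helper:

```r
# Sketch of "lazy" subsetting: instead of copying the data, we keep
# row and column indices and create the in-memory copy only on demand.
make_view = function(data, rows = seq_len(nrow(data)), cols = names(data)) {
  list(
    nrow = length(rows),
    cols = cols,
    # materialize() performs the actual (expensive) copy
    materialize = function() data[rows, cols, drop = FALSE]
  )
}

v = make_view(iris, rows = 1:120,
  cols = setdiff(names(iris), "Petal.Length"))
v$nrow                    # 120, answered without copying any data
head(v$materialize())     # copy happens only here
```

Filtering a view is just index arithmetic, which is why such operations stay cheap regardless of data size.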
```r
# subset to 120 rows and remove column "Petal.Length"
keep = setdiff(c(iris.task$target, iris.task$features), "Petal.Length")
iris.task$subset(rows = 1:120, cols = keep)
iris.task$nrow
iris.task$features
```
If, for example, you subsample your data to only 0.1% of all observations before preprocessing and feeding it into a model, the memory footprint stays reasonable.
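The subsampling step itself is plain index sampling. A base-R sketch (using 10% of `iris` for illustration, since 0.1% of 150 rows would be empty):

```r
# Draw a 10% subsample of rows before any expensive preprocessing,
# so all later copies operate on the small subset only.
set.seed(1)
frac  = 0.1
idx   = sample(nrow(iris), size = max(1L, floor(frac * nrow(iris))))
small = iris[idx, , drop = FALSE]
nrow(small)   # 15 rows instead of 150
```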
The package ships with some popular tasks to toy around with.
Like most objects in `mlrng`, tasks are stored in a `Dictionary`, which here is called `mlr.tasks`:

```r
print(mlr.tasks)
print(mlr.tasks$summary())
```
We use the `$get()` function to retrieve a specific task, here a regression task based on the `BostonHousing` dataset:

```r
bh.task = mlr.tasks$get("bh")
bh.task$head()
```
Learners interface statistical learning algorithms which implement two steps:

1. In the first step, they are provided some training data to fit a model.
2. In the second step, this model is used to predict on data where the outcome is unknown.
Many popular learners are connected to `mlrng` in the package `mlrnglearners` (FIXME) to keep the dependency chain reasonable. Users and package authors can define their own learners (but be sure to check if someone else has already done the job for you!).
Here, we create a new, simple learner which takes a classification problem and predicts the majority class.
Learners must follow these conventions:

- The training function `train()` takes a task and a training subset. It should only use the respective subset of the task to build the model. The return value can be an arbitrary R object, which will be passed to `predict()`.
- The predict function `predict()` gets the return value of the train function as argument `model` and the data to predict on as argument `newdata`.
```r
lrn = LearnerClassif$new(
  name = "majority",
  properties = c("missings", "feat.factor", "feat.numeric"),
  train = function(task, subset, ...) {
    truth = task$get(subset, task$target)[[1L]]
    list(majority.class = names(which.max(table(truth))))
  },
  predict = function(model, newdata, ...) {
    rep.int(model$majority.class, nrow(newdata))
  }
)
```
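Stripped of the mlrng wrapper, the train/predict logic of this learner boils down to two small base-R functions (names here are illustrative):

```r
# The same majority-class logic in plain base R, outside the mlrng API:
# "training" finds the most frequent class, "predicting" repeats it.
train_majority = function(truth) {
  names(which.max(table(truth)))
}
predict_majority = function(model, n) {
  rep.int(model, n)
}

# first 120 iris rows: 50 setosa, 50 versicolor, 20 virginica;
# ties are broken by which.max in favor of the first level
m = train_majority(iris$Species[1:120])
m                        # "setosa"
predict_majority(m, 3)   # "setosa" "setosa" "setosa"
```

Note how the returned model is just an arbitrary R object (here a single string), exactly as the convention above allows.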
All learners are stored in a register called `mlr.learners` and can easily be listed:

```r
mlr.learners
mlr.learners$summary()
```
You can retrieve learners from the dictionary `mlr.learners`:

```r
lrn.dummy = mlr.learners$get("classif.dummy")
```
The parameter set is stored in the slot `par.set`, and parameters deviating from the defaults are stored in `par.vals`:

```r
lrn.dummy$par.set
lrn.dummy$par.vals
```
Now, we set the parameter `method` to `"sample"`, change the `id`, and add the learner to the register:

```r
lrn.dummy$par.vals = list(method = "sample")
lrn.dummy$id = "classif.dummy.sample"
mlr.learners$add(lrn.dummy)
mlr.learners$summary()
```
From now on, we can just pass the id `"classif.dummy.sample"` to other functions to use this learner.
```r
mlr.measures$summary()
measure = mlr.measures$get("mmce")
```
```r
mlr.resamplings
mlr.resamplings$summary()
r = mlr.resamplings$get("cv")
print(r)
# change to 3-fold cv
r$iters = 3
```
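Under the hood, k-fold cross-validation just partitions the row indices into k disjoint folds. A base-R sketch of the 3-fold case (illustrative, not mlrng's implementation):

```r
# Manual 3-fold cross-validation indices: shuffle the row ids,
# then split them into 3 equally sized, disjoint folds.
set.seed(123)
n     = 150
folds = split(sample(n), rep(1:3, length.out = n))
lengths(folds)   # 50 50 50

# train/test split for the first resampling iteration:
test.idx  = folds[[1]]
train.idx = setdiff(seq_len(n), test.idx)
length(train.idx)   # 100
```

Each observation lands in the test set of exactly one iteration, so the per-fold performance values can be averaged into a single estimate.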
Here, we fit a simple CART on a random subset of the iris task.
The returned object is a TrainResult
:
```r
task = mlr.tasks$get("iris")
lrn = mlr.learners$get("classif.rpart")
set.seed(123)
train = sample(task$nrow, 120)
tr = train(task, lrn, subset = train)
print(tr)
tr$train.log
```
We can access the returned `rpart` model via the slot `$rmodel`:

```r
print(tr$rmodel)
```
Next, we can use the `TrainResult` to predict on the left-out observations:

```r
test = setdiff(1:task$nrow, train)
pr = predict(tr, subset = test)
```
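The `mmce` measure used below is simply the share of wrong predictions on the test set. Computed by hand on a holdout split, with the majority-class idea as a stand-in model (not `rpart`):

```r
# Computing mmce (mean misclassification error) by hand on a holdout set.
set.seed(123)
train.idx = sample(nrow(iris), 120)
test.idx  = setdiff(seq_len(nrow(iris)), train.idx)

# stand-in "model": predict the majority class of the training rows
majority = names(which.max(table(iris$Species[train.idx])))
pred     = rep(majority, length(test.idx))
truth    = as.character(iris$Species[test.idx])

mmce = mean(pred != truth)
mmce   # fraction of misclassified test observations, in [0, 1]
```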
```r
rr = resample(task = iris.task, learner = lrn.dummy, resampling = r,
  measures = list(measure))
rr$data
rr$aggr
```
```r
tasks = lapply(c("iris", "sonar"), mlr.tasks$get)
learners = lapply(c("classif.dummy", "classif.rpart"), mlr.learners$get)
resamplings = lapply("cv", mlr.resamplings$get)
measures = lapply("mmce", mlr.measures$get)
withr::with_options(list(mlrng.verbose = FALSE), {
  bmr = benchmark(
    tasks = tasks,
    learners = learners,
    resamplings = resamplings,
    measures = measures)
})
bmr$data
```