library(mlrng)
knitr::opts_chunk$set(
  datatable.print.class = TRUE,
  R.options = list(mlrng.debug = FALSE)
)
set.seed(123)

Building Blocks

The package provides R6 classes for the building blocks of machine learning:

All objects are stored in dictionaries (a.k.a. hash maps): mlr.tasks comes with some predefined toy tasks, mlr.learners with learners, mlr.resamplings with different resampling methods and mlr.measures with some popular performance measures.

Tasks

Task Creation

We use the iris data set to create a multilabel classification task:

iris.task = TaskClassif$new(data = iris, target = "Species")

Task Object

Task objects come with some handy self-explanatory getters:

# id
iris.task$id

# dimension
c(iris.task$nrow, iris.task$ncol)

# name of the target column
iris.task$target

# names of the feature columns
iris.task$features

# formula describing the task
iris.task$formula

# column types
iris.task$col.types

# number of classes
iris.task$nclasses

# class levels
iris.task$classes

# positive class (not feasible for multilabel classification)
iris.task$positive

# missing values per column
iris.task$missing.values

# peek into the data
iris.task$head()

# complete data
iris.task$data

In mlrng, tasks can rely on different data backends to hold tabular data. Per default, data is stored in SQLite data bases where the data does not occupy any memory unless fetched from the data base. This allows you to work with hundreds of tasks simultaneously or to conveniently learn on subsets of "big data". If not configured differently, the temp directory of R is used to store the data base. Alternatively, you can opt to hold the data in memory (for a small performance boost) or to connect to a real DBMS like PostgresSQL or MariaDB. This is covered in detail in the help (FIXME).

The backend treats the data as immutable. While some operations work without touching or querying the data (filtering rows or selecting columns), other (preprocessing) operations will create an in-memory copy of the data (or subset) on the fly.

# subset to 120 rows and remove column "Petal.Length"
keep = setdiff(c(iris.task$target, iris.task$features), "Petal.Length")
iris.task$subset(rows = 1:120, cols = keep)

iris.task$nrow
iris.task$features

If you for example subsample your data first to only use 0.1% of all observations before preprocessing and feeding it into a model, you can keep the memory footprint reasonable.

Predefined Tasks

The package ships with some popular tasks to toy around with. Like most objects in mlrng, tasks are stored in a Dictionary which is called mlr.tasks here:

print(mlr.tasks)
print(mlr.tasks$summary())

We use the $get() function to retrieve a specific task, here a regression task based on the dataset BostonHousing:

bh.task = mlr.tasks$get("bh")
bh.task$head()

Learners

Learners interface statistical learning algorithms which implement two steps: In the first step, they are provided some training data to fit a model. In the second step, this model is used to predict on data where the outcome is unknown. Many popular learners a connected to mlrng in the package mlrnglearners (FIXME) to keep the dependency chain reasonable.

Create Learners

Users and package authors can define their own learners (but be sure to check if someone else already did the job for you!).

Here, we create a new, simple learner which takes a classification problem and predicts the majority class. Learners must follow the these conventions: The training function train() takes a task and a training subset. The train() function should only use the respective subset of the task to build the model. The return value can be an arbitrary R object which will be passed to predict(). The predict function predict() gets the return value of the train function as argument model and the data to predict on as argument newdata.

lrn = LearnerClassif$new(
  name = "majority",
  properties = c("missings", "feat.factor", "feat.numeric"),

  train = function(task, subset, ...) {
    truth = task$get(subset, task$target)[[1L]]
    list(majority.class = names(which.max(table(truth))))
  },

  predict = function(model, newdata, ...) {
    rep.int(model$majority.class, nrow(newdata))
  }
)

Predefined Learners

All learners are stored in a register called mlr.learners and can easily be listed:

mlr.learners
mlr.learners$summary()

Dummy classification learner

You can retrieve learners from the dictionary Learners:

lrn.dummy = mlr.learners$get("classif.dummy")

The parameter set is stored in the slot par.set and parameters deviating from the default are stored in par.vals:

lrn.dummy$par.set
lrn.dummy$par.vals

Now, we set the parameter method to "sample", change the id and add the learner to the register:

lrn.dummy$par.vals = list(method = "sample")
lrn.dummy$id = "classif.dummy.sample"
mlr.learners$add(lrn.dummy)
mlr.learners$summary()

From now on, we can just pass the id "classif.dummy.sample" to other functions to use this learner.

Measures

mlr.measures$summary()
measure = mlr.measures$get("mmce")

Resampling

mlr.resamplings
mlr.resamplings$summary()
r = mlr.resamplings$get("cv")
print(r)

# change to 3-fold cv
r$iters = 3

Train and Predict

Here, we fit a simple CART on a random subset of the iris task. The returned object is a TrainResult:

task = mlr.tasks$get("iris")
lrn = mlr.learners$get("classif.rpart")
set.seed(123); train = sample(task$nrow, 120)
tr = train(task, lrn, subset = train)
print(tr)
tr$train.log

We can access the returned rpart model via the slot $rmodel:

print(tr$rmodel)

Next, we can use the TrainResult to predict on the left-out observations:

test = setdiff(1:task$nrow, train)
pr = predict(tr, subset = test)

Resampling

rr = resample(task = iris.task, learner = lrn.dummy, resampling = r, measures = list(measure))
rr$data
rr$aggr

Benchmarking

tasks = lapply(c("iris", "sonar"), mlr.tasks$get)
learners = lapply(c("classif.dummy", "classif.rpart"), mlr.learners$get)
resamplings = lapply("cv", mlr.resamplings$get)
measures = lapply("mmce", mlr.measures$get)

withr::with_options(list(mlrng.verbose = FALSE), {
  bmr = benchmark(
    tasks = tasks,
    learners = learners,
    resamplings = resamplings,
    measures = measures)
})
bmr$data


mlr-org/mlrng documentation built on May 4, 2019, 4:22 p.m.