# Technical {#technical}

This chapter provides an overview of the technical details of the `r mlr_pkg("mlr3")` framework.

**Parallelization**

First, we give some details on parallelization and the use of the `r cran_pkg("future")` package. Parallelization refers to running multiple jobs simultaneously in order to reduce the overall runtime of a computation. Algorithms consist of both sequential (non-parallelizable) and parallelizable parts, so parallelization does not always substantially improve runtime. In summary, this subsection illustrates how and when to use parallelization in mlr3.

**Database Backends**

The section Database Backends describes how to work with the database backends that `r mlr_pkg("mlr3")` supports. Database backends are helpful for data that does not fit into memory or is stored natively in a database (e.g. SQLite). Specifically when working with large data sets, or when running numerous tasks simultaneously, it can be advantageous to interface out-of-memory data. The section illustrates how to use database backends with the NYC flight data.

**Parameters**

The section Parameters gives instructions on how to define parameter spaces, inspect and set parameter values, and sample or transform parameters.

For illustrative purposes, this subsection uses the `r mlr_pkg("paradox")` package, the successor of `r cran_pkg("ParamHelpers")`.

**Logging and Verbosity**

The subsection on Logging and Verbosity shows how to change the most important settings related to logging. In `r mlr_pkg("mlr3")` we use the `r cran_pkg("lgr")` package.

**Transition Guide**

Lastly, we provide a Transition Guide for users of the old `r mlr_pkg("mlr")` package who want to switch to `r mlr_pkg("mlr3")`.

## Parallelization {#parallelization}

Parallelization refers to the process of running multiple jobs simultaneously. This allows for significant savings in runtime.

`r gh_pkg("mlr-org/mlr3")` uses the `r cran_pkg("future")` backends for parallelization. Make sure you have installed the required packages `r cran_pkg("future")` and `r cran_pkg("future.apply")`:
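```r
# one-time setup: install the parallelization packages from CRAN
install.packages(c("future", "future.apply"))
```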

`r gh_pkg("mlr-org/mlr3")` is capable of parallelizing a variety of different scenarios. One of the most common use cases is to parallelize the `r ref("Resampling")` iterations. See the section on Resampling for a detailed introduction to resampling.

In the following section, we will use the spam task and a simple classification tree (`"classif.rpart"`) to showcase parallelization. We use the `r cran_pkg("future")` package to parallelize the resampling by selecting a backend via the function `r ref("future::plan()")`. We use the `"multiprocess"` backend here, which resolves to forked R processes ("multicore") on UNIX-based systems and to a cluster of background R sessions ("multisession") on Windows.

```r
future::plan("multiprocess")

task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("subsampling")

time = Sys.time()
resample(task, learner, resampling)
Sys.time() - time
```

```{block, type='caution'}
By default, all CPUs of your machine are used unless you specify the argument `workers` in `future::plan()`.
```

On most systems you should see a decrease in the reported elapsed time.
On some systems (e.g. Windows), the overhead for parallelization is quite large though.
Therefore, it is advised to only enable parallelization for resamplings where each iteration runs for at least 10 seconds.
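As a hedged sketch, the number of workers can also be restricted explicitly, and parallelization can be switched off again by returning to the sequential plan:

```r
# use at most 4 workers instead of all available CPUs
future::plan("multiprocess", workers = 4)

# disable parallelization again, e.g. for short-running resamplings
future::plan("sequential")
```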

**Choosing the parallelization level**

If you are transitioning from `r cran_pkg("mlr")`, you might be used to selecting different parallelization levels, e.g. for resampling, benchmarking or tuning.
In `r gh_pkg("mlr-org/mlr3")` this is no longer required.
All kinds of events are parallelized on the same level.
Therefore, there is no need to decide whether you want to parallelize the tuning OR the resampling.


Just lean back and let the machine do the work :-)

## Error Handling {#error-handling}

To demonstrate how to properly deal with misbehaving learners, `r gh_pkg("mlr-org/mlr3")` ships with the learner `r ref("mlr_learners_classif.debug", "classif.debug")`:

```r
task = tsk("iris")
learner = lrn("classif.debug")
print(learner)
```

This learner comes with special hyperparameters that let us control

  1. what conditions should be signaled (message, warning, error, segfault) with what probability,
  2. during which stage the conditions should be signaled (train or predict), and
  3. the ratio of predictions being `NA` (`predict_missing`).

```r
learner$param_set
```

With its default settings, the learner does nothing special: it learns a random label and creates constant predictions.

```r
task = tsk("iris")
learner$train(task)$predict(task)$confusion
```

We now set a hyperparameter to let the debug learner signal an error during the train step. By default, `r gh_pkg("mlr-org/mlr3")` does not catch conditions such as warnings or errors raised by third-party code like learners:

```r
learner$param_set$values = list(error_train = 1)
learner$train(tsk("iris"))
```

If this were a regular learner, we could now start debugging with `r ref("traceback()")` (or create a minimal reproducible example (MRE) to file a bug report).

However, it is not uncommon for machine learning algorithms to raise errors, as algorithms typically cannot process all possible data. Thus, we need a mechanism to

  1. capture all signaled conditions such as messages, warnings and errors so that we can analyze them post-hoc, and
  2. proceed with the calculation in a statistically sound way so that we can still aggregate over partial results.

These two mechanisms are explained in the following subsections.

### Encapsulation

With encapsulation, exceptions do not stop the program flow and all output is logged to the learner (instead of being printed to the console). Each `r ref("Learner")` has a field `$encapsulate` to control how the train or predict steps are executed. One way to encapsulate the execution is provided by the package `r cran_pkg("evaluate")` (see `r ref("encapsulate()")` for more details):

```r
task = tsk("iris")
learner = lrn("classif.debug")
learner$param_set$values = list(warning_train = 1, error_train = 1)
learner$encapsulate = c(train = "evaluate", predict = "evaluate")

learner$train(task)
```

After training the learner, we can access the recorded log via the fields `$log`, `$warnings` and `$errors`:

```r
learner$log
learner$warnings
learner$errors
```

Another method for encapsulation is implemented in the `r cran_pkg("callr")` package. `r cran_pkg("callr")` spawns a new R process to execute the respective step, and thus even guards the current session against segfaults. On the downside, starting new processes comes with computational overhead:

```r
learner$encapsulate = c(train = "callr", predict = "callr")
learner$param_set$values = list(segfault_train = 1)
learner$train(task = task)
learner$errors
```

Without a model, however, it is not possible to get predictions:

```r
learner$predict(task)
```

To handle the missing predictions in a graceful way during `r ref("resample()")` or `r ref("benchmark()")`, fallback learners are introduced next.

### Fallback learners

Fallback learners allow scoring results in cases where a `r ref("Learner")` is misbehaving, e.g. failing completely during training or prediction, or returning predictions for only a subset of observations.

We first handle the most common case that a learner completely breaks while fitting a model or while predicting on new data. If the learner fails in either of these two steps, we rely on a second learner to generate predictions: the fallback learner.

In the next example, we attach a simple featureless learner to the debug learner as fallback. Whenever the debug learner fails (which is every time with the given parametrization) and encapsulation is enabled, mlr3 internally falls back to the predictions of the featureless learner:

```r
task = tsk("iris")
learner = lrn("classif.debug")
learner$param_set$values = list(error_train = 1)
learner$encapsulate = c(train = "evaluate")
learner$fallback = lrn("classif.featureless")
learner$train(task)
learner
```

Note that the log contains the captured error (which is also included in the print output), and although we don't have a model, we can still get predictions:

```r
learner$model
prediction = learner$predict(task)
prediction$score()
```

While the fallback learner is of limited use for this stepwise train-predict procedure, it is invaluable for larger benchmark studies where only a few resampling iterations fail. Here, we need to replace the missing scores with a number in order to aggregate over all resampling iterations, and imputing a number that is equivalent to guessing labels often is the right amount of penalization.

In the following snippet we compare the previously created debug learner with a simple classification tree. We re-parametrize the debug learner to fail in roughly 30% of the resampling iterations during the training step:

```r
learner$param_set$values = list(error_train = 0.3)

bmr = benchmark(benchmark_grid(tsk("iris"), list(learner, lrn("classif.rpart")), rsmp("cv")))
aggr = bmr$aggregate(conditions = TRUE)
aggr
```

To further investigate the errors, we can extract the `r ref("ResampleResult")`:

```r
rr = aggr[learner_id == "classif.debug"]$resample_result[[1L]]
rr$errors
```

A similar yet different problem emerges when a learner predicts only a subset of the observations in the test set (and predicts `NA` for the others). Handling such predictions in a statistically sound way is not straightforward and a common source of over-optimism when reporting results. Imagine that our goal is to benchmark two algorithms using 10-fold cross-validation on some binary classification task, where algorithm A yields predictions for all observations while algorithm B only predicts labels for the observations it is sufficiently confident about.

When comparing the performance of these two algorithms, it is obviously not fair to average over all predictions of algorithm A while only averaging over the "easy-to-predict" observations for algorithm B. By doing so, algorithm B would easily outperform algorithm A, without factoring in that it cannot generate predictions for many observations. On the other hand, it is also not feasible to exclude all observations from the test set of a benchmark study where at least one algorithm failed to predict a label. Instead, we proceed by imputing all missing predictions with something naive, e.g. by predicting the majority class with a featureless learner. And as the majority class may depend on the resampling split (or we may opt for some other arbitrary baseline learner), it is best to just train a second learner on the same resampling split.

Long story short, if a fallback learner is involved, missing predictions of the base learner will be automatically replaced with predictions from the fallback learner. This is illustrated in the following example:

```r
task = tsk("iris")
learner = lrn("classif.debug")

# this hyperparameter sets the ratio of missing predictions
learner$param_set$values = list(predict_missing = 0.5)

# without fallback
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

# with fallback
learner$fallback = lrn("classif.featureless")
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")
```

In summary, by combining encapsulation and fallback learners, it is possible to benchmark even quite unreliable or unstable learning algorithms in a convenient way.

## Database Backends {#backends}

In mlr3, `r ref("Task")`s store their data in an abstract data format, the `r ref("DataBackend")`. The default backend uses `r cran_pkg("data.table")` via the `r ref("DataBackendDataTable")` as an in-memory database.
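As a brief illustration (a minimal sketch using only the default in-memory backend), a `DataBackend` can be created from any `data.frame` via `r ref("as_data_backend()")` and then passed to a task constructor:

```r
library("mlr3")

# the default in-memory backend, backed by data.table
b = as_data_backend(iris)
print(b)

# construct a task on top of the backend
task = TaskClassif$new("iris", b, target = "Species")
```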

For larger data, or when working on many tasks in parallel, it can be advantageous to interface out-of-memory data. We use the excellent R package `r cran_pkg("dbplyr")`, which extends `r cran_pkg("dplyr")` to work on many popular databases like MariaDB, PostgreSQL or SQLite.

### Use Case: NYC Flights

To generate a halfway realistic scenario, we use the NYC flights data set from the package `r cran_pkg("nycflights13")`:

```r
# load data
requireNamespace("DBI")
requireNamespace("RSQLite")
requireNamespace("nycflights13")
data("flights", package = "nycflights13")
str(flights)

# add column of unique row ids
flights$row_id = 1:nrow(flights)

# create sqlite database in temporary file
path = tempfile("flights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbDisconnect(con)

# remove in-memory data
rm(flights)
```

### Preprocessing with dplyr

With the SQLite database stored in file `path`, we now re-establish a connection and switch to `r cran_pkg("dplyr")`/`r cran_pkg("dbplyr")` for some essential preprocessing.

```r
# establish connection
con = DBI::dbConnect(RSQLite::SQLite(), path)

# select the "flights" table, enter dplyr
library(dplyr)
library(dbplyr)
tbl = tbl(con, "flights")
```

First, we select a subset of columns to work on:

```r
keep = c("row_id", "year", "month", "day", "hour", "minute", "dep_time",
  "arr_time", "carrier", "flight", "air_time", "distance", "arr_delay")
tbl = select(tbl, keep)
```

Additionally, we remove those observations where the arrival delay (`arr_delay`) has a missing value:

```r
tbl = filter(tbl, !is.na(arr_delay))
```

To keep runtime reasonable for this toy example, we filter the data to only use every second row:

```r
tbl = filter(tbl, row_id %% 2 == 0)
```

The factor levels of the feature `carrier` are merged so that infrequent carriers are replaced by level `"other"`:

```r
tbl = mutate(tbl, carrier = case_when(
    carrier %in% c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~ "other",
    TRUE ~ carrier)
)
```

### DataBackendDplyr

The processed table is now used to create a `r ref("mlr3db::DataBackendDplyr")` from `r mlr_pkg("mlr3db")`:

```r
library("mlr3db")
b = as_data_backend(tbl, primary_key = "row_id")
```

We can now use the interface of `r ref("DataBackend")` to query some basic information about the data:

```r
b$nrow
b$ncol
b$head()
```

Note that the `r ref("DataBackendDplyr")` does not know about any rows or columns we have filtered out with `r cran_pkg("dplyr")` before; it just operates on the view we provided.

### Model fitting

We create the following `r mlr_pkg("mlr3")` objects:

```r
task = TaskRegr$new("flights_sqlite", b, target = "arr_delay")
learner = lrn("regr.rpart")
measures = mlr_measures$mget(c("regr.mse", "time_train", "time_predict"))
resampling = rsmp("subsampling")
resampling$param_set$values = list(repeats = 3, ratio = 0.02)
```

We pass all these objects to `r ref("resample()")` to perform a simple resampling with three iterations. In each iteration, only the required subset of the data is queried from the SQLite database and passed to `r ref("rpart::rpart()")`:

```r
rr = resample(task, learner, resampling)
print(rr)
rr$aggregate(measures)
```

### Cleanup

Finally, we remove the `tbl` object and close the connection:

```r
rm(tbl)
DBI::dbDisconnect(con)
# more cleanups
rm(list = c("b", "task", "learner", "measures", "resampling", "rr"))
```

## Parameters (using paradox) {#paradox}

The `r mlr_pkg("paradox")` package offers a language for the description of parameter spaces, as well as tools for useful operations on these parameter spaces. A parameter space is often needed when describing, for example, the hyperparameters of a machine learning algorithm or the inputs of a function to be optimized.

The tools provided by paradox therefore relate to defining such parameter spaces, checking parameter values for validity, and sampling values or generating designs from these spaces.

`r mlr_pkg("paradox")` is, by nature, an auxiliary package that derives its usefulness from other packages that make use of it. It is heavily utilized in other mlr-org packages such as `r mlr_pkg("mlr3")`, `r mlr_pkg("mlr3pipelines")`, and `r mlr_pkg("mlr3tuning")`.

### Reference Based Objects

`r mlr_pkg("paradox")` is the spiritual successor to the `r cran_pkg("ParamHelpers")` package and was written from scratch using the `r cran_pkg("R6")` class system. The most important consequence of this is that all objects created in paradox are "reference-based", unlike most other objects in R. When a change is made to a `ParamSet` object, for example by adding a parameter using the `$add()` function, all variables that point to this `ParamSet` will contain the changed object. To create an independent copy of a `ParamSet`, the `$clone()` method needs to be used:

```r
library("paradox")

ps = ParamSet$new()
ps2 = ps
ps3 = ps$clone(deep = TRUE)
print(ps) # the same for ps2 and ps3
ps$add(ParamLgl$new("a"))
print(ps)  # ps was changed
print(ps2) # contains the same reference as ps
print(ps3) # is a "clone" of the old (empty) ps
```

### Defining a Parameter Space

#### Single Parameters

The basic building block for describing parameter spaces is the `Param` class. It represents a single parameter, which usually can take a single atomic value. Consider, for example, trying to configure the `rpart` package's `rpart.control` object. It has various components (`minsplit`, `cp`, ...) that all take a single value, and that would all be represented by a different instance of a `Param` object.

The `Param` class has various sub-classes that represent different value types:

* `ParamLgl`: logical values (`TRUE` / `FALSE`)
* `ParamInt`: integer values
* `ParamDbl`: real-valued (double) values
* `ParamFct`: discrete values from a set of levels, similar to R `factor`s
* `ParamUty`: untyped values

A particular instance of a parameter is created by calling the attached `$new()` function:

```r
library("paradox")
parA = ParamLgl$new(id = "A")
parB = ParamInt$new(id = "B", lower = 0, upper = 10, tags = c("tag1", "tag2"))
parC = ParamDbl$new(id = "C", lower = 0, upper = 4, special_vals = list(NULL))
parD = ParamFct$new(id = "D", levels = c("x", "y", "z"), default = "y")
parE = ParamUty$new(id = "E", custom_check = function(x) checkmate::checkFunction(x))
```

Every parameter must have an `id`, and can optionally have a `default` value, a list of `special_vals` that are accepted even though they do not fall within the parameter's type or range, and a vector of `tags`.

The numeric (`Int` and `Dbl`) parameters furthermore allow for the specification of a `lower` and `upper` bound. Meanwhile, the `Fct` parameter must be given a vector of `levels` that define the possible states its parameter can take. The `Uty` parameter can also have a `custom_check` function that must return `TRUE` when a value is acceptable and may return a `character(1)` error description otherwise. The example above defines `parE` as a parameter that only accepts functions.
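A quick sketch of the `custom_check` in action, using the `$test()` method introduced in the next subsection:

```r
parE$test(sum)    # TRUE: a function is accepted
parE$test("sum")  # FALSE: a character value fails the custom check
```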

All values which are given to the constructor are then accessible from the object for inspection using `$`. Although all these values can be changed for a parameter after construction, this can be a bad idea and should be avoided when possible.

Instead, a new parameter should be constructed. Besides the possible values that can be given to a constructor, there are also the `$class`, `$nlevels`, `$is_bounded`, `$has_default`, `$storage_type`, `$is_number` and `$is_categ` slots that give information about a parameter.

A list of all slots can be found in `r ref("Param", "?Param")`.

```r
parB$lower
parA$levels
parE$class
```

It is also possible to get all information of a `Param` as a `data.table` by calling `as.data.table()`:

```r
as.data.table(parA)
```

#### Type / Range Checking

A `Param` object offers the possibility to check whether a value satisfies its condition, i.e. is of the right type and also falls within the range of allowed values, using the `$test()`, `$check()`, and `$assert()` functions. `$test()` should be used within conditional checks and returns `TRUE` or `FALSE`, while `$check()` returns an error description when a value does not conform to the parameter (and thus plays well with the `r ref("checkmate::assert()")` function). `$assert()` will throw an error whenever a value does not fit.

```r
parA$test(FALSE)
parA$test("FALSE")
parA$check("FALSE")
```

Instead of testing single parameters, it is often more convenient to check a whole set of parameters using a `ParamSet`.

#### Parameter Sets

The ordered collection of parameters is handled in a `ParamSet`^[Although the name is suggestive of a "Set"-valued `Param`, this is unrelated to the other objects that follow the `ParamXxx` naming scheme.]. It is initialized using the `$new()` function and optionally takes a list of `Param`s as argument. Parameters can also be added to the constructed `ParamSet` using the `$add()` function. It is even possible to add whole `ParamSet`s to other `ParamSet`s.

```r
ps = ParamSet$new(list(parA, parB))
ps$add(parC)
ps$add(ParamSet$new(list(parD, parE)))
print(ps)
```

The individual parameters can be accessed through the `$params` slot. It is also possible to get information about all parameters in a vectorized fashion using mostly the same slots as for individual `Param`s (i.e. `$class`, `$levels` etc.), see `?ParamSet` for details.
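For example, a small sketch using the `ps` object from above (assuming the vectorized slots mirror those of the individual `Param`s):

```r
ps$class   # one class entry per parameter
ps$levels  # levels of the categorical parameters
```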

It is possible to reduce `ParamSet`s using the `$subset()` method. Be aware that it modifies a `ParamSet` in-place, so a "clone" must be created first if the original `ParamSet` should not be modified:

```r
psSmall = ps$clone()
psSmall$subset(c("A", "B", "C"))
print(psSmall)
```

Just as for `Param`s, and much more useful, it is possible to get the `ParamSet` as a `data.table` using `as.data.table()`. This makes it easy to subset parameters on certain conditions and aggregate information about them, using the variety of methods provided by `data.table`:

```r
as.data.table(ps)
```

#### Type / Range Checking

Similar to individual `Param`s, the `ParamSet` provides `$test()`, `$check()` and `$assert()` functions that allow for type and range checking of parameters. Their argument must be a named list with values that are checked against the respective parameters. It is possible to check only a subset of parameters:

```r
ps$check(list(A = TRUE, B = 0, E = identity))
ps$check(list(A = 1))
ps$check(list(Z = 1))
```

#### Values in a ParamSet

Although a `ParamSet` fundamentally represents a value space, it also has a slot `$values` that can contain a point within that space. This is useful because many things that define a parameter space need similar operations (like parameter checking) that can be simplified. The `$values` slot contains a named list that is always checked against the parameter constraints. When trying to set parameter values, e.g. for mlr3 `Learner`s, it is the `$values` slot of their `$param_set` that needs to be used.

```r
ps$values = list(A = TRUE, B = 0)
ps$values$B = 1
print(ps$values)
```

The parameter constraints are automatically checked:

```r
ps$values$B = 100
```

#### Dependencies

It is often the case that certain parameters are irrelevant or should not be given depending on values of other parameters. An example would be a parameter that switches a certain algorithm feature (for example regularization) on or off, combined with another parameter that controls the behavior of that feature (e.g. a regularization parameter). The second parameter would be said to depend on the first parameter having the value TRUE.

A dependency can be added using the `$add_dep()` method, which takes both the ids of the "depender" and "dependee" parameters as well as a `Condition` object. The `Condition` object represents the check to be performed on the "dependee". Currently it can be created using `CondEqual$new()` and `CondAnyOf$new()`. Multiple dependencies can be added, and parameters that depend on others can again be depended on, as long as no cyclic dependencies are introduced.

The consequences of dependencies are twofold: For one, the `$check()`, `$test()` and `$assert()` functions will not accept the presence of a parameter if its dependency is not met. Furthermore, when sampling or creating grid designs from a `ParamSet`, the dependencies will be respected (see Parameter Sampling, in particular the Hierarchical Sampler).

The following example makes parameter `D` depend on parameter `A` being `FALSE`, and parameter `B` depend on parameter `D` being `"x"` or `"y"`. This introduces an implicit dependency of `B` on `A` being `FALSE` as well, because `D` does not take any value if `A` is `TRUE`.

```r
ps$add_dep("D", "A", CondEqual$new(FALSE))
ps$add_dep("B", "D", CondAnyOf$new(c("x", "y")))
ps$check(list(A = FALSE, D = "x", B = 1))          # OK: all dependencies met
ps$check(list(A = FALSE, D = "z", B = 1))          # B's dependency is not met
ps$check(list(A = FALSE, B = 1))                   # B's dependency is not met
ps$check(list(A = FALSE, D = "z"))                 # OK: B is absent
ps$check(list(A = TRUE))                           # OK: neither B nor D present
ps$check(list(A = TRUE, D = "x", B = 1))           # D's dependency is not met
ps$check(list(A = TRUE, B = 1))                    # B's dependency is not met
```

Internally, the dependencies are represented as a `data.table`, which can be accessed via the `$deps` slot. This `data.table` can even be mutated, e.g. to remove dependencies. However, no sanity checks are performed when the `$deps` slot is changed this way, so caution is advised:

```r
ps$deps
```
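As a hedged sketch (assuming the columns shown in the output above), a dependency could be removed by subsetting `$deps`, keeping in mind that no sanity checks are performed:

```r
# drop all dependencies in which parameter "B" is the depender
ps$deps = ps$deps[id != "B"]
```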

#### Vector Parameters

Unlike in the old `ParamHelpers` package, there are no longer vectorial parameters in paradox. Instead, it is now possible to create multiple copies of a single parameter using the `$rep()` function. This creates a `ParamSet` consisting of multiple copies of the parameter, which can then (optionally) be added to another `ParamSet`:

```r
ps2d = ParamDbl$new("x", lower = 0, upper = 1)$rep(2)
print(ps2d)
ps$add(ps2d)
print(ps)
```

It is also possible to use a `ParamUty` to accept vectorial parameters, which also works for parameters of variable length. A `ParamSet` containing a `ParamUty` can be used for parameter checking, but not for sampling. To sample values for a method that needs a vectorial parameter, it is advised to use a parameter transformation function that creates a vector from atomic values, as in the sketch below (see also the section on Parameter Transformation).
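A minimal sketch of this idea; the parameter names `v1` and `v2` and the assembled parameter `v` are hypothetical:

```r
# sample two atomic values, then assemble them into one vector-valued parameter
psvec = ParamSet$new(list(
  ParamDbl$new("v1", lower = 0, upper = 1),
  ParamDbl$new("v2", lower = 0, upper = 1)
))
psvec$trafo = function(x, param_set) {
  list(v = c(x$v1, x$v2))  # `v` replaces the atomic values
}
generate_design_random(psvec, 2)$transpose()
```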

Assembling a vector from repeated parameters is aided by the parameter's `$tags`: Parameters that were generated by the `$rep()` command automatically get tagged as belonging to a group of repeated parameters:

```r
ps$tags
```

### Parameter Sampling

It is often useful to have a list of possible parameter values that can be systematically iterated through, for example to find parameter values for which an algorithm performs particularly well (tuning). paradox offers a variety of functions that allow creating evenly-spaced parameter values in a "grid" design as well as random sampling. In the latter case, it is possible to influence the sampling distribution in more or less fine detail.

A point to always keep in mind while sampling is that only numerical and factorial parameters that are bounded can be sampled from, i.e. not `ParamUty`. Furthermore, for most samplers `ParamInt` and `ParamDbl` must have finite lower and upper bounds.
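Whether sampling is possible can be checked upfront via the `$is_bounded` slot mentioned earlier, as in this small sketch:

```r
# a ParamDbl without explicit bounds defaults to (-Inf, Inf)
parUnbounded = ParamDbl$new("u")
parUnbounded$is_bounded  # FALSE: cannot be sampled from
```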

#### Parameter Designs

Functions that sample the parameter space fundamentally return an object of the `Design` class. These objects contain the sampled data as a `data.table` in the `$data` slot, and also offer conversion to a list of parameter values using the `$transpose()` function.

#### Grid Design

The `generate_design_grid()` function is used to create grid designs that contain all combinations of parameter values: all possible values for `ParamLgl` and `ParamFct`, and values with a given resolution for `ParamInt` and `ParamDbl`. The resolution can be given for all numeric parameters, or for specific named parameters through the `param_resolutions` argument:

```r
design = generate_design_grid(psSmall, 2)
print(design)
generate_design_grid(psSmall, param_resolutions = c(B = 1, C = 2))
```

#### Random Sampling

paradox offers different methods for random sampling, which vary in the degree to which they can be configured. The easiest way to get a uniformly random sample of parameters is `generate_design_random()`. It is also possible to create "latin hypercube" sampled parameter values using `generate_design_lhs()`, which utilizes the `r cran_pkg("lhs")` package. LHS sampling creates low-discrepancy sampled values that cover the parameter space more evenly than purely random values:

```r
pvrand = generate_design_random(ps2d, 500)
pvlhs = generate_design_lhs(ps2d, 500)
par(mar = c(4, 4, 2, 1))
plot(pvrand$data, main = "'random' design", xlim = c(0, 1), ylim = c(0, 1))
plot(pvlhs$data, main = "'lhs' design", xlim = c(0, 1), ylim = c(0, 1))
```

#### Generalized Sampling: The Sampler Class

It may sometimes be desirable to configure parameter sampling in more detail. paradox uses the `Sampler` abstract base class for sampling, which has many different sub-classes that can be parameterized and combined to control the sampling process. It is even possible to create further sub-classes of the `Sampler` class (or of any of its sub-classes) for even more possibilities.

Every `Sampler` object has a `$sample()` function, which takes one argument, the number of instances to sample, and returns a `Design` object.

#### 1D-Samplers

There is a variety of samplers that sample values for a single parameter. These are `Sampler1DUnif` (uniform sampling), `Sampler1DCateg` (sampling for categorical parameters), `Sampler1DNormal` (normally distributed sampling, truncated at parameter bounds), and `Sampler1DRfun` (arbitrary 1D sampling, given a random function). They are initialized with a single `Param` and can then be used to sample values:

```r
sampA = Sampler1DCateg$new(parA)
sampA$sample(5)
```

#### Hierarchical Sampler

The `SamplerHierarchical` sampler is an auxiliary sampler that combines many 1D samplers to get a joint distribution. Its name "hierarchical" indicates that it respects parameter dependencies: parameters only get sampled when their dependencies are met.

The following example shows how this works: The `Int` parameter `B` depends on the `Lgl` parameter `A` being `TRUE`. `A` is sampled to be `TRUE` in about half the cases, in which case `B` takes a value between 0 and 10. In the cases where `A` is `FALSE`, `B` is set to `NA`:

```r
psSmall$add_dep("B", "A", CondEqual$new(TRUE))
sampH = SamplerHierarchical$new(psSmall,
  list(Sampler1DCateg$new(parA),
    Sampler1DUnif$new(parB),
    Sampler1DUnif$new(parC))
)
sampled = sampH$sample(1000)
table(sampled$data[, c("A", "B")], useNA = "ifany")
```

#### Joint Sampler

Another way of combining samplers is the `SamplerJointIndep`. `SamplerJointIndep` also makes it possible to combine samplers that are not 1D. However, `SamplerJointIndep` currently cannot handle `ParamSet`s with dependencies:

```r
sampJ = SamplerJointIndep$new(
  list(Sampler1DUnif$new(ParamDbl$new("x", 0, 1)),
    Sampler1DUnif$new(ParamDbl$new("y", 0, 1)))
)
sampJ$sample(5)
```

#### SamplerUnif

The sampler used in `generate_design_random()` is the `SamplerUnif` sampler, which corresponds to a `SamplerHierarchical` of `Sampler1DUnif` samplers for all parameters.
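A short sketch showing the equivalence on `psSmall` from above (which includes a dependency):

```r
sampU = SamplerUnif$new(psSmall)
sampU$sample(5)$data

# the shortcut used earlier
generate_design_random(psSmall, 5)$data
```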

### Parameter Transformation

While the different `Sampler`s allow for a wide specification of parameter distributions, there are cases where the simplest way of getting a desired distribution is to sample parameters from a simple distribution (such as the uniform distribution) and then transform them. This can be done by assigning a function to the `$trafo` slot of a `ParamSet`. The `$trafo` function is called with two arguments: the list of sampled parameter values (`x`) and the `ParamSet` itself (`param_set`).

The `$trafo` function must return a list of transformed parameter values.

The transformation is performed when calling the `$transpose()` function of the `Design` object returned by a `Sampler`, with the argument `trafo = TRUE` (the default). The following, for example, creates a parameter that is exponentially distributed:

```r
psexp = ParamSet$new(list(ParamDbl$new("par", 0, 1)))
psexp$trafo = function(x, param_set) {
  x$par = -log(x$par)
  x
}
design = generate_design_random(psexp, 2)
print(design)
design$transpose()  # trafo is TRUE
```

Compare this to `$transpose()` without transformation:

```r
design$transpose(trafo = FALSE)
```

#### Transformation between Types

Usually the design created with one `ParamSet` is then used to configure other objects that themselves have a `ParamSet` which defines the values they take. The `ParamSet`s which can be used for random sampling, however, are restricted in some ways: They must have finite bounds, and they may not contain "untyped" (`ParamUty`) parameters. `$trafo` provides the glue for these situations. There are relatively few constraints on the `$trafo` function's return value, so it is possible to return values that have different bounds or even types than the original `ParamSet`. It is even possible to remove some parameters and add new ones.

Suppose, for example, that a certain method requires a function as a parameter, say a function that summarizes its data in a certain way. The user can pass functions like `median()` or `mean()`, but could also pass quantiles or something completely different. This method would probably use the following `ParamSet`:

```r
methodPS = ParamSet$new(
  list(
    ParamUty$new("fun",
      custom_check = function(x) checkmate::checkFunction(x, nargs = 1))
  )
)
print(methodPS)
```

If one wanted to sample values for this method's parameter, using one of four functions, a way to do this would be:

```r
samplingPS = ParamSet$new(
  list(
    ParamFct$new("fun", c("mean", "median", "min", "max"))
  )
)

samplingPS$trafo = function(x, param_set) {
  # x$fun is a `character(1)`,
  # in particular one of 'mean', 'median', 'min', 'max'.
  # We want to turn it into a function!
  x$fun = get(x$fun, mode = "function")
  x
}
design = generate_design_random(samplingPS, 2)
print(design)
```

Note that the `Design` only contains the column `"fun"` as a `character` column. To get a single value as a function, the `$transpose()` function is used:

```r
xvals = design$transpose()
print(xvals[[1]])
```

We can now check that it fits the requirements set by `methodPS`, and that `fun` is in fact a function:

```r
methodPS$check(xvals[[1]])
xvals[[1]]$fun(1:10)
```

Imagine now that a different kind of parametrization of the function is desired: The user wants to supply a function that selects a certain quantile, where the quantile is set by a parameter. In that case the `$trafo` function could generate the function in a different way. For interpretability, the parameter is called `"quantile"` before transformation, and the `"fun"` parameter is generated on the fly:

```r
samplingPS2 = ParamSet$new(
  list(
    ParamDbl$new("quantile", 0, 1)
  )
)

samplingPS2$trafo = function(x, param_set) {
  # x$quantile is a `numeric(1)` between 0 and 1.
  # We want to turn it into a function!
  list(fun = function(input) quantile(input, x$quantile))
}
design = generate_design_random(samplingPS2, 2)
print(design)
```

The `Design` now contains the column `"quantile"` that will be used by the `$transpose()` function to create the `fun` parameter. We also check that it fits the requirements set by `methodPS`, and that it is a function:

```r
xvals = design$transpose()
print(xvals[[1]])
methodPS$check(xvals[[1]])
xvals[[1]]$fun(1:10)
```

## Logging and Verbosity {#logging}

We use the `r cran_pkg("lgr")` package for logging and progress output.

Because lgr comes with its own exhaustive vignette, we only briefly show how to change the most important settings related to logging in `r mlr_pkg("mlr3")`.

### Available logging levels

lgr comes with numeric thresholds which correspond to verbosity levels of the logging. For `r mlr_pkg("mlr3")` the default is set to 400, which corresponds to the level `"info"`. The following levels are available:

```r
library("lgr")
getOption("lgr.log_levels")
```

### Global Setting

lgr comes with a global option called `"lgr.default_threshold"` which can be set via `options()`. You can set a specific level in your `.Rprofile`, which is then used for all packages that use lgr. This approach may not be desirable if you only want to change the logging level for `r mlr_pkg("mlr3")`.
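For example, a sketch of setting the global threshold; level 300 corresponds to `"warn"` in the level table above:

```r
# reduce logging of all lgr-using packages to warnings and above
options("lgr.default_threshold" = 300)
```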

### Changing mlr3 logging levels

To change the setting for `r mlr_pkg("mlr3")` only, you need to change the threshold of the `r mlr_pkg("mlr3")` logger like this:

```r
lgr::get_logger("mlr3")$set_threshold("<level>")
```

Remember that this change only applies to the current R session.
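For example, to only show warnings and errors from mlr3 in the current session:

```r
lgr::get_logger("mlr3")$set_threshold("warn")
```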

## mlr -> mlr3 Transition Guide {#transition}

In case you have already worked with `r mlr_pkg("mlr")`, you may want to get a quick start with `r mlr_pkg("mlr3")` by looking up the equivalent of an `r mlr_pkg("mlr")` element in `r mlr_pkg("mlr3")`. For this, you can use the following table. The table is not complete, but should give you an overview of how `r mlr_pkg("mlr3")` is organized.


```r
t = as.data.frame(mlr3misc::rowwise_table(
  ~Category,          ~mlr,                               ~mlr3,                 ~Note,
  "General / Helper", "getCacheDir() / deleteCacheDir()", "Not yet implemented", "---",
  "General / Helper", "configureMlr()",                   "---",                 "---",
  "General / Helper", "getMlrOptions()",                  "---",                 "---",
  "General / Helper", "createDummyFeatures()",            "Not yet implemented", "mlr3pipelines",
  "General / Helper", "crossover()",                      "---",                 "---",
  "General / Helper", "downsample()",                     "Not yet implemented", "---",
  "General / Helper", "generateCalibrationData()",        "Not yet implemented", "---",
  "General / Helper", "generateCritDifferencesData()",    "Not yet implemented", "---",
  "General / Helper", "generateLearningCurveData()",      "Not yet implemented", "mlr3viz",
  "General / Helper", "generatePartialDependenceData()",  "Not yet implemented", "mlr3viz",
  "General / Helper", "generateThreshVsPerfData()",       "Not yet implemented", "mlr3viz",
  "General / Helper", "getCaretParamSet()",               "Not used anymore",    "---",
  "General / Helper", "reimpute() / impute()",            "Not yet implemented", "mlr3pipelines",
  "General / Helper", "fn() / fnr() / fp() / fpr()",      "???",                 "",
  "General / Helper", "tn() / tnr() / tp() / tpr()",      "???",                 "",
  "General / Helper", "summarizeColumns()",               "???",                 "",
  "General / Helper", "summarizeLevels()",                "???",                 "",

  "Task", "Task",                                 "mlr_tasks / Task",                                      "---",
  "Task", "SurvTask",                             "TaskSurv",                                              "mlr3proba",
  "Task", "ClusterTask",                          "mlr_tasks",                                             "---",
  "Task", "MultilabelTask",                       "mlr_tasks",                                             "---",
  "Task", "SpatialTask",                          "Not yet implemented",                                   "mlr3spatiotemporal",
  "Task", "Example tasks (iris.task,mtcars.task)","mlr_tasks$get('iris') / tsk('iris')",                   "---",
  "Task", "convertMLBenchObjToTask()",            "Not yet implemented",                                   "mlr3",
  "Task", "dropFeatures()",                       "Task$select()",                                         "---",
  "Task", "getTaskCosts()",                       "Not yet implemented",                                   "---",
  "Task", "getTaskData()",                        "Task$data()",                                           "---",
  "Task", "getTaskDesc() / getTaskDescription()", "Task$print()",                                          "---",
  "Task", "getTaskFeatureNames()",                "Task$feature_names",                                    "---",
  "Task", "getTaskFormula()",                     "Task$formula",                                          "---",
  "Task", "getTaskId()",                          "Task$id",                                               "---",
  "Task", "getTaskNFeats()",                      "length(Task$feature_names)",                            "---",
  "Task", "getTaskSize()",                        "Task$nrow()",                                           "---",
  "Task", "getTaskTargetNames()",                 "Task$target_names",                                     "---",
  "Task", "getTaskTargets()",                     "as.data.table(Task)[,Task$feature_names,with = FALSE]", "---",
  "Task", "getTaskType()",                        "Task$task_type",                                        "---",
  "Task", "oversample() / undersample()",         "",                                                      "---",

  "Learner", "helpLearner()",                              "Not yet implemented",  "---",
  "Learner", "helpLearnerParam()",                         "Not yet implemented",  "---",
  "Learner", "getLearnerId()",                             "Learner$id",           "---",
  "Learner", "setLearnerId()",                             "Learner$id",           "---",
  "Learner", "getLearnerModel()",                          "Learner$model",        "---",
  "Learner", "getLearnerNote()",                           "Not used anymore",     "---",
  "Learner", "getLearnerPackages()",                       "Learner$packages",     "---",
  "Learner", "getLearnerParVals() / getLearnerParamSet()", "Learner$param_set",    "---",
  "Learner", "getLearnerPredictType()",                    "Learner$predict_type", "---",
  "Learner", "getLearnerShortName()",                      "Learner$predict_type", "---",
  "Learner", "getLearnerType()",                           "Learner$Type",         "---",
  "Learner", "setPredictType()",                           "Learner$Type",         "---",
  "Learner", "getLearnerProperties",                       "???",                  "---",
  "Learner", "getParamSet()",                              "Learner$param_set",    "---",
  "Learner", "trainLearner()",                             "Learner$train()",      "---",
  "Learner", "predictLearner()",                           "Learner$predict()",    "---",
  "Learner", "makeRLearner*()",                            "Learner",              "---",
  "Learner", "generateLearningCurveData()",                "Not yet implemented",  "mlr3viz",
  "Learner", "FailureModel",                               "---",                  "---",
  "Learner", "getFailureModelDump()",                      "---",                  "---",
  "Learner", "getFailureModelMsg()",                       "---",                  "---",
  "Learner", "isFailureModel()",                           "---",                  "---",
  "Learner", "makeLearner() / makeLearners()",             "???",                  "---",

  "Train/Predict/Resample", "train()",                                                                                  "Experiment$train()",   "---",
  "Train/Predict/Resample", "predict()",                                                                                "Experiment$predict()", "---",
  "Train/Predict/Resample", "performance()",                                                                            "Experiment$score()",   "---",
  "Train/Predict/Resample", "makeResampleDesc()",                                                                       "Resampling",           "mlr_resamplings",
  "Train/Predict/Resample", "resample()",                                                                               "resample()",           "---",
  "Train/Predict/Resample", "ResamplePrediction",                                                                       "ResampleResult",       "---",
  "Train/Predict/Resample", "Aggregation / makeAggregation",                                                            "Not yet implemented",  "---",
  "Train/Predict/Resample", "asROCRPrediction()",                                                                       "Not yet implemented",  "---",
  "Train/Predict/Resample", "ConfusionMatrix / getConfMatrix() / calculateConfusionMatrix()",                           "Not yet implemented",  "---",
  "Train/Predict/Resample", "calculateROCMeasures()",                                                                   "Not yet implemented",  "---",
  "Train/Predict/Resample", "estimateRelativeOverfitting()",                                                            "Not yet implemented",  "---",
  "Train/Predict/Resample", "estimateResidualVariance()",                                                               "Not yet implemented",  "---",
  "Train/Predict/Resample", "getDefaultMeasure()",                                                                      "",                     "---",
  "Train/Predict/Resample", "getMeasureProperties()",                                                                   "???",                  "---",
  "Train/Predict/Resample", "getPredictionResponse() / getPredictionSE() / getPredictionTruth()",                       "???",                  "---",
  "Train/Predict/Resample", "getPredictionDump()",                                                                      "???",                  "---",
  "Train/Predict/Resample", "getPredictionTaskDesc()",                                                                  "???",                  "---",
  "Train/Predict/Resample", "getRRDump()",                                                                              "???",                  "---",
  "Train/Predict/Resample", "getRRPredictionList()",                                                                    "???",                  "---",
  "Train/Predict/Resample", "getRRPredictions()",                                                                       "ResampleResult$prediction","---",
  "Train/Predict/Resample", "getRRTaskDesc() / getRRTaskDescription()",                                                 "ResampleResult$task$print()","---",

  "Benchmark", "benchmark()",                                                       "benchmark()",                             "---",
  "Benchmark", "batchmark() / reduceBatchmarkResults()",                            "not used anymore ",                       "---",
  "Benchmark", "BenchmarkResult",                                                   "BenchmarkResult",                         "---",
  "Benchmark", "convertBMRToRankMatrix()",                                          "Not yet implemented",                     "---",
  "Benchmark", "convertMLBenchObjToTask()",                                         "Not yet implemented",                     "---",
  "Benchmark", "getBMRAggrPerformances()",                                          "BenchmarkResult$aggregated()",            "---",
  "Benchmark", "getBMRFeatSelResults()",                                            "Not yet implemented",                     "mlr3filters",
  "Benchmark", "getBMRFilteredFeatures()",                                          "Not yet implemented",                     "mlr3filters",
  "Benchmark", "getBMRLearners() / getBMRLearnerIds() / getBMRLearnerShortNames()", "BenchmarkResult$learners",                "---",
  "Benchmark", "getBMRMeasures() / getBMRMeasureIds()",                             "BenchmarkResult$measures",                "---",
  "Benchmark", "getBMRModels()",                                                    "BenchmarkResult$data$learner[[1]]$model", "---",
  "Benchmark", "getBMRPerformances()",                                              "BenchmarkResult$data$performance",        "---",
  "Benchmark", "getBMRTaskDescriptions() / getBMRTaskDescs() / getBMRTaskIds()",    "BenchmarkResult$tasks",                   "---",
  "Benchmark", "getBMRTuneResults()",                                               "Not yet implemented",                     "---",
  "Benchmark", "getBMRPredictions()",                                               "Not yet implemented",                     "---",
  "Benchmark", "friedmanTestBMR()",                                                 "Not yet implemented",                     "---",
  "Benchmark", "mergeBenchmarkResults()",                                           "BenchmarkResult$combine()",               "---",
  "Benchmark", "plotBMRBoxplots()",                                                 "Not yet implemented",                     "mlr3viz",
  "Benchmark", "plotBMRRanksAsBarChart()",                                          "Not yet implemented",                     "mlr3viz",
  "Benchmark", "plotBMRSummary()",                                                  "Not yet implemented",                     "mlr3viz",
  "Benchmark", "plotResiduals()",                                                   "Not yet implemented",                     "mlr3viz",

  "Parameter Specification", "ParamHelpers::makeNumericParam()",        "ParamDbl$new()",          "paradox",
  "Parameter Specification", "ParamHelpers::makeNumericVectorParam()",  "ParamDbl$new()",          "paradox",
  "Parameter Specification", "ParamHelpers::makeIntegerParam()",        "paradox::ParamInt$new()", "paradox",
  "Parameter Specification", "ParamHelpers::makeIntegerVectorParam()",  "paradox::ParamInt$new()", "paradox",
  "Parameter Specification", "ParamHelpers::makeDiscreteParam()",       "paradox::ParamFct$new()", "paradox",
  "Parameter Specification", "ParamHelpers::makeDiscreteVectorParam()", "paradox::ParamFct$new()", "paradox",
  "Parameter Specification", "ParamHelpers::makeLogicalParam()",        "paradox::ParamLgl$new()", "paradox",
  "Parameter Specification", "ParamHelpers::makeLogicalVectorParam()",  "paradox::ParamLgl$new()", "paradox",

  "Preprocessing", "---", "---", "---",
  "Preprocessing", "---", "---", "---",

  "Feature Selection", "makeFeatSelControlExhaustive()", "Not yet implemented", "mlr3filters",
  "Feature Selection", "makeFeatSelControlRandom()",     "Not yet implemented", "mlr3filters",
  "Feature Selection", "makeFeatSelControlSequential()", "Not yet implemented", "mlr3filters",
  "Feature Selection", "makeFeatSelControlGA()",         "Not yet implemented", "mlr3filters",
  "Feature Selection", "makeFilter()",                   "Filter$new()",        "mlr3filters",
  "Feature Selection", "FeatSelResult",                  "Not yet implemented", "mlr3filters",
  "Feature Selection", "listFilterMethods()",            "mlr_filters",         "mlr3filters",
  "Feature Selection", "analyzeFeatSelResult()",         "Not yet implemented", "mlr3filters",
  "Feature Selection", "getBMRFeatSelResults()",         "Not yet implemented", "mlr3filters",
  "Feature Selection", "getBMRFilteredFeatures()",       "Not yet implemented", "mlr3filters",
  "Feature Selection", "getFeatSelResult()",             "Not yet implemented", "mlr3filters",
  "Feature Selection", "getFeatureImportance()",         "Not yet implemented", "mlr3filters",
  "Feature Selection", "getFilteredFeatures()",          "Not yet implemented", "mlr3filters",
  "Feature Selection", "makeFeatSelWrapper()",           "Not used anymore",    "mlr3filters",
  "Feature Selection", "makeFilterWrapper()",            "Not used anymore",    "mlr3filters",
  "Feature Selection", "getResamplingIndices()",         "Not yet implemented", "",
  "Feature Selection", "selectFeatures()",               "Not yet implemented", "mlr3filters",
  "Feature Selection", "filterFeatures()",               "Filter$filter_*()",   "mlr3filters",
  "Feature Selection", "generateFilterValuesData()",     "Filter$calculate()",  "mlr3filters",
  "Feature Selection", "",                               "",                    "",

  "Tuning", "getTuneResult()",                 "Not yet implemented", "mlr3tuning",
  "Tuning", "getTuneResultOptPath()",          "Not yet implemented", "mlr3tuning",
  "Tuning", "makeTuneControl*()",              "Tuner",               "mlr3tuning",
  "Tuning", "makeTuneMultiCritControl*()",     "Tuner",               "mlr3tuning",

  "Parallelization", "ParallelMap::parallelStart*(), parallelMap::parallelStop()", "future::plan() / future", "",
  "Parallelization", "",                             "",                           "",

  "Plotting", "plotBMRBoxplots()",         "Not yet implemented", "mlr3viz",
  "Plotting", "plotBMRRanksAsBarChart()",  "Not yet implemented", "mlr3viz",
  "Plotting", "plotBMRSummary()",          "Not yet implemented", "mlr3viz",
  "Plotting", "plotCalibration()",         "Not yet implemented", "mlr3viz",
  "Plotting", "plotCritDifferences()",     "Not yet implemented", "mlr3viz",
  "Plotting", "plotFilterValues()",        "Not yet implemented", "mlr3viz",
  "Plotting", "plotHyperParsEffect()",     "Not yet implemented", "mlr3viz",
  "Plotting", "plotLearnerPrediction()",   "Not yet implemented", "mlr3viz",
  "Plotting", "plotLearningCurve()",       "Not yet implemented", "mlr3viz",
  "Plotting", "plotPartialDependence()",   "Not yet implemented", "mlr3viz",
  "Plotting", "plotResiduals()",           "Not yet implemented", "mlr3viz",
  "Plotting", "plotROCCurves()",           "Not yet implemented", "mlr3viz",
  "Plotting", "plotThreshVsPerf()",        "Not yet implemented", "mlr3viz",
  "Plotting", "plotTuneMultiCritResult()", "Not yet implemented", "mlr3viz",

  "FDA", "extractFDAFPCA()",                 "Not yet implemented",                 "mlr3fda",
  "FDA", "extractFDAFourier()",              "Not yet implemented",                 "mlr3fda",
  "FDA", "extractFDAMultiResFeatures()",     "Not yet implemented",                 "mlr3fda",
  "FDA", "extractFDAWavelets()",             "Not yet implemented",                 "mlr3fda"
))
t = knitr::kable(t)
kableExtra::collapse_rows(t, columns = 1) %>%
  kableExtra::kable_styling(bootstrap_options = "basic", full_width = TRUE,
    font_size = 13)
```

