Resampling: Resampling Class

Description

This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task. Instantiation applies the strategy to the task and results in a fixed partition of the row_ids of the Task.

Predefined resamplings are stored in the mlr3misc::Dictionary mlr_resamplings, e.g. cv or bootstrap.

Format

R6::R6Class object.

Construction

Note: This object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

r = Resampling$new(id, param_set)

Fields

Methods

Stratification

All derived classes support stratified sampling.

First, the observations are divided into subpopulations based on one or more stratification variables (assumed to be discrete). The stratification variables must be included in the task, and the stratify parameter can be set to the respective column names. Setting stratify to TRUE is an alias for stratify = task$target_names. With multiple stratification variables, each combination of their values forms a stratum.

Second, the sampling is performed in each of the k subpopulations separately. Each subpopulation is divided into iter training sets and iter test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
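The two steps above can be sketched in base R. This is only an illustration of the idea, not the mlr3 implementation: the toy data, the names strata_var, iters, ratio and per_stratum, and the holdout-style split inside each stratum are assumptions made for this example.

```r
# Base R sketch of stratified resampling (illustrative, not mlr3 internals).
set.seed(1)

# Hypothetical toy data: 20 rows with an unbalanced discrete variable.
row_ids = 1:20
strata_var = rep(c("a", "b"), times = c(15, 5))

iters = 3    # number of resampling iterations
ratio = 0.8  # fraction of each stratum used for training

# Step 1: divide the row ids into subpopulations, one per stratum.
strata = split(row_ids, strata_var)

# Step 2: sample within each stratum separately, once per iteration.
per_stratum = lapply(strata, function(ids) {
  lapply(seq_len(iters), function(i) {
    train = ids[sample.int(length(ids), round(ratio * length(ids)))]
    list(train = train, test = setdiff(ids, train))
  })
})

# Merge the per-stratum sets that share the same iteration number.
train_set = function(i) {
  sort(unlist(lapply(per_stratum, function(s) s[[i]]$train), use.names = FALSE))
}
test_set = function(i) {
  sort(unlist(lapply(per_stratum, function(s) s[[i]]$test), use.names = FALSE))
}

# The proportions in a merged training set match those of the full data.
prop.table(table(strata_var))
prop.table(table(strata_var[train_set(1)]))
```

Because each stratum is split on its own, rare strata cannot vanish from a training or test set by chance, which is the main reason to stratify on an unbalanced target.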

Grouping / Blocking

All derived classes support grouping of observations.

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set. The grouping variable is assumed to be discrete and must be stored in the Task with column role "groups".

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
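A minimal base R sketch of this procedure follows. It is an illustration under stated assumptions rather than the mlr3 code: the toy data, the names groups, group_ids and fold, and the choice of 2-fold cross-validation over the groups are hypothetical.

```r
# Base R sketch of grouped (blocked) resampling (illustrative, not mlr3 internals).
set.seed(1)

# Hypothetical toy data: 12 rows belonging to 4 groups of 3 rows each.
row_ids = 1:12
groups = rep(c("g1", "g2", "g3", "g4"), each = 3)

# Step 1: resample the distinct group labels instead of the rows
# (here: assign each group to one of 2 cross-validation folds).
group_ids = unique(groups)
fold = sample(rep_len(1:2, length(group_ids)))

# Step 2: replace the group labels with the row ids of their members.
test_set = function(i) row_ids[groups %in% group_ids[fold == i]]
train_set = function(i) row_ids[groups %in% group_ids[fold != i]]

# No group appears on both sides of the split.
intersect(groups[train_set(1)], groups[test_set(1)])
```

Sampling the labels first and expanding to row ids afterwards is what guarantees that a block is never split between training and test data.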

See Also

Other Resampling: mlr_resamplings

Examples

library(mlr3)

r = rsmp("subsampling")

# Default parametrization
r$param_set$values

# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values

# Instantiate on iris task
task = tsk("iris")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
intersect(train_set, r$test_set(1))

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced

r = rsmp("subsampling", stratify = TRUE)
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion

mllg/mlr3 documentation built on Sept. 27, 2019, 9:38 a.m.