Resampling | R Documentation |
This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.
The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.
Resampling objects can be instantiated on a Task, which applies the strategy to the task and results in a fixed partition of the row_ids of the Task.
Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.
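As a minimal sketch of how the dictionary is typically queried (assuming the mlr3 package is installed and attached), a resampling can be retrieved either with the dictionary's `$get()` method or with the sugar function `rsmp()` that the examples on this page also use. The object names `cv_a` and `cv_b` are illustrative only.

```r
library(mlr3)

# Keys of all resamplings registered in the running session
mlr_resamplings$keys()

# Retrieve a resampling by key directly from the dictionary ...
cv_a = mlr_resamplings$get("cv")

# ... or via rsmp(), which also accepts hyperparameter values
cv_b = rsmp("cv", folds = 3)
cv_b$param_set$values$folds
```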
All derived classes support stratified sampling.
The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum".
In case of multiple stratification variables, each combination of their values forms a stratum.
First, the observations are divided into subpopulations based on one or multiple stratification variables (assumed to be discrete), cf. task$strata.
Second, the sampling is performed in each of the k subpopulations separately.
Each subgroup is divided into iter training sets and iter test sets by the derived Resampling.
These sets are merged based on their iteration number:
all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on.
The same is done for all test sets.
The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
Note that this procedure can lead to set sizes that are slightly different from those without stratification.
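The stratification procedure described above can be sketched as follows. This is an illustration only, assuming the penguins task (which needs the palmerpenguins package) and an arbitrary choice of the columns species and island as stratification variables; the holdout ratio of 0.8 is likewise arbitrary.

```r
library(mlr3)

task = tsk("penguins")
# Stratify on the combination of two discrete columns
task$col_roles$stratum = c("species", "island")

# One row per stratum: its size N and the row ids it contains
task$strata

r = rsmp("holdout", ratio = 0.8)
r$instantiate(task)

# Merged sets over all strata for iteration 1
train = r$train_set(1)
test = r$test_set(1)

# Class proportions in the full task and in the training set
prop.table(table(task$truth()))
prop.table(table(task$truth(train)))
```

Because the split is drawn within each stratum and then merged, the class proportions in the training set should stay close to those of the full task.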
All derived classes support grouping of observations.
The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".
Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.
The sampling is performed by the derived Resampling on the grouping variable.
Next, the grouping information is replaced with the respective row ids to generate training and test sets.
The sets can be accessed via $train_set(i) and $test_set(i), respectively.
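A short sketch of grouped resampling, under the assumption that the year column of the penguins task serves as grouping variable (an arbitrary choice for illustration). The check at the end confirms that no group ends up on both sides of a split.

```r
library(mlr3)

task = tsk("penguins")
# Assign the column "year" the role "group"; it is then no longer a feature
task$set_col_roles("year", roles = "group")

# The number of folds must not exceed the number of groups (3 years here)
r = rsmp("cv", folds = 3)
r$instantiate(task)

# task$groups is a table with the columns row_id and group
groups = task$groups
train_groups = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_groups = unique(groups$group[groups$row_id %in% r$test_set(1)])

# Groups present in both sets; expected to be empty
intersect(train_groups, test_groups)
```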
label
(character(1))
Label for this object.
Can be used in tables, plots and text output instead of the ID.
param_set
(paradox::ParamSet)
Set of hyperparameters.
instance
(any)
During instantiate(), the instance is stored in this slot in an arbitrary format.
Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).
It is advised to not work directly with the instance, but instead only use the getters $train_set() and $test_set().
task_hash
(character(1))
The hash of the Task which was passed to r$instantiate().
task_nrow
(integer(1))
The number of observations of the Task which was passed to r$instantiate().
duplicated_ids
(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set.
E.g., this is TRUE for Bootstrap, and FALSE for cross-validation.
Only used internally.
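To illustrate the difference, the following sketch compares the flag for bootstrap and cross-validation on the penguins task and checks the training sets for repeated row ids. The hyperparameter values are arbitrary.

```r
library(mlr3)

task = tsk("penguins")
boot = rsmp("bootstrap", repeats = 2)$instantiate(task)
cv = rsmp("cv", folds = 3)$instantiate(task)

# Flag set by the derived classes
boot$duplicated_ids
cv$duplicated_ids

# Bootstrap samples with replacement, so row ids usually repeat;
# cross-validation training sets contain each row id at most once
anyDuplicated(boot$train_set(1)) > 0
anyDuplicated(cv$train_set(1)) > 0
```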
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object.
Defaults to NA, but can be set by child classes.
id
(character(1))
Identifier of the object.
Used in tables, plots and text output.
is_instantiated
(logical(1))
Is TRUE if the resampling has been instantiated.
hash
(character(1))
Hash (unique identifier) for this object.
new()
Creates a new instance of this R6 class.
Resampling$new( id, param_set = ps(), duplicated_ids = FALSE, label = NA_character_, man = NA_character_ )
id
(character(1))
Identifier for the new instance.
param_set
(paradox::ParamSet)
Set of hyperparameters.
duplicated_ids
(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.
label
(character(1))
Label for the new instance.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object.
The referenced help page can be opened via method $help().
format()
Helper for print outputs.
Resampling$format(...)
...
(ignored).
print()
Printer.
Resampling$print(...)
...
(ignored).
help()
Opens the corresponding help page referenced by field $man.
Resampling$help()
instantiate()
Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.
Resampling$instantiate(task)
task
(Task)
Task used for instantiation.
Returns the object itself, but modified by reference.
You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
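The effect of modification by reference can be seen in a small sketch. It assumes the penguins task and a holdout resampling; the name `r_inst` is illustrative only. A deep clone is used here so that the parameter set is copied as well.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("holdout")
r$is_instantiated

# Clone first, then instantiate only the clone
r_inst = r$clone(deep = TRUE)
r_inst$instantiate(task)

# The original stays uninstantiated, the clone holds the split
r$is_instantiated
r_inst$is_instantiated

# The clone records which task it was instantiated on
r_inst$task_nrow == task$nrow
r_inst$task_hash == task$hash
```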
train_set()
Returns the row ids of the i-th training set.
Resampling$train_set(i)
i
(integer(1))
Iteration.
(integer()) of row ids.
test_set()
Returns the row ids of the i-th test set.
Resampling$test_set(i)
i
(integer(1))
Iteration.
(integer()) of row ids.
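A sketch of looping over all iterations with the two getters, assuming 5-fold cross-validation on the penguins task. The field `iters` (number of iterations) is not listed on this page; it is assumed here to be provided by the derived class.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 5)$instantiate(task)

# Sizes of the training and test set for every iteration
sizes = vapply(seq_len(r$iters), function(i) {
  c(train = length(r$train_set(i)), test = length(r$test_set(i)))
}, integer(2))
sizes

# In cross-validation every row is used for testing exactly once
all_test = unlist(lapply(seq_len(r$iters), function(i) r$test_set(i)))
setequal(all_test, task$row_ids) && anyDuplicated(all_test) == 0
```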
clone()
The objects of this class are cloneable with this method.
Resampling$clone(deep = FALSE)
deep
Whether to make a deep clone.
Chapter in the mlr3book: https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html#sec-resampling
Package mlr3spatiotempcv for spatio-temporal resamplings.
Dictionary of Resamplings: mlr_resamplings
as.data.table(mlr_resamplings) for a table of available Resamplings in the running session (depending on the loaded packages).
mlr3spatiotempcv for additional Resamplings for spatio-temporal tasks.
Other Resampling: mlr_resamplings, mlr_resamplings_bootstrap, mlr_resamplings_custom, mlr_resamplings_custom_cv, mlr_resamplings_cv, mlr_resamplings_holdout, mlr_resamplings_insample, mlr_resamplings_loo, mlr_resamplings_repeated_cv, mlr_resamplings_subsampling
library(mlr3)
r = rsmp("subsampling")
# Default parametrization
r$param_set$values
# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values
# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)
# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
intersect(train_set, r$test_set(1))
# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
task$col_roles$stratum = task$target_names
r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion