resample: Resampling schemes
In emil: Evaluation of Modeling without Information Leakage


Performance evaluation and parameter tuning use resampling methods to estimate the performance of models. These are defined by resampling schemes, which are data frames where each column corresponds to a division of the data set into mutually exclusive training and test sets. Repeated holdout and cross-validation are two methods for creating such schemes.
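To make the data frame layout concrete, the following base R sketch builds a small repeated holdout scheme by hand. It does not use emil: the helper `toy_holdout_scheme` and the column names are made up for illustration, and a scheme returned by `resample` may carry additional attributes that are not reproduced here.

```r
# Hypothetical helper (not part of emil): build a repeated holdout scheme
# as a data frame of logicals, one column per division of the data.
toy_holdout_scheme <- function(n, test_fraction = 0.5, nfold = 5) {
  n_test <- round(n * test_fraction)
  folds <- lapply(seq_len(nfold), function(i) {
    in_training <- rep(TRUE, n)
    in_training[sample.int(n, n_test)] <- FALSE  # FALSE marks the test set
    in_training
  })
  names(folds) <- paste0("fold", seq_len(nfold))
  as.data.frame(folds)
}

set.seed(1)
scheme <- toy_holdout_scheme(n = 12, test_fraction = 1/3, nfold = 3)
print(scheme)
```

Each row is an observation and each column is one division: within a column every observation is either in the training set (`TRUE`) or in the test set (`FALSE`), never both.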

resample(method, y, ..., subset = TRUE)

resample_holdout(y, test_fraction = 0.5, nfold = 5,
  balanced = is.factor(y), subset)

resample_crossvalidation(y, nfold = 5, nrepeat = 5,
  balanced = is.factor(y), subset)

resample_bootstrap(y, nfold = 10, fit_fraction = if (replace) 1 else 0.632,
  replace = TRUE, balanced = is.factor(y), subset)

`method`	The resampling method to use, e.g. `"holdout"` or `"crossvalidation"`.
`y`	Observations to be divided.
`...`	Sent to the method-specific function, e.g. `resample_holdout`.
`subset`	Which objects in `y` are to be divided and which are to be part of neither set. If `subset` is a resampling scheme, a list of inner cross-validation schemes will be returned.
`test_fraction`	Fraction of objects to hold out (0 < test_fraction < 1).
`nfold`	Number of folds.
`balanced`	Whether the sets should be balanced or not, i.e. if the class ratio over the sets should be kept constant (as far as possible).
`nrepeat`	Number of fold sets to generate.
`fit_fraction`	The size of the training set relative to the entire data set.
`replace`	Whether to sample with replacement.
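The effect of `balanced` can be illustrated with a stratified holdout split written in base R. This is only a sketch of the general idea of keeping the class ratio constant; the helper `balanced_holdout` is hypothetical and is not claimed to match how emil implements balancing internally.

```r
# Hypothetical helper (not part of emil): stratified holdout assignment.
# Sampling the test set within each class keeps the class ratio roughly
# equal in the training and test sets.
balanced_holdout <- function(y, test_fraction = 0.5) {
  stopifnot(is.factor(y))
  in_training <- rep(TRUE, length(y))
  for (idx in split(seq_along(y), y)) {
    n_test <- round(length(idx) * test_fraction)
    # Index into idx rather than calling sample(idx, ...) directly, since
    # sample() treats a length-one numeric as the range 1:idx.
    in_training[idx[sample.int(length(idx), n_test)]] <- FALSE
  }
  in_training
}

set.seed(1)
y <- factor(rep(c("a", "b"), times = c(30, 10)))
in_training <- balanced_holdout(y, test_fraction = 0.5)
print(table(class = y, training = in_training))
```

With 30 observations of class `a` and 10 of class `b`, both the training and the test set keep the 3:1 class ratio, which an unstratified draw would only achieve on average.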

Note that when setting up analyses, the user should not call resample_holdout or resample_crossvalidation directly, as resample performs additional necessary processing of the scheme.

Resampling schemes can be visualized in a human-readable form with the image function.

Functions for generating custom resampling schemes should be implemented as follows and then called by resample("myMethod", ...):

resample_myMethod <- function(y, ..., subset)

y: Response vector.
...: Method-specific attributes.
subset: Indexes of observations to be excluded from the resampling.

The function should return a list of the following elements:

folds: A data frame with the folds of the scheme that conforms to the description in the 'Value' section below.
parameter: A list with the parameters necessary to generate such a resampling scheme. These are required when creating the subschemes used for parameter tuning, see subresample.
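A minimal sketch of such a function is given below, assuming an invented method in which every `nfold`-th observation forms the test set, with a different offset in each fold. The `nfold` argument and the column names are illustrative choices, `subset` is accepted but ignored because its exact handling is not specified here, and whether emil imposes further requirements on the returned data frame is not covered by this sketch.

```r
# Sketch of a custom method (illustrative only): in fold i, observations
# i, i + nfold, i + 2 * nfold, ... form the test set.
resample_myMethod <- function(y, nfold = 3, ..., subset = TRUE) {
  n <- length(y)
  folds <- lapply(seq_len(nfold), function(i) {
    (seq_len(n) - i) %% nfold != 0   # TRUE = training, FALSE = test
  })
  names(folds) <- paste0("fold", seq_len(nfold))
  # Return the two elements described above.
  list(folds = as.data.frame(folds),
       parameter = list(nfold = nfold))
}

# Calling the function directly only inspects its return value; in an
# analysis it would be invoked through resample("myMethod", ...).
res <- resample_myMethod(1:9)
str(res)
```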

A data frame defining a resampling scheme. TRUE or a positive integer codes for training set and FALSE or 0 codes for test set. Positive integers > 1 code for multiple copies of an observation in the training set. NA codes for neither training nor test set and is used to exclude observations from the analysis altogether.
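The coding can be checked on a single hand-written column. The two decoder functions below are hypothetical helpers written for this illustration, not emil functions; in practice `index_fit`, listed under See Also, is the place to look for the package's own conversion.

```r
# One hand-written column of a scheme covering all the codes described above.
fold <- c(1, 0, 2, NA, 1, 0, 3)

# Hypothetical decoders (not emil functions): training indexes are repeated
# according to their count, test indexes are the zeros, NA is excluded.
training_index <- function(fold) {
  counts <- as.integer(fold)          # TRUE -> 1, FALSE -> 0
  counts[is.na(counts)] <- 0L
  rep(seq_along(fold), times = counts)
}
test_index <- function(fold) which(!is.na(fold) & fold == 0)

training_index(fold)  # 1 3 3 5 7 7 7
test_index(fold)      # 2 6
```

Observation 4 appears in neither result because it is coded `NA`, while observations 3 and 7 appear two and three times in the training indexes, as happens when sampling with replacement.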

Christofer Bäcklin

emil, subresample, image.resample, index_fit

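# Repeated holdout on a numeric response, holding out a third of the data
resample("holdout", 1:50, test_fraction=1/3)
# Repeated holdout on a factor response (balanced by default)
resample("holdout", factor(runif(60) >= .5))
# Repeated cross-validation on a factor response, then plot the scheme
y <- factor(runif(60) >= .5)
cv <- resample("crossvalidation", y)
image(cv, main="Cross-validation scheme")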