cvo_create_folds: Create a cvo (cross-validation object)


View source: R/cvo_create_folds.R

Description

Create a cross-validation object (cvo), which contains a list of indices for each fold of (repeated) k-fold cross-validation. Blocking and stratification options are available. See "Details" for more.

Usage

cvo_create_folds(
  data = NULL,
  stratify_by = NULL,
  block_by = NULL,
  folds = 5,
  times = 1,
  seeds = NA_real_,
  kind = NULL,
  mode = c("caret", "mlr")[1],
  returnTrain = c(TRUE, FALSE, "both")[1],
  predict = c("test", "train", "both")[1],
  k = folds
)

## S3 method for class 'cvo'
print(x, ...)

Arguments

data

A data frame that contains the variables named by the arguments block_by and stratify_by.

stratify_by

A vector, or the name of a factor variable in data, whose levels will be used for stratification. E.g., a vector with medical groups.

block_by

A vector, or the name of a variable in data, that contains identification codes/numbers (ID). These codes will be used for blocking.

folds, k

(integer)
A number of folds, default folds = 5.

times

(integer)
A number of repetitions for repeated cross-validation.

seeds

(NA_real_ | NULL | vector of integers)
Seeds for random number generator for each repetition.

  • If seeds = NA_real_ (default), no seeds are set and the parameter kind is ignored.

  • If seeds = NULL, random seeds are generated automatically and registered in the attribute "seeds".

  • If numeric vector, then these seeds will be used for each repetition of cross-validation. If the number of repetitions is greater than the number of provided seeds, additional seeds are generated and added to the vector. The first seed will be used to ensure reproducibility of the randomly generated seeds.

For more information about random number generation see set.seed.
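The effect of per-repetition seeds can be illustrated with base R alone. The sketch below does not call cvo_create_folds; the helper draw_folds and the seed values are hypothetical, and it only shows that re-using the same seed for a repetition reproduces the same random fold assignment.

```r
# Hypothetical seeds, one per repetition of cross-validation
seeds <- c(101L, 202L, 303L)

# Hypothetical helper: randomly assign n observations to k folds
draw_folds <- function(seed, n = 10, k = 5) {
  set.seed(seed)                      # fix the RNG state for this repetition
  sample(rep_len(seq_len(k), n))      # shuffled fold labels 1..k
}

run1 <- lapply(seeds, draw_folds)
run2 <- lapply(seeds, draw_folds)

# The same seeds give the same fold assignments
identical(run1, run2)
```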

kind

(NULL | character)
The kind of (pseudo)random number generator. Default is NULL, which selects the currently-used generator (including that used in the previous session if the workspace has been restored): if no generator has been used it selects "default".

Generator "L'Ecuyer-CMRG" is recommended if package parallel is used for parallel computing. In this case each seed should have 6 elements, and neither the first three nor the last three should be all zero. More information at set.seed.
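As a rough illustration of what such seeds look like, the base-R sketch below switches to "L'Ecuyer-CMRG", inspects the resulting .Random.seed, and derives a second stream with parallel::nextRNGStream. It is independent of cvo_create_folds, and how the package stores or passes these seeds internally is not shown here.

```r
# Switch generator, remembering the previous settings
old_kind <- RNGkind("L'Ecuyer-CMRG")
set.seed(123)

s1 <- .Random.seed                    # generator code followed by the 6 seed elements
s2 <- parallel::nextRNGStream(s1)     # seed of the next independent stream

length(s1)                            # 1 code + 6 seed elements
s1[-1]                                # the six-element seed itself

# Restore the previous generator settings
do.call(RNGkind, as.list(old_kind))
```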

mode

(character)
Whether to create a caret-like or an mlr-like cvo object. This option is not implemented yet!

returnTrain

(logical | character)
If TRUE, returns indices of observations in the training set (caret style). If FALSE, returns indices of observations in the test set (caret style). If "both", returns indices of observations in both the training and test sets (mlr style).
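Training and test indices of a fold are complements of each other within the full set of row numbers. The base-R sketch below shows that relationship on made-up indices; the fold names and values are hypothetical and do not reproduce the exact structure of a cvo object.

```r
n <- 10                                # hypothetical number of observations

# Hypothetical test-set indices for 5 folds (as with returnTrain = FALSE)
test_idx <- list(Fold1 = c(1, 2), Fold2 = c(3, 4), Fold3 = c(5, 6),
                 Fold4 = c(7, 8), Fold5 = c(9, 10))

# Matching training-set indices: every row that is not in the test set
train_idx <- lapply(test_idx, function(i) setdiff(seq_len(n), i))

train_idx$Fold1
```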

predict

(character(1))
What to predict during resampling: “train”, “test” or “both” sets. Default is “test”.

x

A cvo object.

...

(any)
Further parameters for strategies.

iters (integer(1))

Number of iterations, for “CV”, “Subsample” and “Bootstrap”.

split (numeric(1))

Proportion of training cases for “Holdout” and “Subsample” between 0 and 1. Default is 2 / 3.

reps (integer(1))

Repeats for “RepCV”. Here iters = folds * reps. Default is 10.

folds (integer(1))

Folds in the repeated CV for RepCV. Here iters = folds * reps. Default is 10.

horizon (numeric(1))

Number of observations in the forecast test set for “GrowingWindowCV” and “FixedWindowCV”. When horizon > 1 this will be treated as the number of observations to forecast, else it will be a fraction of the initial window. I.e., for 100 observations, an initial window of .5, and a horizon of .2, the test set will have 10 observations. Default is 1.

initial.window (numeric(1))

Fraction of observations to start with in the training set for “GrowingWindowCV” and “FixedWindowCV”. When initial.window > 1 this will be treated as the number of observations in the initial window, else it will be treated as the fraction of observations to have in the initial window. Default is 0.5.

skip (numeric(1))

How many resamples to skip to thin the total amount for “GrowingWindowCV” and “FixedWindowCV”. This is passed through as the “by” argument in seq(). When skip > 1 this will be treated as the increment of the sequence of resampling indices, else it will be a fraction of the total training indices. I.e., for 100 training sets and a value of .2, the increment of the resampling indices will be 20. Default is “horizon”, which gives mutually exclusive chunks of test indices.

Details

Function cvo_create_folds randomly divides observations into folds that are used for (repeated) k-fold cross-validation. In these folds observations are:

  1. blocked by values in variable block_by (i.e. observations with the same "ID" or other kind of blocking factor are treated as one unit (a block) and are always in the same fold);

  2. stratified by levels of factor variable stratify_by (the proportions of these grouped units of observations per each group (level) are kept approximately constant throughout all folds).
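The two properties above can be sketched in base R. This is not the package's implementation, only one simple way to obtain blocked and stratified folds, under the assumption that every block (ID) belongs to exactly one stratification group. The data mirror DataSet1 from the examples; the names dat, blocks, and test_idx are illustrative.

```r
set.seed(1)
dat <- data.frame(ID = rep(1:20, each = 2),
                  gr = gl(4, 10, labels = LETTERS[1:4]))
k <- 5

# One row per block: fold labels are drawn per block, not per observation
blocks <- unique(dat[, c("ID", "gr")])
blocks$fold <- NA_integer_

# Within each group, spread the blocks evenly over the k folds
for (g in levels(blocks$gr)) {
  idx <- which(blocks$gr == g)
  blocks$fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Every observation inherits the fold of its block
dat$fold <- blocks$fold[match(dat$ID, blocks$ID)]

# Test-set row indices per fold
test_idx <- split(seq_len(nrow(dat)), dat$fold)

table(dat$gr, dat$fold)   # group counts per fold
```

Assigning folds at the block level guarantees blocking, and doing so separately within each group keeps the group proportions approximately equal across folds.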

Value

(list) A list of folds. Each fold contains indices of observations. The structure of the output is similar to the one created by either function createFolds from caret or function makeResampleInstance from mlr.

Note

If folds is too big, so that cases of at least one group (i.e., level in stratify_by) are not included in at least one fold, an error is returned. In that case a smaller value of folds is recommended.
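A quick way to anticipate this situation is to count the blocks in each group before choosing folds. The sketch below is a base-R pre-check, not the check the function performs internally. It relies on the observation that, with blocking, a group with fewer blocks than folds cannot appear in every fold. The names blocks_per_group and max_folds are illustrative.

```r
dat <- data.frame(ID = rep(1:20, each = 2),
                  gr = gl(4, 10, labels = LETTERS[1:4]))

# Number of distinct blocks (IDs) in each stratification group
blocks <- unique(dat[, c("ID", "gr")])
blocks_per_group <- table(blocks$gr)

# The smallest group limits how many folds can each contain every group
max_folds <- min(blocks_per_group)

blocks_per_group
max_folds
```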

Author(s)

Vilmantas Gegzna

See Also

Function createFolds from package caret.
Function makeResampleInstance from package mlr.
To test whether folds are blocked and stratified, see cvo_test_bs.

Examples

library(manyROC)
set.seed(123456)

# Data
DataSet1 <- data.frame(ID = rep(1:20, each = 2),
  gr = gl(4, 10, labels = LETTERS[1:4]),
  .row = 1:40)

# Explore data
str(DataSet1)

table(DataSet1[, c("gr", "ID")])

summary(DataSet1)


# Explore functions
nFolds <- 5

# If variables of data frame are provided:
Folds1_a <- cvo_create_folds(data = DataSet1,
  stratify_by = "gr", block_by = "ID",
  k = nFolds, returnTrain = FALSE)
Folds1_a

str(Folds1_a)

cvo_test_bs(Folds1_a, "gr", "ID", DataSet1)

# If "free" variables are provided:
Folds1_b <- cvo_create_folds(stratify_by = DataSet1$gr,
  block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_b)
cvo_test_bs(Folds1_b, "gr", "ID", DataSet1)

# Not blocked but stratified
Folds1_c <- cvo_create_folds(stratify_by = DataSet1$gr,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_c)
cvo_test_bs(Folds1_c, "gr", "ID", DataSet1)

# Blocked but not stratified
Folds1_d <- cvo_create_folds(block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_d)
cvo_test_bs(Folds1_d, "gr", "ID", DataSet1)

GegznaV/multiROC documentation built on Sept. 15, 2020, 10:33 a.m.