cvo_create_folds: Create a cvo (cross-validation object)
In GegznaV/multiROC: Tools for ROC Analysis


Create indices of folds with blocking and stratification (cvo object).

Create a cross-validation object (cvo), which contains a list of indices for each fold of (repeated) k-fold cross-validation. Blocking and stratification are optional. See "Details" for more.

cvo_create_folds(
  data = NULL,
  stratify_by = NULL,
  block_by = NULL,
  folds = 5,
  times = 1,
  seeds = NA_real_,
  kind = NULL,
  mode = c("caret", "mlr")[1],
  returnTrain = c(TRUE, FALSE, "both")[1],
  predict = c("test", "train", "both")[1],
  k = folds
)

## S3 method for class 'cvo'
print(x, ...)

`data`	A data frame that contains the variables whose names are given by arguments `block_by` and `stratify_by`.
`stratify_by`	A vector, or the name of a factor variable in `data`, whose levels will be used for stratification, e.g., a vector of medical groups.
`block_by`	A vector, or the name of a variable in `data`, that contains identification codes/numbers (ID). These codes will be used for blocking.
`folds, k`	(`integer`) The number of folds, default `folds = 5`.
`times`	(`integer`) The number of repetitions for repeated cross-validation.
`seeds`	(`NA_real_` | `NULL` | vector of integers) Seeds for the random number generator, one for each repetition. If `seeds = NA_real_` (default), no seeds are set and parameter `kind` is ignored. If `seeds = NULL`, random seeds are generated automatically and registered in attribute `"seeds"`. If a numeric vector is given, its elements are used as the seeds for the repetitions of cross-validation. If the number of repetitions is greater than the number of provided seeds, additional seeds are generated and appended to the vector. The first seed is used to ensure reproducibility of the randomly generated seeds. For more information about random number generation, see `set.seed`.
`kind`	(`NULL` | `character`) The kind of (pseudo)random number generator. Default is `NULL`, which selects the currently used generator (including the one used in the previous session if the workspace has been restored); if no generator has been used, it selects `"default"`. Generator `"L'Ecuyer-CMRG"` is recommended if package parallel is used for parallel computing. In this case, each seed should have 6 elements, and neither the first three nor the last three should be all zero. For more information, see `set.seed`.
`mode`	(`character`) Whether to create a caret-like or an mlr-like cvo object. This option is not implemented yet!
`returnTrain`	(`logical` | `character`) If `TRUE`, returns indices of observations in the training set (caret style). If `FALSE`, returns indices of observations in the test set (caret style). If `"both"`, returns indices of observations in both the training and test sets (mlr style).
`predict`	(`character(1)`) What to predict during resampling: “train”, “test” or “both” sets. Default is “test”.
`x`	A `cvo` object.
`...`	(any) Further parameters for strategies.
iters (`integer(1)`) Number of iterations, for “CV”, “Subsample” and “Bootstrap”.
split (`numeric(1)`) Proportion of training cases for “Holdout” and “Subsample”, between 0 and 1. Default is 2 / 3.
reps (`integer(1)`) Repeats for “RepCV”. Here `iters = folds * reps`. Default is 10.
folds (`integer(1)`) Folds in the repeated CV for “RepCV”. Here `iters = folds * reps`. Default is 10.
horizon (`numeric(1)`) Number of observations in the forecast test set for “GrowingWindowCV” and “FixedWindowCV”. When `horizon > 1`, this is treated as the number of observations to forecast; otherwise it is a fraction of the initial window. For example, for 100 observations, an initial window of 0.5 and a horizon of 0.2, the test set will have 10 observations. Default is 1.
initial.window (`numeric(1)`) Fraction of observations to start with in the training set for “GrowingWindowCV” and “FixedWindowCV”. When `initial.window > 1`, this is treated as the number of observations in the initial window; otherwise it is treated as the fraction of observations to have in the initial window. Default is 0.5.
skip (`numeric(1)`) How many resamples to skip to thin the total amount for “GrowingWindowCV” and “FixedWindowCV”. This is passed through as the “by” argument in `seq()`. When `skip > 1`, this is treated as the increment of the sequence of resampling indices; otherwise it is a fraction of the total training indices. For example, for 100 training sets and a value of 0.2, the increment of the resampling indices will be 20. Default is “horizon”, which gives mutually exclusive chunks of test indices.
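
The `seeds` and `kind` arguments described above build on base R's random number generator. The following minimal sketch uses only base R (`set.seed`, `RNGkind`, `sample`) to show the two behaviours the descriptions rely on: a fixed seed reproduces the same random draw, and `"L'Ecuyer-CMRG"` can be selected through the `kind` argument of `set.seed`. It does not show the internals of `cvo_create_folds`; the variable names are illustrative only.

```r
# Remember the current generator settings so they can be restored afterwards
old_kind <- RNGkind()

# The same seed reproduces the same random permutation
set.seed(123456)
perm_a <- sample(10)
set.seed(123456)
perm_b <- sample(10)
identical(perm_a, perm_b)

# Select the generator recommended for use with package parallel
set.seed(1, kind = "L'Ecuyer-CMRG")
current_kind <- RNGkind()[1]
current_kind

# .Random.seed now holds the generator code followed by the six seed integers
seed_length <- length(.Random.seed)

# Restore the previous generator settings
do.call(RNGkind, as.list(old_kind))
```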

Function cvo_create_folds randomly divides observations into folds that are used for (repeated) k-fold cross-validation. In these folds observations are:

blocked by values in variable block_by, i.e., observations with the same "ID" or other kind of blocking factor are treated as one unit (a block) and are always in the same fold;
stratified by levels of factor variable stratify_by, i.e., the proportions of these blocks per group (level) are kept approximately constant across all folds.
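
The two properties above can be illustrated with a short base R sketch. It is not the package's implementation: `assign_folds` is a hypothetical helper that assumes every block belongs to exactly one group, shuffles the blocks within each group, and deals them out to the folds in turn.

```r
# Hypothetical helper (not part of the package): returns a fold label for
# every observation so that blocks stay together and groups are balanced.
assign_folds <- function(block, group, k) {
  # One row per block; assumes each block belongs to exactly one group
  blocks <- unique(data.frame(block = block, group = group))
  stopifnot(!anyDuplicated(blocks$block))
  blocks$fold <- NA_integer_
  for (g in as.character(unique(blocks$group))) {
    idx <- which(blocks$group == g)
    # Shuffle the blocks of this group, then deal them out to folds in turn
    idx <- idx[sample.int(length(idx))]
    blocks$fold[idx] <- rep_len(seq_len(k), length(idx))
  }
  # Map the fold label of each block back to its observations
  blocks$fold[match(block, blocks$block)]
}

set.seed(1)
toy <- data.frame(ID = rep(1:20, each = 2),
                  gr = gl(4, 10, labels = LETTERS[1:4]))
toy$fold <- assign_folds(toy$ID, toy$gr, k = 5)

# Blocking: every ID appears in exactly one fold
folds_per_id <- tapply(toy$fold, toy$ID, function(f) length(unique(f)))
all(folds_per_id == 1)

# Stratification: observations per group and fold
group_by_fold <- table(toy$gr, toy$fold)
group_by_fold
```

In this toy data each group has exactly five blocks, so with five folds every fold receives one block per group.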

(list) A list of folds. Each fold contains indices of observations. The structure of the output is similar to that created by either function createFolds from caret or function makeResampleInstance from mlr.
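
As a rough illustration of what a list of fold indices looks like, the base R sketch below turns a vector of fold labels into test-set indices and their complementary training-set indices, which correspond to `returnTrain = FALSE` and `returnTrain = TRUE` respectively. The fold labels and element names are made up for the example, and the actual cvo object may carry additional attributes.

```r
# Made-up fold labels for 9 observations and 3 folds
fold_id <- rep_len(1:3, 9)
all_idx <- seq_along(fold_id)

# Test-set indices: observations that belong to each fold
test_idx <- split(all_idx, fold_id)
names(test_idx) <- paste0("Fold", names(test_idx))

# Training-set indices: all observations outside each fold
train_idx <- lapply(test_idx, function(i) setdiff(all_idx, i))

str(test_idx)
str(train_idx)
```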

If folds is too large, so that at least one fold contains no cases from at least one group (i.e., level of stratify_by), an error is returned. In that case, a smaller value of folds is recommended.
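
One way to choose a safe value of folds beforehand is to count the blocks in the smallest group. The sketch below assumes, as one plausible reading of this note, that every fold needs at least one block from every group; `max_safe_folds` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: the number of blocks in the smallest group, taken
# here as an upper bound for `folds` (an assumption, see the text above)
max_safe_folds <- function(block, group) {
  blocks <- unique(data.frame(block = block, group = group))
  # Drop unused factor levels so that empty groups are not counted
  min(table(droplevels(factor(blocks$group))))
}

toy <- data.frame(ID = rep(1:20, each = 2),
                  gr = gl(4, 10, labels = LETTERS[1:4]))

# Each of the 4 groups contains 5 distinct IDs
upper_bound <- max_safe_folds(toy$ID, toy$gr)
upper_bound
```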

Vilmantas Gegzna

Function createFolds from package caret.
Function makeResampleInstance from package mlr.
To test whether folds are blocked and stratified, see cvo_test_bs.

library(manyROC)
set.seed(123456)

# Data
DataSet1 <- data.frame(ID = rep(1:20, each = 2),
  gr = gl(4, 10, labels = LETTERS[1:4]),
  .row = 1:40)

# Explore data
str(DataSet1)

table(DataSet1[, c("gr", "ID")])

summary(DataSet1)


# Explore functions
nFolds <- 5

# If variables of data frame are provided:
Folds1_a <- cvo_create_folds(data = DataSet1,
  stratify_by = "gr", block_by = "ID",
  k = nFolds, returnTrain = FALSE)
Folds1_a

str(Folds1_a)

cvo_test_bs(Folds1_a, "gr", "ID", DataSet1)

# If "free" variables are provided:
Folds1_b <- cvo_create_folds(stratify_by = DataSet1$gr,
  block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_b)
cvo_test_bs(Folds1_b, "gr", "ID", DataSet1)

# Not blocked but stratified
Folds1_c <- cvo_create_folds(stratify_by = DataSet1$gr,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_c)
cvo_test_bs(Folds1_c, "gr", "ID", DataSet1)

# Blocked but not stratified
Folds1_d <- cvo_create_folds(block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_d)
cvo_test_bs(Folds1_d, "gr", "ID", DataSet1)