# createDataPartition: Data Splitting functions In caret: Classification and Regression Training

## Description

A series of test/training partitions are created using `createDataPartition` while `createResample` creates one or more bootstrap samples. `createFolds` splits the data into `k` groups while `createTimeSlices` creates cross-validation split for series data. `groupKFold` splits the data based on a grouping factor.

## Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17``` ```createDataPartition( y, times = 1, p = 0.5, list = TRUE, groups = min(5, length(y)) ) createFolds(y, k = 10, list = TRUE, returnTrain = FALSE) createMultiFolds(y, k = 10, times = 5) createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE, skip = 0) groupKFold(group, k = length(unique(group))) createResample(y, times = 10, list = TRUE) ```

## Arguments

 `y` a vector of outcomes. For `createTimeSlices`, these should be in chronological order. `times` the number of partitions to create `p` the percentage of data that goes to training `list` logical - should the results be in a list (`TRUE`) or a matrix with the number of rows equal to `floor(p * length(y))` and `times` columns. `groups` for numeric `y`, the number of breaks in the quantiles (see below) `k` an integer for the number of folds. `returnTrain` a logical. When true, the values returned are the sample positions corresponding to the data used during training. This argument only works in conjunction with `list = TRUE` `initialWindow` The initial number of consecutive values in each training set sample `horizon` the number of consecutive values in test set sample `fixedWindow` logical, if `FALSE`, all training samples start at 1 `skip` integer, how many (if any) resamples to skip to thin the total amount `group` a vector of groups whose length matches the number of rows in the overall data set.

## Details

For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of `y` when `y` is a factor in an attempt to balance the class distributions within the splits.

For numeric `y`, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For `createDataPartition`, the number of percentiles is set via the `groups` argument. For `createFolds` and `createMultiFolds`, the number of groups is set dynamically based on the sample size and `k`. For smaller samples sizes, these two functions may not do stratified splitting and, at most, will split the data into quartiles.

Also, for `createDataPartition`, very small class sizes (<= 3) the classes may not show up in both the training and test data

For multiple k-fold cross-validation, completely independent folds are created. The names of the list objects will denote the fold membership using the pattern "Foldi.Repj" meaning the ith section (of k) of the jth cross-validation set (of `times`). Note that this function calls `createFolds` with `list = TRUE` and `returnTrain = TRUE`.

Hyndman and Athanasopoulos (2013)) discuss rolling forecasting origin techniques that move the training and test sets in time. `createTimeSlices` can create the indices for this type of splitting.

For Group k-fold cross-validation, the data are split such that no group is contained in both the modeling and holdout sets. One or more group could be left out, depending on the value of `k`.

## Value

A list or matrix of row position integers corresponding to the training data. For `createTimeSlices` subsamples are named by the end index of each training subsample.

## Author(s)

Max Kuhn, `createTimeSlices` by Tony Cooper

## References

Hyndman and Athanasopoulos (2013), Forecasting: principles and practice. https://otexts.com/fpp2/

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33``` ```data(oil) createDataPartition(oilType, 2) x <- rgamma(50, 3, .5) inA <- createDataPartition(x, list = FALSE) plot(density(x[inA])) rug(x[inA]) points(density(x[-inA]), type = "l", col = 4) rug(x[-inA], col = 4) createResample(oilType, 2) createFolds(oilType, 10) createFolds(oilType, 5, FALSE) createFolds(rnorm(21)) createTimeSlices(1:9, 5, 1, fixedWindow = FALSE) createTimeSlices(1:9, 5, 1, fixedWindow = TRUE) createTimeSlices(1:9, 5, 3, fixedWindow = TRUE) createTimeSlices(1:9, 5, 3, fixedWindow = FALSE) createTimeSlices(1:15, 5, 3) createTimeSlices(1:15, 5, 3, skip = 2) createTimeSlices(1:15, 5, 3, skip = 3) set.seed(131) groups <- sort(sample(letters[1:4], size = 20, replace = TRUE)) table(groups) folds <- groupKFold(groups) lapply(folds, function(x, y) table(y[x]), y = groups) ```

caret documentation built on Oct. 9, 2021, 5:09 p.m.