kfold: (Un)Stratified k-fold for any type of label


View source: R/kfold.R

Description

This function creates stratified or unstratified folds from a label vector.

Usage

kfold(y, k = 5, type = "random", seed = 0, named = TRUE)

Arguments

y

Type: numeric. The label vector (not a factor).

k

Type: integer. The number of folds to create. Causes issues if length(y) < k (i.e., more folds than samples). Defaults to 5.

type

Type: character. How the folds are built: stratified (keeps the same label proportions in each fold, for classification), treatment (makes each fold exclusive with respect to the values of the label vector), pseudo (pseudo-random; attempts to minimize the variance between folds, for regression), or random (fully random folds). Defaults to "random".

seed

Type: integer. The seed for the random number generator. Defaults to 0.

named

Type: logical. Whether the folds should be named. Defaults to TRUE.
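To make the stratified option more concrete, here is a minimal base-R sketch of one common way to build stratified folds: fold labels are dealt out separately within each class, so every fold keeps roughly the same class proportions. The function name stratified_folds_sketch is hypothetical, and this is an illustration under that assumption, not the code used by kfold.

```r
# Hypothetical helper, NOT the package implementation of kfold().
stratified_folds_sketch <- function(y, k = 5, seed = 0) {
  set.seed(seed)
  fold_id <- integer(length(y))
  # Handle each class separately so its rows are spread over all k folds
  for (idx in split(seq_along(y), y)) {
    labels <- rep(seq_len(k), length.out = length(idx))
    # Shuffle the fold labels within the class
    fold_id[idx] <- labels[sample.int(length(idx))]
  }
  # Return a list of row-index vectors, one per fold
  split(seq_along(y), fold_id)
}

y <- c(rep(0, 250), rep(1, 250))
folds <- stratified_folds_sketch(y, k = 5, seed = 111)
# Share of class 1 in each fold
print(sapply(folds, function(f) mean(y[f])))
```

With two balanced classes of 250 rows each, every fold in this sketch receives 50 rows of each class.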

Details

Unlike Laurae::kfold, do not use stratified for regression; use pseudo instead. I received complaints about odd fold generation when stratification was applied to regression labels: it does not work the way it was intended. Use stratified for classification stratification and pseudo for regression stratification.
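The idea behind balancing folds for regression can be sketched in base R. One plausible approach, assumed here purely for illustration, is to sort the labels, cut the sorted order into blocks of k consecutive values, and randomly hand one value from each block to each fold. The function name pseudo_folds_sketch is hypothetical, and the package's pseudo type may work differently.

```r
# Hypothetical helper, NOT the package implementation of type = "pseudo".
pseudo_folds_sketch <- function(y, k = 5, seed = 0) {
  set.seed(seed)
  n <- length(y)
  ord <- order(y)                      # row indices sorted by label value
  block <- ceiling(seq_len(n) / k)     # blocks of k consecutive ranks
  # Within each block, assign a random permutation of fold labels 1..k
  fold_sorted <- unlist(
    lapply(split(seq_len(n), block), function(idx) sample(k)[seq_along(idx)]),
    use.names = FALSE
  )
  fold_id <- integer(n)
  fold_id[ord] <- fold_sorted          # map labels back to original row order
  split(seq_len(n), fold_id)
}

y <- 1:5000
folds <- pseudo_folds_sketch(y, k = 5, seed = 111)
# Fold means stay close to the overall mean of 2500.5
print(sapply(folds, function(f) mean(y[f])))
```

Because every fold receives exactly one value from each block of neighbouring labels, the fold means in this sketch cannot drift far from the overall mean.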

Value

A list of vectors for each fold, where an integer represents the row number.
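A list of row-index vectors like this is typically consumed in a cross-validation loop. The sketch below builds a stand-in list with base R (so it runs without the package) and treats each fold in turn as the held-out set; the toy "model" that predicts the training mean is only a placeholder.

```r
# Stand-in for a kfold() result: a list of integer row-index vectors.
set.seed(0)
y <- rnorm(20)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = length(y)))
folds <- split(seq_along(y), fold_id)

cv_error <- numeric(length(folds))
for (i in seq_along(folds)) {
  test_idx <- folds[[i]]                          # rows held out in this round
  train_idx <- setdiff(seq_along(y), test_idx)    # all remaining rows
  pred <- mean(y[train_idx])                      # toy model: training mean
  cv_error[i] <- mean((y[test_idx] - pred)^2)     # held-out mean squared error
}
print(cv_error)
```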

Examples

# Reproducible Stratified folds
data <- 1:5000
folds1 <- kfold(y = data, k = 5, type = "pseudo", seed = 111)
folds2 <- kfold(y = data, k = 5, type = "pseudo", seed = 111)
identical(folds1, folds2)

# Treatments
data <- rep(1:50, each = 50)
str(kfold(y = data, k = 5, type = "treatment"))

# Stratified Classification
data <- c(rep(0, 250), rep(1, 250))
folds <- kfold(y = data, k = 5, type = "stratified")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}

# Stratified Regression
data <- 1:5000
folds <- kfold(y = data, k = 5, type = "pseudo")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}

# Stratified Multi-class Classification
data <- c(rep(0, 250), rep(1, 250), rep(2, 250))
folds <- kfold(y = data, k = 5, type = "stratified")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}

# Unstratified Regression
data <- 1:5000
folds <- kfold(y = data, k = 5, type = "random")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}

# Unstratified Multi-class Classification
data <- c(rep(0, 250), rep(1, 250), rep(2, 250))
folds <- kfold(y = data, k = 5, type = "random")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}

Laurae2/LauraeDS documentation built on May 29, 2019, 2:25 p.m.