{splitTools} is a fast, lightweight toolkit for data splitting.
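Its two main functions `partition()` and `create_folds()` support data partitioning (e.g. into training, validation, and test sets), the creation of in-sample or out-of-sample folds for cross-validation (CV), repeated CV folds, and stratified, grouped, and blocked splitting.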
The function `create_timefolds()` does time-series splitting in the sense that the out-of-sample data follows the in-sample data.
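As a minimal sketch of what "out-of-sample follows in-sample" means, the snippet below builds time folds for a toy series. The data `y` is hypothetical, and the choices `k = 3` and `type = "extending"` are only for illustration.

```r
library(splitTools)

# Toy series of 24 time-ordered observations (hypothetical data)
y <- seq_len(24)

# Three time folds with an extending in-sample window
folds <- create_timefolds(y, k = 3, type = "extending")
str(folds)

# In every fold, all out-of-sample indices come after the in-sample indices
for (fold in folds) {
  stopifnot(max(fold$insample) < min(fold$outsample))
}
```

Each fold is a list with the elements `insample` and `outsample`, so a model would be fitted on `y[fold$insample]` and evaluated on `y[fold$outsample]`.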
We will now illustrate how to use {splitTools} in a typical modeling workflow.
We will go through the following steps:
1. Split the `iris` data into 60% training, 20% validation, and 20% test data, stratified by the variable `Sepal.Length`. Since this variable is numeric, stratification uses quantile binning.
2. Model the response `Sepal.Length` with a linear regression, once with and once without interaction between `Species` and `Sepal.Width`.
3. Select the better of the two models by its RMSE on the validation data and evaluate it on the test data.

```r
library(splitTools)

# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)

train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]

rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)

rmse(valid$Sepal.Length, predict(fit1, valid))
rmse(valid$Sepal.Length, predict(fit2, valid))

# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
```
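To make the idea of stratifying on a numeric variable via quantile binning more concrete, here is a hand-rolled base R sketch. It is not the package's internal code; the number of bins (five) and the 60% training share are arbitrary assumptions for illustration.

```r
# Hand-rolled illustration of quantile-bin stratification (not splitTools internals)
set.seed(1)
y <- iris$Sepal.Length

# Bin the numeric variable at its quintiles (5 bins is an arbitrary choice)
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = 6)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Within each bin, send roughly 60% of the rows to training
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE)

# Share of each bin that ended up in training, and the resulting distributions
table(bins[train_idx]) / table(bins)
summary(y[train_idx])
summary(y)
```

Because rows are sampled within each bin, the training subset covers the whole range of `Sepal.Length` instead of depending on the luck of a purely random draw.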
Since the `iris` data consists of only 150 rows, investing 20% of observations for validation seems like a waste of resources. Furthermore, the performance estimates might not be very robust. Let's replace simple validation by five-fold CV, again using stratification on the response variable.
1. Split `iris` into 80% training data and 20% test data, stratified by the variable `Sepal.Length`.
2. Use stratified five-fold CV on the training data to choose between the two models.
3. Refit the chosen model on the full training data and evaluate it on the test data.

```r
# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)

train <- iris[inds$train, ]
test <- iris[inds$test, ]

# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)

# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)

for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)

# CV-RMSE of model 2
mean(cv_rmse2)

# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
```
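The loop above works with in-sample indices and obtains the held-out rows by negative indexing. If out-of-sample indices are more convenient, `create_folds()` can return them directly via `invert = TRUE`. The following sketch uses the full `iris` data purely for brevity.

```r
library(splitTools)

y <- iris$Sepal.Length

# Out-of-sample indices instead of in-sample indices
out_folds <- create_folds(y, k = 5, seed = 2734, invert = TRUE)
lengths(out_folds)

# Each observation lands in exactly one out-of-sample fold
all_out <- sort(unlist(out_folds, use.names = FALSE))
stopifnot(length(all_out) == length(y), all(all_out == seq_along(y)))
```

With such folds, the evaluation data of fold `i` is simply `iris[out_folds[[i]], ]`, and the model is fitted on `iris[-out_folds[[i]], ]`.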
If feasible, repeated CV is recommended in order to reduce uncertainty in decisions. Otherwise, the process remains the same.
```r
# Train/test split as before

# 15 folds instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)

# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

mean(cv_rmse1)
mean(cv_rmse2)

# Refit and test as before
```
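One way to judge how much the extra repetitions help is to look at the per-fold difference between the two models and its spread. The sketch below is self-contained and, for brevity, runs on the full `iris` data rather than on a training split. The standard error shown is a naive one (folds overlap, so the differences are not independent) and should be read as a rough guide only.

```r
library(splitTools)

rmse <- function(y, pred) sqrt(mean((y - pred)^2))

# 3 repetitions of 5-fold CV give 15 in-sample index vectors
folds <- create_folds(iris$Sepal.Length, k = 5, m_rep = 3, seed = 2734)

# Per-fold RMSE difference: model without interaction minus model with interaction
diffs <- vapply(folds, function(idx) {
  insample <- iris[idx, ]
  out <- iris[-idx, ]
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  rmse(out$Sepal.Length, predict(fit1, out)) -
    rmse(out$Sepal.Length, predict(fit2, out))
}, numeric(1))

mean(diffs)                      # average difference over all 15 folds
sd(diffs) / sqrt(length(diffs))  # naive standard error, rough guide only
```

A mean difference that is small relative to its spread suggests that the two models are hard to tell apart on this data.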
The function `multi_strata()` creates a stratification factor from multiple columns that can then be passed to `create_folds(..., type = "stratified")` or `partition(..., type = "stratified")`. The resulting partitions will be (quite) balanced regarding these columns.
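Two grouping strategies are offered via the `strategy` argument: `"kmeans"`, which clusters the scaled input columns with k-means, and `"interaction"`, which uses all combinations of the columns after binning numeric ones into approximately `k` quantile groups.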
Let's have a look at a simple example where we want to model `Sepal.Length` as a function of the other variables in the `iris` data set. We want to do a stratified train/valid/test split that is balanced not only regarding the response `Sepal.Length`, but also regarding the important predictor `Species`. In this case, we could use the following workflow:
```r
set.seed(3451)
ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y,
  p = c(train = 0.6, valid = 0.2, test = 0.2),
  split_into_list = FALSE
)

# Check
by(ir, inds, summary)
```
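For comparison, the same split can be sketched with the `"interaction"` strategy. The choice `k = 3` and the seed are arbitrary assumptions for illustration.

```r
library(splitTools)

ir <- iris[c("Sepal.Length", "Species")]

# Combine Species with Sepal.Length binned into roughly k = 3 quantile groups
y <- multi_strata(ir, strategy = "interaction", k = 3)
nlevels(y)

inds <- partition(y, p = c(train = 0.6, valid = 0.2, test = 0.2), seed = 3451)

# Species shares per partition
sapply(inds, function(i) prop.table(table(ir$Species[i])))
```

If the stratification works as intended, the three species should appear in similar proportions in each partition.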