Cross-validation is an important technique used in model selection and hyper-parameter optimization. Scores from cross-validation are a good estimation of test score of a predictive model in test data or new data as long as the IID assumption approximately holds in data. This package aims to provide a standardized pipeline for performing cross-validation for different modeling functions in R. In addition, summary statistics of the cross-validation results are provided for users.
The CrossR
package (short for Cross-validation in R) is a set of functions for implementing cross-validation inside the R environment.
Cross-validation can be implemented with the caret
package in R. caret
contains the function createDataPartition()
to split the data and train_Control()
to apply cross-validation with different methods depending on the method
argument. We have observed that caret
functions have some features that make the cross-validation process cumbersome. createDataPartition()
splits the indices of the data which could be used later on to actually split the data into training and test data. This will be applied with one step using split_data()
in CrossR
.
Three main functions in CrossR
:
train_test_split
: This function partitions data into k
-fold and returns the partitioned indices. A random shuffling option is provided. (stratification
option for imbalanced representations will also be included if time allows).
cross_validation
: This function performs k
-fold cross validation using the partitioned data and a selected model. It returns the scores of each validation. Additional methods for corss validation will be implemented (such as "Leave-One-Out" if time allows).
summary_cv
: This function outputs summary statistics(mean, median, standard deviation) of cross-validation scores.
For installation of this package:
devtools::install_github("UBC-MDS/CrossR")
Usage examples:
library(CrossR) data <- gen_data(100) X <- data.frame(data[[1]]) y_vec <- data[[2]] y <- data.frame(y = y_vec) split_data <- train_test_split(X, y, test_size = 0.25, random_state = 0, shuffle = TRUE) scores <- cross_validation(split_data[['X_train']], split_data[['y_train']], k = 3, shuffle = TRUE, random_state = 0) summary_cv(scores)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.