Cross-Validation for Model Selection
Authors: Ludvig R. Olsen (r-pkgs@ludvigolsen.dk), Hugh Benjamin Zachariae
License: MIT
Started: October 2016
R package for model evaluation and comparison.
Currently supports regression (`'gaussian'`), binary classification (`'binomial'`), and (some functions only) multiclass classification (`'multinomial'`). Many of the functions allow parallelization, e.g. through the `doParallel` package.
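A rough sketch of how that can look (this assumes the `parallel` argument in functions like `cross_validate()` and a registered `doParallel` backend; see the individual function documentation for details):

```r
# Sketch: register a parallel backend before calling the cvms functions
library(doParallel)
doParallel::registerDoParallel(cores = 4)  # e.g. 4 cores

# Then enable parallelization in the cvms call, e.g.:
# cross_validate(data, formulas, fold_cols = ".folds",
#                family = "gaussian", parallel = TRUE)
```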
NEW: Our new application for plotting confusion matrices with `plot_confusion_matrix()` without any code is now available on Huggingface Spaces.
| Function | Description |
|:----------------------|:-------------------------------------------------------------------|
| `cross_validate()` | Cross-validate linear models with `lm()` / `lmer()` / `glm()` / `glmer()` |
| `cross_validate_fn()` | Cross-validate a custom model function |
| `validate()` | Validate linear models with (`lm` / `lmer` / `glm` / `glmer`) |
| `validate_fn()` | Validate a custom model function |
| `evaluate()` | Evaluate predictions with a large set of metrics |
| `baseline()`, `baseline_gaussian()`, `baseline_binomial()`, `baseline_multinomial()` | Perform baseline evaluations of a dataset |
| Function | Description |
|:----------------------|:-------------------------------------------------------------------|
| `confusion_matrix()` | Create a confusion matrix from predictions and targets |
| `evaluate_residuals()` | Evaluate residuals from a regression task |
| `most_challenging()` | Find the observations that were the most challenging to predict |
| `summarize_metrics()` | Summarize numeric columns with a set of descriptors |
| Function | Description |
|:------------------------|:-------------------------------------------------------------------|
| `combine_predictors()` | Generate model formulas from a list of predictors |
| `reconstruct_formulas()` | Extract formulas from output tibble |
| `simplify_formula()` | Remove inline functions (and more) from a formula object |
| Function | Description |
|:-------------------------|:-------------------------------------------------------------------|
| `plot_confusion_matrix()` | Plot a confusion matrix (see also our no-code application) |
| `plot_metric_density()` | Create a density plot for a metric column |
| `font()` | Set font settings for plotting functions (currently only `plot_confusion_matrix()`) |
| `sum_tile_settings()` | Set settings for sum tiles in `plot_confusion_matrix()` |
| Function | Description |
|:--------------------------|:--------------------------------------------------------|
| `model_functions()` | Example model functions for `cross_validate_fn()` |
| `predict_functions()` | Example predict functions for `cross_validate_fn()` |
| `preprocess_functions()` | Example preprocess functions for `cross_validate_fn()` |
| `update_hyperparameters()` | Manage hyperparameters in custom model functions |
| Function | Description |
|:----------------------|:-------------------------------------------------------------------|
| `select_metrics()` | Select the metric columns from the output |
| `select_definitions()` | Select the model-defining columns from the output |
| `gaussian_metrics()`, `binomial_metrics()`, `multinomial_metrics()` | Create list of metrics for the common `metrics` argument |
| `multiclass_probability_tibble()` | Generate a multiclass probability tibble |
| Name | Description |
|:--------------------------|:-------------------------------------------------------------------|
| `participant.scores` | Made-up experiment data with 10 participants and two diagnoses |
| `wines` | A list of wine varieties in an approximately Zipfian distribution |
| `musicians` | Made-up data on 60 musicians in 4 groups for multiclass classification |
| `predicted.musicians` | Predictions by 3 classifiers of the 4 classes in the `musicians` dataset |
| `precomputed.formulas` | Fixed effect combinations for model formulas with/without two- and three-way interactions |
| `compatible.formula.terms` | 162,660 pairs of compatible terms for building model formulas with up to 15 fixed effects |
Check `NEWS.md` for the full list of changes.

`1.2.0` contained multiple breaking changes. Please see `NEWS.md`. (18th of October 2020)

CRAN:

```r
install.packages("cvms")
```

Development version:

```r
install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")
devtools::install_github("LudvigOlsen/cvms")
```
`cvms` contains a number of vignettes with relevant use cases and descriptions:

```r
vignette(package = "cvms")  # for an overview
```
Attach the packages used in the following examples:

```r
library(cvms)
library(groupdata2)  # fold() partition()
library(knitr)       # kable()
library(dplyr)       # %>% arrange()
```
The dataset `participant.scores` comes with `cvms`:

```r
data <- participant.scores
```
Create a grouping factor for subsetting of folds using `groupdata2::fold()`. Order the dataset by the folds:

```r
# Set seed for reproducibility
set.seed(7)

# Fold data
data <- fold(
  data = data,
  k = 4,
  cat_col = 'diagnosis',
  id_col = 'participant'
) %>%
  arrange(.folds)

# Show first 15 rows of data
data %>% head(15) %>% kable()
```
Cross-validate a Gaussian (linear regression) model of `score` predicted by `diagnosis`:

```r
CV1 <- cross_validate(
  data = data,
  formulas = "score ~ diagnosis",
  fold_cols = '.folds',
  family = 'gaussian',
  REML = FALSE
)

# Show results
CV1

# Let's take a closer look at the different parts of the output

# Metrics and formulas
CV1 %>% select_metrics() %>% kable()

# Just the formulas
CV1 %>% select_definitions() %>% kable()

# Nested predictions
# Note that [[1]] picks predictions for the first row
CV1$Predictions[[1]] %>% head() %>% kable()

# Nested results from the different folds
CV1$Results[[1]] %>% kable()

# Nested model coefficients
# Note that you have the full p-values,
# but kable() only shows a certain number of digits
CV1$Coefficients[[1]] %>% kable()

# Additional information about the model
# and the training process
CV1 %>% select(14:19, 21) %>% kable()

CV1$Process[[1]]
```
Cross-validate a binomial (logistic regression) model of `diagnosis` predicted by `score`:

```r
CV2 <- cross_validate(
  data = data,
  formulas = "diagnosis ~ score",
  fold_cols = '.folds',
  family = 'binomial'
)

# Show results
CV2

# Let's take a closer look at the different parts of the output
# We won't repeat the parts too similar to those in Gaussian

# Metrics
CV2 %>% select(1:9) %>% kable(digits = 5)
CV2 %>% select(10:15) %>% kable()

# Confusion matrix
CV2$`Confusion Matrix`[[1]] %>% kable()

# Plot confusion matrix
plot_confusion_matrix(CV2$`Confusion Matrix`[[1]], add_sums = TRUE)
```
We can cross-validate multiple model formulas at once. Here, we define both regular and mixed-effects formulas:

```r
model_formulas <- c("score ~ diagnosis", "score ~ age")
mixed_model_formulas <- c("score ~ diagnosis + (1|session)",
                          "score ~ age + (1|session)")
```
```r
CV3 <- cross_validate(
  data = data,
  formulas = model_formulas,
  fold_cols = '.folds',
  family = 'gaussian',
  REML = FALSE
)

# Show results
CV3
```
```r
CV4 <- cross_validate(
  data = data,
  formulas = mixed_model_formulas,
  fold_cols = '.folds',
  family = 'gaussian',
  REML = FALSE
)

# Show results
CV4
```
Instead of only dividing our data into folds once, we can do it multiple times and average the results. As the models can be ranked differently with different splits, this is generally preferable.
Let's first add some extra fold columns. We will use the `num_fold_cols` argument to add 3 unique fold columns. We tell `fold()` to keep the existing fold column and simply add three extra columns. We could also choose to remove the existing fold column, if, for instance, we were changing the number of folds (`k`). Note that the original fold column will be renamed to `".folds_1"`.
```r
# Set seed for reproducibility
set.seed(2)

# Ungroup data
# Otherwise we would create folds within the existing folds
data <- dplyr::ungroup(data)

# Fold data
data <- fold(
  data = data,
  k = 4,
  cat_col = 'diagnosis',
  id_col = 'participant',
  num_fold_cols = 3,
  handle_existing_fold_cols = "keep"
)

# Show first 10 rows of data
data %>% head(10) %>% kable()
```
Now, let's cross-validate the four fold columns. We use `paste0()` to create the four column names:
```r
CV5 <- cross_validate(
  data = data,
  formulas = c("diagnosis ~ score", "diagnosis ~ score + age"),
  fold_cols = paste0(".folds_", 1:4),
  family = 'binomial'
)

# Show results
CV5

# Subset of the results per fold for the first model
CV5$Results[[1]] %>% select(1:8) %>% kable()
```
`cross_validate_fn()` allows us to cross-validate a custom model function, like a support vector machine or a neural network. It works with regression (`gaussian`), binary classification (`binomial`), and multiclass classification (`multinomial`).
It is required to pass a model function and a predict function. Further, it is possible to pass a preprocessing function and a list of hyperparameter values to test with grid search. You can check the requirements for these functions at `?cross_validate_fn`.
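As a rough sketch of what a hyperparameter grid could look like (the parameter names below are just examples, and the exact format of the `hyperparameters` argument is described in `?cross_validate_fn`):

```r
# Sketch: a hypothetical hyperparameter grid for cross_validate_fn()
example_hparams <- list(
  "kernel" = c("linear", "radial"),
  "cost" = c(1, 5, 10)
)

# It would be passed via the `hyperparameters` argument, e.g.:
# cross_validate_fn(..., hyperparameters = example_hparams)
# Each combination is then made available to the custom model function
# as a named list, e.g. hyperparameters[["cost"]]
```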
Let's cross-validate a support-vector machine using the `svm()` function from the `e1071` package.
First, we will create a model function. You can do anything you want inside it, as long as it takes the arguments `train_data`, `formula`, and `hyperparameters` and returns a fitted model object:
```r
# Create model function
#
# train_data : tibble with the training data
# formula : a formula object
# hyperparameters : a named list of hyperparameters
svm_model_fn <- function(train_data, formula, hyperparameters){

  # Note that `formula` must be passed first
  # when calling svm(), otherwise it fails
  e1071::svm(
    formula = formula,
    data = train_data,
    kernel = "linear",
    type = "C-classification",
    probability = TRUE
  )
}
```
We also need a predict function. This will usually wrap the `stats::predict()` function. The point is to ensure that the predictions have the correct format. In this case, we want a single column with the probability of the positive class.
Note that you do not need to use the `formula`, `hyperparameters`, and `train_data` arguments within your function. These are there for the few cases where they are needed.
```r
# Create predict function
#
# test_data : tibble with the test data
# model : fitted model object
# formula : a formula object
# hyperparameters : a named list of hyperparameters
# train_data : tibble with the training data
svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data){

  # Predict the test set with the model
  predictions <- stats::predict(
    object = model,
    newdata = test_data,
    allow.new.levels = TRUE,
    probability = TRUE
  )

  # Extract the probabilities
  # Usually the predict function will just
  # output the probabilities directly
  probabilities <- dplyr::as_tibble(
    attr(predictions, "probabilities")
  )

  # Return second column
  # with probabilities of positive class
  probabilities[[2]]
}
```
With these functions defined, we can cross-validate the support-vector machine:
```r
# Cross-validate svm_model_fn
CV6 <- cross_validate_fn(
  data = data,
  model_fn = svm_model_fn,
  predict_fn = svm_predict_fn,
  formulas = c("diagnosis ~ score", "diagnosis ~ age"),
  fold_cols = '.folds_1',
  type = 'binomial'
)

CV6
```
Let's try with a naïve Bayes classifier as well. First, we will define the model function:
```r
# Create model function
#
# train_data : tibble with the training data
# formula : a formula object
# hyperparameters : a named list of hyperparameters
nb_model_fn <- function(train_data, formula, hyperparameters){
  e1071::naiveBayes(
    formula = formula,
    data = train_data
  )
}
```
And the predict function:
```r
# Create predict function
#
# test_data : tibble with the test data
# model : fitted model object
# formula : a formula object
# hyperparameters : a named list of hyperparameters
# train_data : tibble with the training data
nb_predict_fn <- function(test_data, model, formula, hyperparameters, train_data){
  stats::predict(
    object = model,
    newdata = test_data,
    type = "raw",
    allow.new.levels = TRUE
  )[, 2]
}
```
With both functions specified, we are ready to cross-validate our naïve Bayes classifier:
```r
CV7 <- cross_validate_fn(
  data = data,
  model_fn = nb_model_fn,
  predict_fn = nb_predict_fn,
  formulas = c("diagnosis ~ score", "diagnosis ~ age"),
  type = 'binomial',
  fold_cols = '.folds_1'
)

CV7
```
If we wish to investigate why some observations are harder to predict than others, we should start by identifying the most challenging observations. This can be done with `most_challenging()`.
Let's first extract the predictions from some of the cross-validation results:
```r
glm_predictions <- dplyr::bind_rows(CV5$Predictions, .id = "Model")
svm_predictions <- dplyr::bind_rows(CV6$Predictions, .id = "Model")
nb_predictions <- dplyr::bind_rows(CV7$Predictions, .id = "Model")

predictions <- dplyr::bind_rows(
  glm_predictions,
  svm_predictions,
  nb_predictions,
  .id = "Architecture"
)

predictions[["Target"]] <- as.character(predictions[["Target"]])

predictions
```
Now, let's find the overall most difficult to predict observations. `most_challenging()` calculates the `Accuracy`, `MAE`, and `Cross-Entropy` for each prediction. We can then extract the observations with the ~20% highest `MAE` scores. Note that `most_challenging()` works with grouped data frames as well.
```r
challenging <- most_challenging(
  data = predictions,
  prediction_cols = "Prediction",
  type = "binomial",
  threshold = 0.20,
  threshold_is = "percentage"
)

challenging
```
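As noted above, `most_challenging()` also respects `dplyr` groups. A minimal sketch of finding the hardest observations within each model architecture (reusing the `Architecture` column created above):

```r
# Sketch: most challenging observations per model architecture
predictions %>%
  dplyr::group_by(Architecture) %>%
  most_challenging(
    prediction_cols = "Prediction",
    type = "binomial",
    threshold = 0.20,
    threshold_is = "percentage"
  )
```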
We can then extract the difficult observations from the dataset. First, we add an index to the dataset. Then, we perform a right-join, to only get the rows that are in the `challenging` data frame.
```r
# Index with values 1:30
data[["Observation"]] <- seq_len(nrow(data))

# Add information to the challenging observations
challenging <- data %>%
  # Remove fold columns for clarity
  dplyr::select(-c(.folds_1, .folds_2, .folds_3, .folds_4)) %>%
  # Add the scores
  dplyr::right_join(challenging, by = "Observation")

challenging %>% kable()
```
Note: You may have to scroll to the right in the table.
We can also evaluate predictions from a model trained outside `cvms`. This works with regression (`'gaussian'`), binary classification (`'binomial'`), and multiclass classification (`'multinomial'`).
Extract the targets and predictions from the first cross-validation we performed and evaluate them with `evaluate()`. We group the data frame by the `Fold` column to evaluate each fold separately:
```r
# Extract the predictions from the first cross-validation
predictions <- CV1$Predictions[[1]]
predictions %>% head(6) %>% kable()

# Evaluate the predictions per fold
predictions %>%
  group_by(Fold) %>%
  evaluate(
    target_col = "Target",
    prediction_cols = "Prediction",
    type = "gaussian"
  )
```
We can do the same for the predictions from the second, binomial cross-validation:
```r
# Extract the predictions from the second cross-validation
predictions <- CV2$Predictions[[1]]
predictions %>% head(6) %>% kable()

# Evaluate the predictions per fold
predictions %>%
  group_by(Fold) %>%
  evaluate(
    target_col = "Target",
    prediction_cols = "Prediction",
    type = "binomial"
  )
```
We will use the `multiclass_probability_tibble()` helper to generate a data frame with predicted probabilities for three classes, along with the predicted class and the target class. Then, we will 1) evaluate the three probability columns against the targets (preferable format), and 2) evaluate the predicted classes against the targets:
```r
# Create dataset for multinomial evaluation
multiclass_data <- multiclass_probability_tibble(
  num_classes = 3, # Here, number of predictors
  num_observations = 30,
  apply_softmax = TRUE,
  add_predicted_classes = TRUE,
  add_targets = TRUE
)

multiclass_data

# Evaluate probabilities
# One prediction column *per class*
ev <- evaluate(
  data = multiclass_data,
  target_col = "Target",
  prediction_cols = paste0("class_", 1:3),
  type = "multinomial"
)

ev

# The one-vs-all evaluations
ev$`Class Level Results`[[1]]

# Evaluate the predicted classes
# One prediction column with the class names
evaluate(
  data = multiclass_data,
  target_col = "Target",
  prediction_cols = "Predicted Class",
  type = "multinomial"
)
```
While it's common to find the chance-level baseline analytically (in classification tasks), it's often possible to get a better evaluation than that by chance. Hence, it is useful to check the range of our metrics when randomly guessing the probabilities.
Usually, we use `baseline()` on our test set at the start of our modeling process, so we know what level of performance we should beat.
Note: While `baseline()` works with all three families (`gaussian`, `binomial`, and `multinomial`), each family also has a wrapper function (e.g. `baseline_gaussian()`) that is easier to use. We use those here.
Start by partitioning the dataset:
```r
# Set seed for reproducibility
set.seed(1)

# Partition the dataset
partitions <- groupdata2::partition(
  participant.scores,
  p = 0.7,
  cat_col = 'diagnosis',
  id_col = 'participant',
  list_out = TRUE
)

train_set <- partitions[[1]]
test_set <- partitions[[2]]
```
Approach: `n` random sets of predictions are evaluated against the dependent variable in the test set. We also evaluate a set of all `0`s and a set of all `1`s.
Create the baseline evaluations:
```r
# Perform binomial baseline evaluation
# Note: It's worth enabling parallelization (see ?baseline examples)
binomial_baseline <- baseline_binomial(
  test_data = test_set,
  dependent_col = "diagnosis",
  n = 100
)

binomial_baseline$summarized_metrics
```
On average, we can expect an `F1` score of approximately `0.481`. The maximum `F1` score achieved by randomly guessing was `0.833` though. That's likely because of the small size of the test set, but it illustrates how such information could be useful in a real-life scenario.
The `All_1` row shows us that we can achieve an `F1` score of `0.667` by always predicting `1`. Some model architectures, like neural networks, have a tendency to always predict the majority class. Such a model is quite useless of course, which is why it is good to be aware of the performance it could achieve. We could also check the confusion matrix for such a pattern.
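As a small illustration (not part of the original workflow), such a check could look roughly like this, building a confusion matrix for a set of all-`1` predictions with `confusion_matrix()`:

```r
# Sketch: the confusion matrix when always predicting the positive class (1)
confusion_matrix(
  targets = test_set$diagnosis,
  predictions = rep(1, nrow(test_set))
)
```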
```r
binomial_baseline$random_evaluations
```
We can plot the distribution of `F1` scores from the random evaluations:
```r
# First, remove the NAs from the F1 column
random_evaluations <- binomial_baseline$random_evaluations
random_evaluations <- random_evaluations[!is.na(random_evaluations$F1), ]

# Create density plot for F1
plot_metric_density(
  baseline = random_evaluations,
  metric = "F1",
  xlim = c(0, 1)
)
```
Approach: Creates one-vs-all (binomial) baseline evaluations for `n` sets of random predictions against the dependent variable, along with sets where every prediction is class `x`, `y`, `z`, etc.
Create the baseline evaluations:
```r
multiclass_baseline <- baseline_multinomial(
  test_data = multiclass_data,
  dependent_col = "Target",
  n = 100
)

# Summarized metrics
multiclass_baseline$summarized_metrics
```
The `CL_` measures describe the `Class Level Results` (aka. the one-vs-all evaluations). One of the classes has a maximum `Balanced Accuracy` score of `0.770`, while the maximum `Balanced Accuracy` in the random evaluations is `0.664`.
```r
# Summarized class level results for class 1
multiclass_baseline$summarized_class_level_results %>%
  dplyr::filter(Class == "class_1") %>%
  tidyr::unnest(Results)

# Random evaluations
# Note that the class level results for each repetition
# are available as well
multiclass_baseline$random_evaluations
```
Approach: The baseline model (`y ~ 1`), where `1` is simply the intercept (i.e. mean of `y`), is fitted on `n` random subsets of the training set and evaluated on the test set. We also perform an evaluation of the model fitted on the entire training set.
We usually wish to establish whether our predictors add anything useful to our model. We should thus at least do better than a model without any predictors.
Create the baseline evaluations:
```r
gaussian_baseline <- baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  n = 100
)

gaussian_baseline$summarized_metrics
```
The `All_rows` row tells us the performance when fitting the intercept model on the full training set. It is quite close to the mean of the random evaluations.
```r
gaussian_baseline$random_evaluations
```
Plot the density of the `RMSE` scores:
```r
plot_metric_density(
  baseline = gaussian_baseline$random_evaluations,
  metric = "RMSE"
)
```
In this instance, the `All_rows` row might have been enough, as the subsets mainly add higher `RMSE` scores.
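To make the comparison with our own models concrete, a minimal sketch could pull a few baseline descriptors out of the summarized metrics (column names follow the output shown above) and set them against a cross-validated model's `RMSE`:

```r
# Sketch: baseline descriptors to compare against our own models' RMSE
gaussian_baseline$summarized_metrics %>%
  dplyr::filter(Measure %in% c("Mean", "Min", "All_rows")) %>%
  dplyr::select(Measure, RMSE)
```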
Instead of manually typing all possible model formulas for a set of fixed effects (including the possible interactions), `combine_predictors()` can do it for you (with some constraints).
When including interactions, >200k formulas have been precomputed for up to 8 fixed effects, with a maximum interaction size of 3, and a maximum of 5 fixed effects per formula. It's possible to further limit the generated formulas.
We can also append a random effects structure to the generated formulas.
```r
combine_predictors(
  dependent = "y",
  fixed_effects = c("a", "b", "c"),
  random_effects = "(1|d)"
)
```
If two or more fixed effects should not be in the same formula, like an effect and its log-transformed version, we can provide them as sublists.
```r
combine_predictors(
  dependent = "y",
  fixed_effects = list("a", list("b", "log_b")),
  random_effects = "(1|d)"
)
```