cross_validate: Cross-validate regression models for model selection


View source: R/cross_validate.R

Description

Lifecycle: stable

Cross-validate one or multiple Gaussian or binomial models at once. Supports repeated cross-validation. Results are returned in a tibble for easy comparison, reporting, and further analysis.

See cross_validate_fn() for use with custom model functions.

Usage

cross_validate(
  data,
  models,
  fold_cols = ".folds",
  family = "gaussian",
  link = NULL,
  control = NULL,
  REML = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  rm_nc = FALSE,
  parallel = FALSE,
  model_verbose = FALSE
)

Arguments

data

Data frame.

Must include a grouping factor for identifying folds, as created with groupdata2::fold().

models

Model formulas as strings. (Character)

E.g. c("y~x", "y~z").

Can contain random effects.

E.g. c("y~x+(1|r)", "y~z+(1|r)").

fold_cols

Name(s) of grouping factor(s) for identifying folds. (Character)

Include names of multiple grouping factors for repeated cross-validation.
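Repeated cross-validation can be sketched as follows. This is a minimal example, assuming the participant.scores dataset that ships with cvms; num_fold_cols is a groupdata2::fold() parameter that creates multiple fold columns named ".folds_1", ".folds_2", etc.

```r
library(cvms)
library(groupdata2)

data <- participant.scores
set.seed(7)

# Create three fold columns for 3 x 4-fold (repeated) cross-validation
data <- fold(data, k = 4,
             cat_col = 'diagnosis',
             id_col = 'participant',
             num_fold_cols = 3)

# Pass all three fold column names to cross_validate
cross_validate(data,
               models = "score~diagnosis",
               fold_cols = paste0(".folds_", 1:3),
               family = 'gaussian')
```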

family

Name of family. (Character)

Currently supports "gaussian" and "binomial".

link

Link function. (Character)

E.g. link = "log" with family = "gaussian" will use family = gaussian(link = "log").

See stats::family for available link functions.

Default link functions

Gaussian: 'identity'.

Binomial: 'logit'.

control

Control structure for mixed-effects model fitting (i.e. lmer and glmer). See lme4::lmerControl and lme4::glmerControl.

N.B. Ignored if fitting lm or glm models.

REML

Restricted Maximum Likelihood. (Logical)

cutoff

Threshold for predicted classes. (Numeric)

N.B. Binomial models only.

positive

Level of the dependent variable to predict. Either as a character string or as the level's alphabetical index (1 or 2).

E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".

Used when calculating confusion matrix metrics and creating ROC curves.

N.B. Only affects evaluation metrics, not the model training or returned predictions.

N.B. Binomial models only.
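The two ways of specifying the positive class can be sketched as below. This is schematic: the data object and the "y~x" formula are hypothetical placeholders, and the data is assumed to contain fold columns and a dependent variable with the levels "cat" and "dog".

```r
# With the alphabetically ordered levels "cat", "dog",
# these two calls are equivalent:
cross_validate(data,
               models = "y~x",
               family = 'binomial',
               positive = "dog")  # by level name

cross_validate(data,
               models = "y~x",
               family = 'binomial',
               positive = 2)      # by alphabetical level index
```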

metrics

List for enabling/disabling metrics.

E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("Accuracy" = TRUE) would add the regular accuracy metric to the classification results. Default values (TRUE/FALSE) will be used for the remaining metrics available.

Also accepts the string "all".

N.B. Currently, disabled metrics are still computed.
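A minimal sketch of tweaking the metrics argument, assuming a data frame with fold columns as in the Examples section below:

```r
# Gaussian: remove RMSE from the results
cross_validate(data,
               models = "score~diagnosis",
               family = 'gaussian',
               metrics = list("RMSE" = FALSE))

# Binomial: add the regular Accuracy metric (disabled by default)
cross_validate(data,
               models = "diagnosis~score",
               family = 'binomial',
               metrics = list("Accuracy" = TRUE))
```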

rm_nc

Remove non-converged models from output. (Logical)

parallel

Whether to cross-validate the list of models in parallel. (Logical)

Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.

model_verbose

Whether to message the name of the model function used on each iteration. (Logical)

Details

Packages used:

Models

Gaussian: stats::lm, lme4::lmer

Binomial: stats::glm, lme4::glmer

Results

Gaussian

r2m : MuMIn::r.squaredGLMM

r2c : MuMIn::r.squaredGLMM

AIC : stats::AIC

AICc : MuMIn::AICc

BIC : stats::BIC

Binomial

Confusion matrix: caret::confusionMatrix

ROC: pROC::roc

MCC: mltools::mcc

Value

A tibble with results for each model.

Shared across families

A nested tibble with coefficients of the models from all iterations.

Number of total folds.

Number of fold columns.

Count of convergence warnings. Consider discarding models that did not converge on all iterations. Note: you might still see results, but these should be taken with a grain of salt!

Count of other warnings. These are warnings without keywords such as "convergence".

Count of Singular Fit messages. See ?lme4::isSingular for more information.

Nested tibble with the warnings and messages caught for each model.

Specified family.

Specified link function.

Name of dependent variable.

Names of fixed effects.

Names of random effects, if any.

—————————————————————-

Gaussian Results

—————————————————————-

Average RMSE, MAE, r2m, r2c, AIC, AICc, and BIC of all the iterations*, omitting potential NAs from non-converged iterations. Note that the Information Criteria metrics (AIC, AICc, and BIC) are also averages.

A nested tibble with the predictions and targets.

A nested tibble with the non-averaged results from all iterations.

* In repeated cross-validation, the metrics are first averaged for each fold column (repetition) and then averaged again.

—————————————————————-

Binomial Results

—————————————————————-

Based on the collected predictions from the test folds*, a confusion matrix and a ROC curve are created to get the following:

ROC:

AUC, Lower CI, and Upper CI

Confusion Matrix:

Balanced Accuracy, F1, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Kappa, Detection Rate, Detection Prevalence, Prevalence, and MCC (Matthews correlation coefficient).

Other available metrics (disabled by default, see metrics): Accuracy.

Also includes:

A nested tibble with predictions, predicted classes (depends on cutoff), and the targets. Note that the predictions are not necessarily of the specified positive class, but of the model's positive class (second level of the dependent variable, alphabetically).

A nested tibble with the sensitivities and specificities from the ROC curve(s).

A nested tibble with the confusion matrix/matrices. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class, i.e. the level you wish to predict.

A nested tibble with the results from all fold columns, when using repeated cross-validation.

* In repeated cross-validation, an evaluation is made per fold column (repetition) and averaged.
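The nested tibbles described above can be unnested for inspection. A minimal sketch, assuming the fold-prepared data from the Examples section; the Predictions column name is inferred from the descriptions above, so check names() of the result in your installed cvms version:

```r
library(dplyr)
library(tidyr)

cv <- cross_validate(data,
                     models = "diagnosis~score",
                     family = 'binomial')

# Unnest the collected predictions and targets from the test folds
cv %>%
  select(Predictions) %>%
  unnest(cols = Predictions)
```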

Author(s)

Ludvig Renbo Olsen, [email protected]

Benjamin Hugh Zachariae

See Also

Other validation functions: cross_validate_fn(), validate()

Examples

# Attach packages
library(cvms)
library(groupdata2) # fold()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(7)

# Fold data
data <- fold(data, k = 4,
             cat_col = 'diagnosis',
             id_col = 'participant') %>%
        arrange(.folds)

# Cross-validate a single model


# Gaussian
cross_validate(data,
               models = "score~diagnosis",
               family = 'gaussian',
               REML = FALSE)

# Binomial
cross_validate(data,
               models = "diagnosis~score",
               family = 'binomial')

# Cross-validate multiple models

models <- c("score~diagnosis+(1|session)",
            "score~age+(1|session)")

cross_validate(data,
               models = models,
               family = 'gaussian',
               REML = FALSE)

# Use non-default link functions

cross_validate(data,
               models = "score~diagnosis",
               family = 'gaussian',
               link = 'log',
               REML = FALSE)

# Use parallelization


# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Create list of 20 model formulas
models <- rep(c("score~diagnosis+(1|session)",
                "score~age+(1|session)"), 10)

# Cross-validate a list of 20 model formulas in parallel
system.time({cross_validate(data,
                            models = models,
                            family = 'gaussian',
                            parallel = TRUE)})

# Cross-validate a list of 20 model formulas sequentially
system.time({cross_validate(data,
                            models = models,
                            family = 'gaussian',
                            parallel = FALSE)})

LudvigOlsen/cvms documentation built on Dec. 9, 2019, 6:02 p.m.