baseline_gaussian: Create baseline evaluations for regression models

View source: R/baseline_wrappers.R

baseline_gaussian {cvms}                                        R Documentation

Create baseline evaluations for regression models

Description

[Maturing]

Create a baseline evaluation of a test set.

In modelling, a baseline is a result that is meaningful to compare the results from our models to. In regression, we want our model to be better than a model without any predictors. If our model does not perform better than such a simple model, it's unlikely to be useful.

baseline_gaussian() fits the intercept-only model (y ~ 1) on `n` random subsets of `train_data` and evaluates each model on `test_data`. Additionally, it evaluates a model fitted on all rows in `train_data`.
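
In other words, each baseline model simply predicts the mean of the dependent variable in its training subset. A minimal sketch of the idea (using stats::lm, which is listed under Details, and the train_set/test_set objects created in the Examples section below, where "score" is the dependent variable):

fit <- lm(score ~ 1, data = train_set)
coef(fit)                          # the intercept equals mean(train_set$score)
predict(fit, newdata = test_set)   # the same constant prediction for every test row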

Usage

baseline_gaussian(
  test_data,
  train_data,
  dependent_col,
  n = 100,
  metrics = list(),
  random_effects = NULL,
  min_training_rows = 5,
  min_training_rows_left_out = 3,
  REML = FALSE,
  parallel = FALSE
)

Arguments

test_data

data.frame. The data to evaluate the baseline models on.

train_data

data.frame. The data to fit the baseline models on (in random subsets and in full).

dependent_col

Name of dependent variable in the supplied test and training sets.

n

The number of random samplings of `train_data` to fit baseline models on. (Default is 100)

metrics

list for enabling/disabling metrics.

E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("TAE" = TRUE) would add the Total Absolute Error metric to the results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, which is why, for instance, list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.

The list can be created with gaussian_metrics().

Also accepts the string "all".
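
For example, to disable all metrics and re-enable only RMSE and MAE (both documented in the Value section), the list can be built like this (a sketch; `m` is just an illustrative name):

# Disable all metrics, then re-enable RMSE and MAE
m <- list("all" = FALSE, "RMSE" = TRUE, "MAE" = TRUE)
# Pass it to baseline_gaussian() as `metrics = m`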

random_effects

Random effects structure for the baseline model. (Character)

E.g. with "(1|ID)", the model becomes "y ~ 1 + (1|ID)".
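
With a random effects structure, the baseline models are fitted with lme4::lmer instead of stats::lm (see Details). A rough equivalent of a single such fit, using the `session` column from the Examples below as an illustrative grouping factor:

# Random intercept per session; the fixed part is still only the intercept
lme4::lmer(score ~ 1 + (1 | session), data = train_set, REML = FALSE)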

min_training_rows

Minimum number of rows in the random subsets of `train_data`.

min_training_rows_left_out

Minimum number of rows left out of the random subsets of `train_data`.

I.e. a subset will maximally have the size:

max_rows_in_subset = nrow(`train_data`) - `min_training_rows_left_out`.
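
As a worked example with hypothetical numbers: if `train_data` has 21 rows and the defaults are kept, each random subset contains between 5 and 21 - 3 = 18 rows:

n_train <- 21                    # hypothetical nrow(train_data)
min_training_rows <- 5           # default
min_training_rows_left_out <- 3  # default
c(min_size = min_training_rows,
  max_size = n_train - min_training_rows_left_out)  # 5 and 18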

REML

Whether to use Restricted Maximum Likelihood. (Logical)

parallel

Whether to run the `n` evaluations in parallel. (Logical)

Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.

Details

Packages used:

Models

stats::lm, lme4::lmer

Results

r2m : MuMIn::r.squaredGLMM

r2c : MuMIn::r.squaredGLMM

AIC : stats::AIC

AICc : MuMIn::AICc

BIC : stats::BIC
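
As a rough sketch of how these package functions relate to a single baseline fit (not the internal cvms code; the model and data mirror the Examples below):

fit <- lme4::lmer(score ~ 1 + (1 | session), data = train_set, REML = FALSE)
MuMIn::r.squaredGLMM(fit)  # matrix with columns R2m (marginal) and R2c (conditional)
stats::AIC(fit)
MuMIn::AICc(fit)
stats::BIC(fit)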

Value

list containing:

  1. a tibble with summarized results (called summarized_metrics)

  2. a tibble with random evaluations (random_evaluations)

....................................................................

The Summarized Results tibble contains:

Average RMSE, MAE, NRMSE(IQR), RRSE, RAE, RMSLE.

See the additional metrics (disabled by default) at ?gaussian_metrics.

The Measure column indicates the statistical descriptor used on the evaluations. The row where Measure == All_rows is the evaluation when the baseline model is trained on all rows in `train_data`.

The Training Rows column contains the aggregated number of rows used from `train_data` when fitting the baseline models.

....................................................................

The Random Evaluations tibble contains:

The non-aggregated metrics.

A nested tibble with the predictions and targets.

A nested tibble with the coefficients of the baseline models.

Number of training rows used when fitting the baseline model on the training set.

A nested Process information object with information about the evaluation.

Name of dependent variable.

Name of fixed effect (bias term only).

Random effects structure (if specified).

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

See Also

Other baseline functions: baseline_binomial(), baseline_multinomial(), baseline()

Examples


# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(1)

# Partition data
partitions <- partition(data, p = 0.7, list_out = TRUE)
train_set <- partitions[[1]]
test_set <- partitions[[2]]

# Create baseline evaluations
# Note: usually n=100 is a good setting

baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 2
)

# Parallelize evaluations

# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Make sure to uncomment the parallel argument
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 4
  #, parallel = TRUE  # Uncomment
)
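
# Using the output (a sketch not part of the original example;
# the element names are documented in the Value section)
bsl <- baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  n = 2
)

# Aggregated metrics, incl. the Measure == "All_rows" row
bsl$summarized_metrics

# One row per random baseline fit; arrange by RMSE (uses dplyr)
# (assumes an RMSE column, which is among the default metrics)
bsl$random_evaluations %>%
  arrange(RMSE)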

