sdmTMB_cv: Cross validation with sdmTMB models
In sdmTMB: Spatial and Spatiotemporal SPDE-Based GLMMs with 'TMB'

sdmTMB_cv

R Documentation

Cross validation with sdmTMB models

Description

Facilitates cross validation with sdmTMB models. Returns the log likelihood of left-out data, which is similar in spirit to the ELPD (expected log pointwise predictive density). The function has an option for leave-future-out cross validation. By default, the function creates folds randomly but folds can be manually assigned via the fold_ids argument.

Usage

sdmTMB_cv(
  formula,
  data,
  mesh_args,
  mesh = NULL,
  time = NULL,
  k_folds = 8,
  fold_ids = NULL,
  lfo = FALSE,
  lfo_forecast = 1,
  lfo_validations = 5,
  parallel = TRUE,
  use_initial_fit = FALSE,
  future_globals = NULL,
  ...
)

Arguments

`formula`	Model formula.
`data`	A data frame.
`mesh_args`	Arguments for `make_mesh()`. If supplied, the mesh will be reconstructed for each fold.
`mesh`	Output from `make_mesh()`. If supplied, the mesh will be constant across folds.
`time`	The name of the time column. Leave as `NULL` if this is only spatial data.
`k_folds`	Number of folds.
`fold_ids`	Optional vector containing user fold IDs. Can also be a single string, e.g. `"fold_id"` representing the name of the variable in `data`. Ignored if `lfo` is TRUE
`lfo`	Whether to implement leave-future-out (LFO) cross validation where data are used to predict future folds. `time` argument in `sdmTMB()` must be specified. See Details section below.
`lfo_forecast`	If `lfo = TRUE`, number of time steps to forecast. Time steps 1, ..., T are used to predict T + `lfo_forecast` and the last forecasted time step is used for validation. See Details section below.
`lfo_validations`	If `lfo = TRUE`, number of times to step through the LFOCV process. Defaults to 5. See Details section below.
`parallel`	If `TRUE` and a `future::plan()` is supplied, will be run in parallel.
`use_initial_fit`	Fit the first fold and use those parameter values as starting values for subsequent folds? Can be faster with many folds.
`future_globals`	A character vector of global variables used within arguments if an error is returned that future.apply can't find an object. This vector is appended to `TRUE` and passed to the argument `future.globals` in `future.apply::future_lapply()`. Useful if global objects are used to specify arguments like priors, families, etc.
`...`	All other arguments required to run `sdmTMB()` model with the exception of `weights`, which are used to define the folds.

Details

Parallel processing

Parallel processing can be used by setting a future::plan().

For example:

library(future)
plan(multisession)
# now use sdmTMB_cv() ...

Leave-future-out cross validation (LFOCV)

An example of LFOCV with 9 time steps, lfo_forecast = 1, and lfo_validations = 2:

Fit data to time steps 1 to 7, predict and validate step 8.
Fit data to time steps 1 to 8, predict and validate step 9.

An example of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3:

Fit data to time steps 1 to 5, predict and validate step 7.
Fit data to time steps 1 to 6, predict and validate step 8.
Fit data to time steps 1 to 7, predict and validate step 9.

Note these are time steps as they are presented in order in the data. For example, in the pcod data example below steps between data points are not always one year but an lfo_forecast = 2 forecasts 2 time steps as presented not two years.

See example below.

Value

A list:

data: Original data plus columns for fold ID, CV predicted value, and CV log likelihood.
models: A list of models; one per fold.
fold_loglik: Sum of left-out log likelihoods per fold. More positive values are better.
sum_loglik: Sum of fold_loglik across all left-out data. More positive values are better.
pdHess: Logical vector: Hessian was invertible each fold?
converged: Logical: all pdHess TRUE?
max_gradients: Max gradient per fold.

Prior to sdmTMB version '0.3.0.9002', elpd was incorrectly returned as the log average likelihood, which is another metric you could compare models with, but not ELPD. For maximum likelihood, ELPD is equivalent in spirit to the sum of the log likelihoods.

Examples

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)

# Set parallel processing first if desired with the future package.
# See the Details section above.

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh, spatial = "off",
  family = tweedie(link = "log"), k_folds = 2
)

m_cv$fold_loglik
m_cv$sum_loglik

head(m_cv$data)
m_cv$models[[1]]
m_cv$max_gradients



# Create mesh each fold:
m_cv2 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh_args = list(xy_cols = c("X", "Y"), cutoff = 20),
  family = tweedie(link = "log"), k_folds = 2
)

# Use fold_ids:
m_cv3 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"),
  fold_ids = rep(seq(1, 3), nrow(pcod))[seq(1, nrow(pcod))]
)

# LFOCV:
m_lfocv <- sdmTMB_cv(
  present ~ s(year, k = 4),
  data = pcod,
  lfo = TRUE,
  lfo_forecast = 2,
  lfo_validations = 3,
  family = binomial(),
  mesh = mesh,
  spatial = "off", # fast example
  spatiotemporal = "off", # fast example
  time = "year" # must be specified
)

# See how the LFOCV folds were assigned:
fold_table <- table(m_lfocv$data$cv_fold, m_lfocv$data$year)
fold_table

sdmTMB documentation built on Aug. 8, 2025, 6:38 p.m.