borg_cv: Generate Valid Cross-Validation Scheme
In BORG: Bounded Outcome Risk Guard for Model Evaluation

borg_cv

R Documentation

Generate Valid Cross-Validation Scheme

Description

Creates cross-validation folds that respect data dependency structure. When spatial, temporal, or clustered dependencies are detected, random CV is disabled and appropriate blocking strategies are enforced.

Usage

borg_cv(
  data,
  diagnosis = NULL,
  v = 5,
  coords = NULL,
  time = NULL,
  groups = NULL,
  target = NULL,
  block_size = NULL,
  embargo = NULL,
  strategy = NULL,
  output = c("list", "rsample", "caret", "mlr3"),
  allow_random = FALSE,
  verbose = FALSE
)

Arguments

`data`	A data frame to create CV folds for.
`diagnosis`	A `BorgDiagnosis` object from `borg_diagnose`. If NULL, diagnosis is performed automatically.
`v`	Integer. Number of folds. Default: 5.
`coords`	Character vector of length 2 specifying coordinate column names. Required for spatial blocking if diagnosis is NULL.
`time`	Character string specifying the time column name. Required for temporal blocking if diagnosis is NULL.
`groups`	Character string specifying the grouping column name. Required for group CV if diagnosis is NULL.
`target`	Character string specifying the response variable column name.
`block_size`	Numeric. For spatial blocking, the minimum block size. If NULL, automatically determined from diagnosis. Should be larger than the autocorrelation range.
`embargo`	Integer. For temporal blocking, minimum gap between train and test. If NULL, automatically determined from diagnosis.
`strategy`	Character. Override the auto-detected CV strategy. Use `"temporal_expanding"` for expanding window (forward-chaining) or `"temporal_sliding"` for fixed-size sliding window. Default: NULL (auto-detect from diagnosis).
`output`	Character. Output format: "list" (default), "rsample", "caret", "mlr3".
`allow_random`	Logical. If TRUE, allows random CV even when dependencies detected. Default: FALSE. Setting to TRUE requires explicit acknowledgment.
`verbose`	Logical. If TRUE, print diagnostic messages. Default: FALSE.

Details

The Enforcement Principle

Unlike traditional CV helpers, borg_cv enforces valid evaluation:

If spatial autocorrelation is detected, random CV is disabled
If temporal autocorrelation is detected, random CV is disabled
If clustered structure is detected, random CV is disabled
To use random CV on dependent data, you must set allow_random = TRUE and provide justification (this is logged).

Spatial Blocking

When spatial dependencies are detected, data are partitioned into spatial blocks using k-means clustering on coordinates. Block size is set to exceed the estimated autocorrelation range. This ensures train and test sets are spatially separated.

Temporal Blocking

When temporal dependencies are detected, data are split chronologically with an embargo period between train and test sets. This prevents information from future observations leaking into training.

Group CV

When clustered structure is detected, entire groups (clusters) are held out together. No group appears in both train and test within a fold.

Value

Depending on output:

"list": A list with elements: folds (list of train/test index vectors), diagnosis (the BorgDiagnosis used), strategy (CV strategy name), params (parameters used).
"rsample": An rsample rset object compatible with tidymodels.
"caret": A trainControl object for caret.
"mlr3": An mlr3 Resampling object.

Examples

# Spatial data with autocorrelation
set.seed(42)
spatial_data <- data.frame(
  x = runif(200, 0, 100),
  y = runif(200, 0, 100),
  response = rnorm(200)
)

# Diagnose and create CV
cv <- borg_cv(spatial_data, coords = c("x", "y"), target = "response")
str(cv$folds)  # List of train/test indices

# Clustered data
clustered_data <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)

cv <- borg_cv(clustered_data, groups = "site", target = "value")
cv$strategy  # "group_fold"

# Get rsample-compatible output for tidymodels

cv_rsample <- borg_cv(spatial_data, coords = c("x", "y"), output = "rsample")

BORG documentation built on March 20, 2026, 5:09 p.m.