| borg_cv | R Documentation |
Creates cross-validation folds that respect data dependency structure. When spatial, temporal, or clustered dependencies are detected, random CV is disabled and appropriate blocking strategies are enforced.
borg_cv(
data,
diagnosis = NULL,
v = 5,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
block_size = NULL,
embargo = NULL,
strategy = NULL,
output = c("list", "rsample", "caret", "mlr3"),
allow_random = FALSE,
verbose = FALSE
)
data |
A data frame to create CV folds for. |
diagnosis |
A |
v |
Integer. Number of folds. Default: 5. |
coords |
Character vector of length 2 specifying coordinate column names. Required for spatial blocking if diagnosis is NULL. |
time |
Character string specifying the time column name. Required for temporal blocking if diagnosis is NULL. |
groups |
Character string specifying the grouping column name. Required for group CV if diagnosis is NULL. |
target |
Character string specifying the response variable column name. |
block_size |
Numeric. For spatial blocking, the minimum block size. If NULL, automatically determined from diagnosis. Should be larger than the autocorrelation range. |
embargo |
Integer. For temporal blocking, minimum gap between train and test. If NULL, automatically determined from diagnosis. |
strategy |
Character. Override the auto-detected CV strategy. Use
|
output |
Character. Output format: "list" (default), "rsample", "caret", "mlr3". |
allow_random |
Logical. If TRUE, allows random CV even when dependencies detected. Default: FALSE. Setting to TRUE requires explicit acknowledgment. |
verbose |
Logical. If TRUE, print diagnostic messages. Default: FALSE. |
Unlike traditional CV helpers, borg_cv enforces valid evaluation:
If spatial autocorrelation is detected, random CV is disabled
If temporal autocorrelation is detected, random CV is disabled
If clustered structure is detected, random CV is disabled
To use random CV on dependent data, you must set allow_random = TRUE
and provide justification (this is logged).
When spatial dependencies are detected, data are partitioned into spatial blocks using k-means clustering on coordinates. Block size is set to exceed the estimated autocorrelation range. This ensures train and test sets are spatially separated.
When temporal dependencies are detected, data are split chronologically with an embargo period between train and test sets. This prevents information from future observations leaking into training.
When clustered structure is detected, entire groups (clusters) are held out together. No group appears in both train and test within a fold.
Depending on output:
A list with elements: folds (list of train/test index vectors),
diagnosis (the BorgDiagnosis used), strategy (CV strategy name),
params (parameters used).
An rsample rset object compatible with tidymodels.
A trainControl object for caret.
An mlr3 Resampling object.
borg_diagnose, BorgDiagnosis
# Spatial data with autocorrelation
set.seed(42)
spatial_data <- data.frame(
x = runif(200, 0, 100),
y = runif(200, 0, 100),
response = rnorm(200)
)
# Diagnose and create CV
cv <- borg_cv(spatial_data, coords = c("x", "y"), target = "response")
str(cv$folds) # List of train/test indices
# Clustered data
clustered_data <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
cv <- borg_cv(clustered_data, groups = "site", target = "value")
cv$strategy # "group_fold"
# Get rsample-compatible output for tidymodels
cv_rsample <- borg_cv(spatial_data, coords = c("x", "y"), output = "rsample")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.