```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)
```
A model shows 95% accuracy on test data, then drops to 60% in production. The usual culprit: data leakage.
Leakage happens when information from your test set contaminates training. Common causes:
- Preprocessing (scaling, PCA) fitted on all data before splitting
- Features derived from the outcome variable
- The same patient/site appearing in both train and test
- Random CV on spatially autocorrelated data
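The first cause is easy to reproduce in a few lines of base R: when scaling parameters are estimated on the full data, test-set statistics leak into the training rows. A minimal sketch (plain R, no BORG functions involved):

```r
set.seed(1)
x <- c(rnorm(70), rnorm(30, mean = 5))  # test rows drawn from a shifted distribution
train <- 1:70
test  <- 71:100

# Leaky: mean and sd computed on ALL rows, test set included
x_leaky <- (x - mean(x)) / sd(x)

# Correct: scaling parameters estimated on the training rows only
mu <- mean(x[train])
s  <- sd(x[train])
x_clean <- (x - mu) / s

mean(x_leaky[train])  # shifted away from 0 by the test rows
mean(x_clean[train])  # ~0, as scaling should give on the training data
```

The leaky version silently centers the training data toward the test distribution, which is exactly the kind of contamination BORG's `preprocessing_leakage` check targets.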
BORG checks for these problems before you compute metrics.
```r
# Create sample data
set.seed(42)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = rnorm(100)
)

# Define a split
train_idx <- 1:70
test_idx <- 71:100

# Inspect the split
result <- borg_inspect(data, train_idx = train_idx, test_idx = test_idx)
result
```
No violations are detected here. But what if we had made a mistake?
```r
# Accidental overlap in indices
bad_result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
bad_result
```
BORG caught the overlap immediately.
## borg()

For most workflows, `borg()` is all you need. It handles two modes:
When you have structured data (spatial coordinates, time column, or groups), BORG diagnoses dependencies and generates appropriate CV folds:
```r
# Spatial data with coordinates
set.seed(42)
spatial_data <- data.frame(
  lon = runif(200, -10, 10),
  lat = runif(200, -10, 10),
  elevation = rnorm(200, 500, 100),
  response = rnorm(200)
)

# Let BORG diagnose and create CV folds
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result
```
BORG detected spatial structure and recommended spatial block CV instead of random CV.
When you have your own train/test indices, BORG validates them:
```r
# Validate a manual split
risk <- borg(spatial_data, train_idx = 1:150, test_idx = 151:200)
risk
```
Use the standard R generics `plot()` and `summary()`:
```r
# Plot the risk assessment
plot(risk)
```
```r
# Generate methods text for publications
summary(result)
```
BORG handles three types of data dependencies:
Points close together tend to have similar values. Random CV underestimates error because train and test points are intermixed.
```r
result_spatial <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result_spatial$diagnosis@recommended_cv
```
Sequential observations are correlated. Future data must not leak into past predictions.
```r
temporal_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
  value = cumsum(rnorm(200))
)
result_temporal <- borg(temporal_data, time = "date", target = "value")
result_temporal$diagnosis@recommended_cv
```
Observations within groups (patients, sites, species) are more similar than between groups.
```r
grouped_data <- data.frame(
  site = rep(1:20, each = 10),
  measurement = rnorm(200)
)
result_grouped <- borg(grouped_data, groups = "site", target = "measurement")
result_grouped$diagnosis@recommended_cv
```
BORG classifies risks into two categories:
These invalidate your results completely:
| Risk | Description |
|------|-------------|
| index_overlap | Same row in both train and test |
| duplicate_rows | Identical observations in train and test |
| target_leakage | Feature with \|r\| > 0.99 with the target |
| group_leakage | Same group in train and test |
| temporal_leakage | Test data predates training data |
| preprocessing_leakage | Scaler/PCA fitted on full data |
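Temporal leakage, for example, shows up when a manual split puts earlier observations in the test set. A hypothetical sketch, assuming `borg_inspect()` accepts a `time` argument the way `borg()` does:

```r
# Time series data
ts_data <- data.frame(
  date  = seq(as.Date("2021-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Train on days 31-100, "test" on days 1-30: the past sits in the test set
result <- borg_inspect(ts_data, train_idx = 31:100, test_idx = 1:30,
                       time = "date")
result
```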
These inflate metrics but do not completely invalidate your results:
| Risk | Description |
|------|-------------|
| proxy_leakage | Feature with \|r\| between 0.95 and 0.99 with the target |
| spatial_proximity | Test points too close to train |
| random_cv_inflation | Random CV on dependent data |
Features derived from the outcome:
```r
# Simulate target leakage
leaky_data <- data.frame(
  x = rnorm(100),
  leaked_feature = rnorm(100),  # Will be made leaky below
  outcome = rnorm(100)
)

# Make leaked_feature highly correlated with outcome
leaky_data$leaked_feature <- leaky_data$outcome + rnorm(100, sd = 0.05)

result <- borg_inspect(leaky_data,
                       train_idx = 1:70, test_idx = 71:100,
                       target = "outcome")
result
```
The same entity appearing in both train and test:
```r
# Simulate clinical data with patient IDs
clinical_data <- data.frame(
  patient_id = rep(1:10, each = 10),
  visit = rep(1:10, times = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients (BAD)
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

# Check for group leakage
result <- borg_inspect(clinical_data,
                       train_idx = train_idx, test_idx = test_idx,
                       groups = "patient_id")
result
```
Access the generated folds directly:
```r
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response", v = 5)

# Number of folds
length(result$folds)

# First fold's train/test sizes
cat("Fold 1 - Train:", length(result$folds[[1]]$train),
    "Test:", length(result$folds[[1]]$test), "\n")
```
For reproducibility, export validation certificates:
```r
# Create a certificate
cert <- borg_certificate(result$diagnosis, data = spatial_data)
cert
```
```r
# Export to file
borg_export(result$diagnosis, spatial_data, "validation.yaml")
borg_export(result$diagnosis, spatial_data, "validation.json")
```
`summary()` generates publication-ready methods paragraphs that include the statistical tests BORG ran, the dependency type detected, and the CV strategy chosen. Three citation styles are supported:
```r
# Default APA style
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
methods_text <- summary(result)
```
```r
# Nature style
summary(result, style = "nature")

# Ecology style
summary(result, style = "ecology")
```
The returned text is a character string you can paste directly into a manuscript.
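Because it is an ordinary character string, you can also archive it alongside the analysis; for instance (the file name here is illustrative):

```r
# Save the generated methods paragraph next to the analysis outputs
writeLines(methods_text, "methods.txt")
```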
If you also ran borg_compare_cv(), pass the comparison object to include
empirical inflation estimates:
```r
comparison <- borg_compare_cv(spatial_data, response ~ lon + lat,
                              coords = c("lon", "lat"))
summary(result, comparison = comparison)
```
When reviewers ask "does it really matter?", borg_compare_cv() runs both
random and blocked CV on the same data and model, then tests whether the
difference is statistically significant:
```r
comparison <- borg_compare_cv(
  spatial_data,
  formula = response ~ lon + lat,
  coords = c("lon", "lat"),
  v = 5,
  repeats = 5  # Use more repeats in practice
)
print(comparison)
```
```r
plot(comparison)
```
Switching from random to blocked CV reduces effective sample size. Before committing to blocked CV, check whether your dataset is large enough:
```r
# Clustered data: 20 sites, 10 observations each
clustered_data <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)

pw <- borg_power(clustered_data, groups = "site", target = "value")
print(pw)
summary(pw)
```
| Function | Purpose |
|----------|---------|
| borg() | Main entry point — diagnose data or validate splits |
| borg_inspect() | Detailed inspection of train/test split |
| borg_diagnose() | Analyze data dependencies only |
| borg_compare_cv() | Empirical random vs blocked CV comparison |
| borg_power() | Power analysis after blocking |
| plot() | Visualize results |
| summary() | Generate methods text for papers |
| borg_certificate() | Create validation certificate |
| borg_export() | Export certificate to YAML/JSON |
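Putting the pieces together, a typical session chains these functions end to end. A sketch built only from the documented calls above (the model-fitting step is up to you; BORG supplies the indices, not the model):

```r
library(BORG)

# 1. Diagnose dependencies and generate blocked CV folds
result <- borg(spatial_data, coords = c("lon", "lat"),
               target = "response", v = 5)

# 2. Fit your model on each fold using result$folds[[i]]$train / $test

# 3. Document the validation: methods text plus an exported certificate
methods_text <- summary(result)
borg_export(result$diagnosis, spatial_data, "validation.yaml")
```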
- `vignette("risk-taxonomy")`: Complete catalog of detectable risks
- `vignette("frameworks")`: Integration with caret, tidymodels, mlr3