Quick Start

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)

Why Your Test Accuracy Might Be Wrong

A model shows 95% accuracy on test data, then drops to 60% in production. The usual culprit: data leakage.

Leakage happens when information from your test set contaminates training. Common causes include:

- Overlapping or duplicated rows between train and test
- Features derived from the target (target or proxy leakage)
- The same group (patient, site, species) appearing in both sets
- Test observations that predate the training data
- Preprocessing (scaling, PCA) fitted on the full dataset

BORG checks for these problems before you compute metrics.
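Preprocessing leakage is the easiest of these to create by accident. A minimal base-R sketch (no BORG functions involved): computing scaling statistics on the full dataset lets the test rows influence the training features, while the correct version reuses train-only statistics on the test set.

```r
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
train <- 1:70
test  <- 71:100

# Leaky: center and scale using statistics from ALL rows, then split
x_leaky_train <- scale(x)[train]

# Correct: compute the statistics on the training rows only
mu   <- mean(x[train])
sdev <- sd(x[train])
x_clean_train <- (x[train] - mu) / sdev
x_clean_test  <- (x[test]  - mu) / sdev  # test reuses train statistics
```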

Basic Usage

# Create sample data
set.seed(42)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = rnorm(100)
)

# Define a split
train_idx <- 1:70
test_idx <- 71:100

# Inspect the split
result <- borg_inspect(data, train_idx = train_idx, test_idx = test_idx)
result

No violations detected. But what if we made a mistake?

# Accidental overlap in indices
bad_result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
bad_result

BORG caught the overlap immediately.

The Main Entry Point: borg()

For most workflows, borg() is all you need. It handles two modes:

Mode 1: Diagnose Data Dependencies

When you have structured data (spatial coordinates, time column, or groups), BORG diagnoses dependencies and generates appropriate CV folds:

# Spatial data with coordinates
set.seed(42)
spatial_data <- data.frame(
  lon = runif(200, -10, 10),
  lat = runif(200, -10, 10),
  elevation = rnorm(200, 500, 100),
  response = rnorm(200)
)

# Let BORG diagnose and create CV folds
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result

BORG detected spatial structure and recommended spatial block CV instead of random CV.

Mode 2: Validate Existing Splits

When you have your own train/test indices, BORG validates them:

# Validate a manual split
risk <- borg(spatial_data, train_idx = 1:150, test_idx = 151:200)
risk

Visualizing Results

Use standard R plot() and summary():

# Plot the risk assessment
plot(risk)
# Generate methods text for publications
summary(result)

Data Dependency Types

BORG handles three types of data dependencies:

Spatial Autocorrelation

Points close together tend to have similar values. Random CV underestimates error because train and test points are intermixed.

result_spatial <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result_spatial$diagnosis@recommended_cv

Temporal Autocorrelation

Sequential observations are correlated. Future data must not leak into past predictions.

temporal_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
  value = cumsum(rnorm(200))
)

result_temporal <- borg(temporal_data, time = "date", target = "value")
result_temporal$diagnosis@recommended_cv

Clustered/Grouped Data

Observations within groups (patients, sites, species) are more similar than between groups.

grouped_data <- data.frame(
  site = rep(1:20, each = 10),
  measurement = rnorm(200)
)

result_grouped <- borg(grouped_data, groups = "site", target = "measurement")
result_grouped$diagnosis@recommended_cv

Risk Categories

BORG classifies risks into two categories:

Hard Violations (Evaluation Invalid)

These invalidate your results completely:

| Risk | Description |
|------|-------------|
| index_overlap | Same row in both train and test |
| duplicate_rows | Identical observations in train and test |
| target_leakage | Feature with \|r\| > 0.99 with target |
| group_leakage | Same group in train and test |
| temporal_leakage | Test data predates training data |
| preprocessing_leakage | Scaler/PCA fitted on full data |
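Rows of this table not demonstrated elsewhere in this vignette can be checked the same way. For example, a sketch of duplicate-row detection, assuming borg_inspect() compares row contents rather than just indices:

```r
# One test row is an exact copy of a training row
set.seed(42)
dup_data <- data.frame(x = rnorm(100), y = rnorm(100))
dup_data[75, ] <- dup_data[5, ]  # row 75 (test) duplicates row 5 (train)

result <- borg_inspect(dup_data, train_idx = 1:70, test_idx = 71:100)
result
```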

Soft Inflation (Results Biased)

These inflate metrics but don't completely invalidate your results:

| Risk | Description |
|------|-------------|
| proxy_leakage | Feature with \|r\| 0.95–0.99 with target |
| spatial_proximity | Test points too close to train points |
| random_cv_inflation | Random CV on dependent data |
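Proxy leakage sits just below the hard-violation threshold. A sketch following the same pattern as the target-leakage example below, with the noise level chosen so the correlation falls in the 0.95–0.99 band:

```r
set.seed(42)
proxy_data <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
# Correlation near 0.98: should be flagged as proxy_leakage, not target_leakage
proxy_data$near_proxy <- proxy_data$outcome + rnorm(100, sd = 0.2)

result <- borg_inspect(proxy_data, train_idx = 1:70, test_idx = 71:100,
                       target = "outcome")
result
```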

Detecting Specific Leakage Types

Target Leakage

Features derived from the outcome:

# Simulate target leakage
leaky_data <- data.frame(
  x = rnorm(100),
  leaked_feature = rnorm(100),  # Will be made leaky
  outcome = rnorm(100)
)
# Make leaked_feature highly correlated with outcome
leaky_data$leaked_feature <- leaky_data$outcome + rnorm(100, sd = 0.05)

result <- borg_inspect(leaky_data, train_idx = 1:70, test_idx = 71:100,
                       target = "outcome")
result

Group Leakage

Same entity in train and test:

# Simulate clinical data with patient IDs
clinical_data <- data.frame(
  patient_id = rep(1:10, each = 10),
  visit = rep(1:10, times = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients (BAD)
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

# Check for group leakage
result <- borg_inspect(clinical_data, train_idx = train_idx, test_idx = test_idx,
                       groups = "patient_id")
result

Working with CV Folds

Access the generated folds directly:

result <- borg(spatial_data, coords = c("lon", "lat"), target = "response", v = 5)

# Number of folds
length(result$folds)

# First fold's train/test sizes
cat("Fold 1 - Train:", length(result$folds[[1]]$train),
    "Test:", length(result$folds[[1]]$test), "\n")
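The fold list can also drive a hand-rolled CV loop. A sketch, assuming each fold carries $train and $test index vectors as shown above, using a plain linear model on the spatial example data:

```r
# Per-fold RMSE from the BORG-generated folds
rmse_per_fold <- vapply(result$folds, function(fold) {
  fit  <- lm(response ~ elevation, data = spatial_data[fold$train, ])
  pred <- predict(fit, newdata = spatial_data[fold$test, ])
  sqrt(mean((spatial_data$response[fold$test] - pred)^2))
}, numeric(1))

mean(rmse_per_fold)
```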

Exporting Results

For reproducibility, export validation certificates:

# Create a certificate
cert <- borg_certificate(result$diagnosis, data = spatial_data)
cert
# Export to file
borg_export(result$diagnosis, spatial_data, "validation.yaml")
borg_export(result$diagnosis, spatial_data, "validation.json")

Writing Methods Sections

summary() generates publication-ready methods paragraphs that include the statistical tests BORG ran, the dependency type detected, and the CV strategy chosen. Three citation styles are supported:

# Default APA style
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
methods_text <- summary(result)
# Nature style
summary(result, style = "nature")

# Ecology style
summary(result, style = "ecology")

The returned text is a character string you can paste directly into a manuscript. If you also ran borg_compare_cv(), pass the comparison object to include empirical inflation estimates:

comparison <- borg_compare_cv(spatial_data, response ~ lon + lat,
                              coords = c("lon", "lat"))
summary(result, comparison = comparison)

Empirical CV Comparison

When reviewers ask "does it really matter?", borg_compare_cv() runs both random and blocked CV on the same data and model, then tests whether the difference is statistically significant:

comparison <- borg_compare_cv(
  spatial_data,
  formula = response ~ lon + lat,
  coords = c("lon", "lat"),
  v = 5,
  repeats = 5  # Use more repeats in practice
)
print(comparison)
plot(comparison)

Power Analysis After Blocking

Switching from random to blocked CV reduces effective sample size. Before committing to blocked CV, check whether your dataset is large enough:

# Clustered data: 20 sites, 10 observations each
clustered_data <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)

pw <- borg_power(clustered_data, groups = "site", target = "value")
print(pw)
summary(pw)

Interface Summary

| Function | Purpose |
|----------|---------|
| borg() | Main entry point — diagnose data or validate splits |
| borg_inspect() | Detailed inspection of train/test split |
| borg_diagnose() | Analyze data dependencies only |
| borg_compare_cv() | Empirical random vs blocked CV comparison |
| borg_power() | Power analysis after blocking |
| plot() | Visualize results |
| summary() | Generate methods text for papers |
| borg_certificate() | Create validation certificate |
| borg_export() | Export certificate to YAML/JSON |



BORG documentation built on March 20, 2026, 5:09 p.m.