Risk Taxonomy

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)

This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.

Risk Classification

BORG classifies risks into two categories based on their impact on evaluation validity:

| Category | Impact | BORG Response |
|----------|--------|---------------|
| Hard Violation | Results are invalid | Blocks evaluation, requires fix |
| Soft Inflation | Results are biased | Warns, allows with caution |

Hard Violations

These make your evaluation results invalid. Any metrics computed with these violations are unreliable.

1. Index Overlap

What: Same row indices appear in both training and test sets.

Why it matters: The model has seen the exact data it's being tested on. This is the most basic form of leakage.

Detection: Set intersection of train_idx and test_idx.

data <- data.frame(x = 1:100, y = rnorm(100))

# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result

Fix: Ensure indices are mutually exclusive. Use setdiff() to create non-overlapping sets.
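A minimal base-R sketch of this fix (illustrative, not part of BORG): draw the training indices first, then derive the test set with setdiff() so overlap is impossible by construction.

```r
# Build mutually exclusive train/test indices
set.seed(1)
n <- 100
train_idx <- sample(n, 70)
test_idx  <- setdiff(seq_len(n), train_idx)  # everything not in train

# Verify: no shared indices, and together they cover all rows
stopifnot(length(intersect(train_idx, test_idx)) == 0)
stopifnot(length(train_idx) + length(test_idx) == n)
```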

2. Duplicate Rows

What: Test set contains rows identical to training rows.

Why it matters: Model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.

Detection: Row hashing and comparison (C++ backend for numeric data).

# Data with duplicate rows
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)  # Duplicates
)

result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result

Fix: Remove duplicate rows before splitting, or ensure splits respect duplicates (keep all copies in same set).
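Both remedies can be sketched in base R (illustrative only): duplicated() drops repeats before splitting, while a row key lets you group identical rows so all copies land in the same split.

```r
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)
)

# Option 1: deduplicate before splitting
clean <- dup_data[!duplicated(dup_data), ]
nrow(clean)  # 5 unique rows remain

# Option 2: assign identical rows a shared group id, then split by group
key <- do.call(paste, c(dup_data, sep = "\r"))
groups <- match(key, unique(key))  # identical rows share an id
```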

3. Preprocessing Leakage

What: Normalization, imputation, or dimensionality reduction fitted on full data before splitting.

Why it matters: Test set statistics influenced the preprocessing parameters applied to training data. Information flows backwards from test to train.

Detection: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.
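The detection idea can be sketched with scale() attributes (a simplified illustration, not BORG's actual implementation; the tolerance below is an assumption): parameters fitted on all rows will not match statistics recomputed from the training rows alone.

```r
# Compare stored scaling parameters against train-only statistics
set.seed(7)
data <- data.frame(x = rnorm(100), y = rnorm(100))
train_idx <- 1:70

scaled <- scale(as.matrix(data))            # fitted on ALL rows (leaky)
stored_center <- attr(scaled, "scaled:center")
train_center  <- colMeans(data[train_idx, ])

# A non-trivial discrepancy means the parameters saw test rows
discrepancy <- max(abs(stored_center - train_center))
leak_suspected <- discrepancy > 1e-8
```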

Supported objects:

| Object Type | Parameters Checked |
|-------------|--------------------|
| caret::preProcess | $mean, $std |
| recipes::recipe | Step parameters after prep() |
| prcomp | $center, $scale, rotation matrix |
| scale() attributes | center, scale |

# BAD: Scale fitted on all data
scaled_data <- scale(data)  # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]

# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)

Fix: Fit preprocessing on training data only, then apply to test:

train_data <- data[1:70, ]
test_data <- data[71:100, ]

# Fit on train
means <- colMeans(train_data)
sds <- apply(train_data, 2, sd)

# Apply to both
train_scaled <- scale(train_data, center = means, scale = sds)
test_scaled <- scale(test_data, center = means, scale = sds)

4. Target Leakage (Direct)

What: Feature has absolute correlation > 0.99 with target.

Why it matters: The feature is almost certainly derived from the outcome. Example: days_since_diagnosis when predicting has_disease.

Detection: Compute Pearson correlation of each numeric feature with target on training data.

# Simulate target leakage
leaky <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01)  # Near-perfect correlation

result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result

Fix: Remove or investigate the leaky feature. If it's a legitimate predictor, document why correlation > 0.99 is expected.
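As an illustration of this screen, a small helper (screen_leaky is hypothetical, not a BORG function) can compute train-only correlations and name any feature above the 0.99 threshold:

```r
# Flag numeric features whose absolute train-set correlation with the
# target exceeds a threshold (0.99 mirrors the hard-violation rule)
screen_leaky <- function(df, target, idx, threshold = 0.99) {
  num_cols <- setdiff(names(df)[sapply(df, is.numeric)], target)
  cors <- sapply(num_cols, function(col)
    abs(cor(df[idx, col], df[idx, target], use = "complete.obs")))
  names(cors)[cors > threshold]
}

set.seed(1)
leaky <- data.frame(x = rnorm(100), outcome = rnorm(100))
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01)

screen_leaky(leaky, "outcome", 1:70)  # flags "leaked"
```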

5. Group Leakage

What: Same group (patient, site, species) appears in both train and test.

Why it matters: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won't exist for new patients.

Detection: Set intersection of group membership values.

# Clinical data with patient IDs
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
                       groups = "patient_id")
result

Fix: Use group-aware splitting:

# Split at the patient level
train_patients <- sample(unique(clinical$patient_id), 7)
train_idx <- which(clinical$patient_id %in% train_patients)
test_idx <- which(!clinical$patient_id %in% train_patients)

6. Temporal Ordering Violation

What: Test observations predate training observations.

Why it matters: Model uses future information to predict the past. In deployment, future data won't be available.

Detection: Compare max training timestamp to min test timestamp.

# Time series data
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]

result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
                       time = "date")
result

Fix: Use chronological splits where all test data comes after training:

train_idx <- 1:70
test_idx <- 71:100
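A quick sanity check of the fix (illustrative base R): the latest training timestamp must precede the earliest test timestamp.

```r
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)
train_idx <- 1:70
test_idx  <- 71:100

# Every test observation comes after all training observations
ok <- max(ts_data$date[train_idx]) < min(ts_data$date[test_idx])
stopifnot(ok)
```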

7. CV Fold Contamination

What: Cross-validation folds contain test indices, or folds overlap incorrectly.

Why it matters: Nested CV requires the outer test set to be completely held out from all inner training.

Detection: Check if any fold's training indices intersect with held-out test set.
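The check can be sketched as follows, assuming folds are stored as a named list of training indices (a common convention, e.g. caret's index slot); fold names and the leaked index are illustrative:

```r
# Flag folds whose training rows intersect the held-out test set
test_idx <- 81:100
folds <- list(
  fold1 = 1:60,
  fold2 = c(21:80, 85)   # index 85 leaks from the held-out test set
)

contaminated <- names(Filter(
  function(train_rows) length(intersect(train_rows, test_idx)) > 0,
  folds
))
contaminated  # "fold2"
```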

Supported objects:

8. Model Scope

What: Model was trained on more rows than claimed training set.

Why it matters: Model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on full data).

Detection: Compare nrow(trainingData) or length(fitted.values) to length(train_idx).

Supported objects: lm, glm, ranger, caret::train, parsnip models, workflows.
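For lm/glm the idea reduces to a row count on the stored model frame, as in this sketch (scope_ok is a hypothetical helper, not BORG's API):

```r
# Compare the rows a fitted model actually saw with the claimed train set
set.seed(2)
data <- data.frame(x = rnorm(100))
data$y <- 2 * data$x + rnorm(100)
train_idx <- 1:70

fit_ok  <- lm(y ~ x, data = data[train_idx, ])
fit_bad <- lm(y ~ x, data = data)        # trained on everything

scope_ok <- function(fit, train_idx) nrow(fit$model) == length(train_idx)
scope_ok(fit_ok, train_idx)   # TRUE
scope_ok(fit_bad, train_idx)  # FALSE
```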

Soft Inflation Risks

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.

1. Target Leakage (Proxy)

What: Feature has absolute correlation between 0.95 and 0.99 with the target.

Why a warning rather than an error: The feature may be a legitimate strong predictor; judging it requires domain knowledge.

Detection: Same as direct leakage, different threshold.

# Strong but not extreme correlation
proxy <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3)  # r ~ 0.96

result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result

Action: Review whether the feature should be available at prediction time in production.

2. Spatial Proximity

What: Test points are very close to training points in geographic space.

Why it matters: Spatial autocorrelation means nearby points share variance. Model learns local patterns that don't generalize to distant locations.

Detection: Compute minimum distance from each test point to nearest training point. Flag if < 1% of spatial spread.
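Under the hood, the rule can be sketched in base R (an illustration of the mechanism, not BORG's implementation):

```r
# Minimum test-to-train distance relative to the overall spatial spread
set.seed(42)
spatial <- data.frame(lon = runif(100, 0, 100), lat = runif(100, 0, 100))
train_idx <- sample(100, 70)
test_idx  <- setdiff(1:100, train_idx)

tr <- as.matrix(spatial[train_idx, ])
te <- as.matrix(spatial[test_idx, ])

# Distance from each test point to its nearest training point
nn_dist <- apply(te, 1, function(p)
  min(sqrt(rowSums(sweep(tr, 2, p)^2))))

spread <- max(dist(rbind(tr, te)))   # overall spatial extent
too_close <- nn_dist < 0.01 * spread # the 1% threshold from the text
mean(too_close)                      # fraction of suspect test points
```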

set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)

result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
                       coords = c("lon", "lat"))
result

Fix: Use spatial blocking:

# Geographic split
train_idx <- which(spatial$lon < 50)  # West
test_idx <- which(spatial$lon >= 50)  # East

3. Spatial Overlap

What: Test region falls inside training region's convex hull.

Why it matters: Interpolation is easier than extrapolation. Model performance on "surrounded" test points overestimates performance on truly new regions.

Detection: Compute convex hull of training points, count test points inside.

Threshold: Warning if > 50% of test points fall inside training hull.
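The hull check can be sketched with chull() plus a standard ray-casting point-in-polygon test (base R; the in_poly helper and the simulated regions are illustrative, not BORG's implementation):

```r
# Ray casting: does point (px, py) fall inside polygon (vx, vy)?
in_poly <- function(px, py, vx, vy) {
  n <- length(vx); inside <- FALSE; j <- n
  for (i in seq_len(n)) {
    if (((vy[i] > py) != (vy[j] > py)) &&
        (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]))
      inside <- !inside
    j <- i
  }
  inside
}

set.seed(1)
train <- data.frame(x = runif(50, 0, 10), y = runif(50, 0, 10))
test  <- data.frame(x = runif(20, 2, 8),  y = runif(20, 2, 8))  # interior region

hull <- chull(train$x, train$y)  # indices of hull vertices, in order
inside <- mapply(in_poly, test$x, test$y,
                 MoreArgs = list(vx = train$x[hull], vy = train$y[hull]))
frac_inside <- mean(inside)      # compare against the 50% threshold
```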

4. Random CV on Dependent Data

What: Using random k-fold CV when data has spatial, temporal, or group structure.

Why it matters: Random folds break dependencies artificially, leading to optimistic error estimates.

# Diagnose data dependencies
spatial <- data.frame(
  lon = runif(200, 0, 100),
  lat = runif(200, 0, 100),
  response = rnorm(200)
)

diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
                           verbose = FALSE)
diagnosis@recommended_cv

Fix: Use borg() to generate appropriate blocked CV folds.

Quick Reference

| Risk Type | Severity | Detection Method | Fix |
|-----------|----------|------------------|-----|
| index_overlap | Hard | Index intersection | Use setdiff() |
| duplicate_rows | Hard | Row hashing | Deduplicate or group |
| preprocessing_leak | Hard | Parameter comparison | Fit on train only |
| target_leakage | Hard | Correlation > 0.99 | Remove feature |
| group_leakage | Hard | Group intersection | Group-aware split |
| temporal_leak | Hard | Timestamp comparison | Chronological split |
| cv_contamination | Hard | Fold index check | Rebuild folds |
| model_scope | Hard | Row count | Refit on train only |
| proxy_leakage | Soft | Correlation 0.95-0.99 | Domain review |
| spatial_proximity | Soft | Distance check | Spatial blocking |
| spatial_overlap | Soft | Convex hull | Geographic split |

Accessing Risk Details

# Create result with violations
result <- borg_inspect(
  data.frame(x = 1:100, y = rnorm(100)),
  train_idx = 1:60,
  test_idx = 51:100
)

# Summary
cat("Valid:", result@is_valid, "\n")
cat("Hard violations:", result@n_hard, "\n")
cat("Soft warnings:", result@n_soft, "\n")

# Individual risks
for (risk in result@risks) {
  cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
  cat("  ", risk$description, "\n")
  if (!is.null(risk$affected)) {
    cat("  Affected:", head(risk$affected, 5), "...\n")
  }
}

# Tabular format
as.data.frame(result)

See Also




BORG documentation built on March 20, 2026, 5:09 p.m.