```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  eval = TRUE
)
```
bioLeak is a leakage-aware modeling toolkit for biomedical and machine-learning analyses. Its purpose is to prevent and diagnose information leakage across resampling workflows where training and evaluation data are not truly independent because samples share subjects, batches, studies, or time.
Standard workflows are often insufficient. Random, row-wise cross-validation assumes samples are independent. Global preprocessing (imputation, scaling, feature selection) done before resampling lets test-fold information shape the training process. These choices inflate performance and can lead to incorrect biomarker discovery, misleading clinical signals, or models that fail in deployment.
Data leakage means any direct or indirect use of evaluation data in training or feature engineering. In biomedical data, leakage commonly appears as:

- repeated samples from the same subject split across training and test folds
- batch or study effects that align with fold assignment
- preprocessing statistics (imputation, scaling, feature selection) computed on the full dataset
- lookahead features that encode future information in time-series data
bioLeak addresses these issues with leakage-aware splitting, guarded preprocessing that is fit only on training data, and audit diagnostics that surface overlaps, confounding, and duplicates.
The sections below walk through a leakage-aware workflow from data setup to audits. Each step includes a leaky failure mode and a corrected alternative.
```r
library(bioLeak)

set.seed(123)
n <- 160
subject <- rep(seq_len(40), each = 4)
batch <- sample(paste0("B", 1:6), n, replace = TRUE)
study <- sample(paste0("S", 1:4), n, replace = TRUE)
time <- seq_len(n)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
linpred <- 0.7 * x1 - 0.4 * x2 + 0.2 * x3 + rnorm(n, sd = 0.5)
p <- stats::plogis(linpred)
outcome <- factor(ifelse(runif(n) < p, "case", "control"), levels = c("control", "case"))

df <- data.frame(
  subject = subject,
  batch = batch,
  study = study,
  time = time,
  outcome = outcome,
  x1 = x1,
  x2 = x2,
  x3 = x3
)

df_leaky <- within(df, {
  leak_subject <- ave(as.numeric(outcome == "case"), subject, FUN = mean)
  leak_batch <- ave(as.numeric(outcome == "case"), batch, FUN = mean)
  leak_global <- mean(as.numeric(outcome == "case"))
})

df_time <- df
df_time$leak_future <- c(as.numeric(df_time$outcome == "case")[-1], 0)

predictors <- c("x1", "x2", "x3")

# Example data (first 6 rows)
head(df)

# Outcome class counts
as.data.frame(table(df$outcome))
```
The table preview displays the metadata columns (subject, batch, study, time), the binary outcome, and three numeric predictors (x1, x2, x3).
The class count table shows the baseline prevalence. Stratified splits try to preserve these proportions (for grouped modes, stratification is applied at the group level).
make_split_plan()

make_split_plan() is the foundation. It returns a LeakSplits object with explicit train/test indices (or compact fold assignments) and metadata. For data.frame inputs, a unique row_id column is added automatically, so group = "row_id" is the explicit way to request sample-wise CV. It assumes that the grouping columns you provide are complete and that samples sharing a group must not cross folds. Grouped stratification uses the majority class per group and is ignored for study_loocv, time_series, and survival outcomes.
Misuse to avoid:

- group = "row_id" (sample-wise CV) when subjects repeat
- assuming time_series will always return exactly v folds

Leaky example: row-wise CV when subjects repeat
```r
leaky_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "row_id",
  v = 5,
  repeats = 1,
  stratify = TRUE
)
cat("Leaky splits summary (sample-wise CV):\n")
leaky_splits
```
The printed LeakSplits summary reports the split mode, number of folds, and fold-level training and test set sizes.
Because group = "row_id", each sample is treated as its own group. As a result, repeated samples from the same subject may appear in both the training and test sets, which introduces information leakage.
Leakage-safe alternative: group by subject
```r
safe_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 5,
  repeats = 1,
  stratify = TRUE,
  seed = 10
)
cat("Leakage-safe splits summary (subject-grouped CV):\n")
safe_splits
```
Here, each subject is confined to a single fold. Test set sizes are multiples of the number of samples per subject, confirming that subjects were not split across folds.
The fold sizes remain comparable because stratify = TRUE balances outcome
proportions across folds while respecting subject-level boundaries.
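To double-check the invariant yourself, you can intersect subject IDs across each fold's train and test indices. This is a minimal sketch, assuming the `safe_splits@indices` structure (a list of folds with `$train` and `$test` integer vectors) that is also used later in this vignette:

```r
# Count subjects appearing in both train and test of each fold (expect all 0)
fold_overlaps <- vapply(safe_splits@indices, function(fold) {
  length(intersect(df$subject[fold$train], df$subject[fold$test]))
}, integer(1))
fold_overlaps
```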
Other leakage-aware modes
```r
batch_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "batch_blocked",
  batch = "batch",
  v = 4,
  stratify = TRUE
)
cat("Batch-blocked splits summary:\n")
batch_splits

study_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "study_loocv",
  study = "study"
)
cat("Study leave-one-out splits summary:\n")
study_splits

time_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "time_series",
  time = "time",
  v = 4,
  horizon = 2
)
cat("Time-series splits summary:\n")
time_splits

nested_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 3,
  nested = TRUE,
  stratify = TRUE
)
cat("Nested CV splits summary:\n")
nested_splits
```
RNG seed lineage
bioLeak uses offset-based seeding so that every sub-operation gets a deterministic, distinct seed without requiring isolated RNG streams:
| Function | Base seed | Per-unit offset |
|---|---|---|
| make_split_plan() | set.seed(seed) | repeats: seed + 1000 * r; nested: seed + 1 |
| fit_resample() | set.seed(seed) | per-fold: seed + fold_id |
| audit_leakage() | set.seed(seed) | per-permutation: seed + b |
These are not isolated RNG streams but simple offsets. Collisions do not
occur for practical parameter values (e.g., repeats < 1000, v < 1000).
Always set an explicit seed argument for reproducibility; strict mode warns
when seed is NULL or NA.
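As a plain base-R illustration of the offset scheme above (not package internals), derived seeds stay deterministic and distinct for practical parameter values:

```r
# Illustrative only: how offset-based seeds derive from one base seed
seed <- 42
repeat_seeds <- seed + 1000 * seq_len(3)  # per-repeat seeds in make_split_plan()
fold_seeds   <- seed + seq_len(5)         # per-fold seeds in fit_resample()
set.seed(fold_seeds[2])                   # deterministic RNG state for fold 2
runif(1)
```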
Each summary reports the number of folds and the fold sizes for the corresponding split strategy.
Batch & Study: The test set sizes vary because batches/studies are not evenly sized. study_loocv uses one study per fold, so v and repeats are ignored.
Time series: The training set size increases across folds while the test set size follows the block size. Early blocks with no prior training data (or too-small test windows) are skipped, so you may see fewer than v folds.
Combined (N-axis) mode
When leakage can occur along multiple axes simultaneously (for example, both
subject and batch), mode = "combined" handles multi-axis constraint splitting.
The first element in constraints is the primary fold driver (determines
which groups go to test); subsequent elements are exclusion constraints
(training samples sharing constraint-axis levels with the test set are removed).
```r
# For combined mode, constraint-axis levels should not span all folds.
# Here each subject belongs to exactly one site, so site exclusion only
# removes training samples from the same site as the test subjects.
df_comb <- df
df_comb$site <- paste0("site", rep(1:8, each = 5)[df_comb$subject])

combined_splits <- make_split_plan(
  df_comb,
  outcome = "outcome",
  mode = "combined",
  constraints = list(
    list(type = "subject", col = "subject"),
    list(type = "batch", col = "site")
  ),
  v = 4,
  stratify = TRUE,
  seed = 42
)
cat("Combined (N-axis) splits summary:\n")
combined_splits
```
The fold sizes may be smaller than single-axis modes because training samples
are excluded whenever their site (or any constraint-axis level) also appears in
the test set. You can add more than two axes by appending additional elements to
constraints.
The legacy primary_axis/secondary_axis parameters are still accepted for
backward compatibility but constraints supersedes them and supports arbitrary
numbers of axes.
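For example, a third axis can be appended as another list element. The sketch below is illustrative; the scanner column is hypothetical:

```r
# Hypothetical three-axis constraint specification (scanner column assumed)
constraints3 <- list(
  list(type = "subject", col = "subject"),  # primary fold driver
  list(type = "batch",   col = "site"),     # exclusion constraint
  list(type = "batch",   col = "scanner")   # additional exclusion constraint
)
# make_split_plan(df_comb, outcome = "outcome", mode = "combined",
#                 constraints = constraints3, v = 4, seed = 42)
```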
```r
# Verify zero overlap on all constraint axes
overlap_result <- check_split_overlap(combined_splits)
overlap_result
```
Assumptions and intent by mode
- subject_grouped: keeps all samples from a subject together
- batch_blocked: keeps batches/centers together (leave-one-group-out when v is at least the number of batches; otherwise multiple batches per fold)
- study_loocv: holds out each study in turn for external validation
- time_series: rolling-origin splits with an optional horizon to prevent lookahead
- combined: primary axis drives fold assignment; all additional constraint axes are enforced as exclusion constraints on the training set

Use repeats for repeated CV in grouped/batch modes (it is ignored for study_loocv and time_series), stratify = TRUE to balance outcome proportions at the group level when applicable, and nested = TRUE to attach inner folds (one repeat). For large datasets, progress = TRUE reports progress; storing explicit indices can be memory intensive.
check_split_overlap()

check_split_overlap() verifies that no grouping-column levels appear in both the training and test sets of any fold. It returns a data.frame with one row per fold-column combination showing the overlap count and a pass/fail flag.
```r
# Run on the subject-grouped splits from earlier
overlap_safe <- check_split_overlap(safe_splits)
overlap_safe
```
When cols is not supplied, the function auto-infers the relevant columns from
the split mode (e.g., subject for subject_grouped, all constraint axes for
combined). By default stop_on_fail = TRUE, so the function raises a
bioLeak_overlap_error when any overlap is detected. Set stop_on_fail = FALSE
to return the results without stopping.
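A minimal sketch of handling the failure path, assuming the error is signalled with the bioLeak_overlap_error condition class named above:

```r
# Catch the classed overlap error instead of aborting the script
res <- tryCatch(
  check_split_overlap(leaky_splits, cols = "subject"),
  bioLeak_overlap_error = function(e) {
    message("Overlap detected: ", conditionMessage(e))
    NULL
  }
)
```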
When strict mode is active (options(bioLeak.strict = TRUE)),
make_split_plan() automatically runs check_split_overlap() after building
non-compact splits, so you do not need to call it manually.
Handling Large Datasets (Compact Mode)
For large datasets (e.g., $N > 50,000$) with many repeats, storing explicit
integer indices for every fold can consume gigabytes of memory. Use compact = TRUE
to store a lightweight vector of fold assignments instead. fit_resample() will
automatically reconstruct the indices on the fly.
Compact mode is not supported when nested = TRUE (it falls back to full
indices). For time_series compact splits, the time column must be present in
the stored split metadata so folds can be reconstructed.
```r
# Efficient storage for large N
big_splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 5,
  compact = TRUE  # <--- Saves memory
)
cat("Compact-mode splits summary:\n")
big_splits
cat(sprintf("Compact storage enabled: %s\n", big_splits@info$compact))
```
The summary is identical to a regular split, but the underlying storage is a
compact fold-assignment vector. Use the compact flag to confirm the memory-
saving mode is active.
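A quick way to see the effect, using base R's object.size() (absolute sizes will vary with the dataset):

```r
# Compare storage footprints of explicit vs compact splits
format(object.size(safe_splits), units = "Kb")
format(object.size(big_splits), units = "Kb")
```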
Strict mode

Setting options(bioLeak.strict = TRUE) activates a global safety switch that tightens validation across the entire workflow. In strict mode:
Trained recipe/workflow detection escalates to an error. Normally,
passing a pre-trained recipe or workflow to fit_resample() or
tune_resample() produces a warning. In strict mode, it raises an error so
fold-global leakage cannot proceed silently.
Seed warnings. When seed is NULL or NA in make_split_plan(),
fit_resample(), or tune_resample(), strict mode emits a
bioLeak_validation_warning reminding you to set an explicit seed for
reproducibility.
Combined + nested warning. Using nested = TRUE with
mode = "combined" may produce empty inner folds; strict mode warns about
this before proceeding.
Automatic overlap check. After building non-compact splits,
make_split_plan() auto-runs check_split_overlap() to verify that no
grouping-column levels cross fold boundaries.
```r
# Enable strict mode temporarily
withr::with_options(list(bioLeak.strict = TRUE), {
  strict_splits <- make_split_plan(
    df,
    outcome = "outcome",
    mode = "subject_grouped",
    group = "subject",
    v = 5,
    stratify = TRUE,
    seed = 42
  )
  cat("Strict mode splits completed without error.\n")
})
```
```r
# Strict mode catches a pre-trained recipe
if (requireNamespace("recipes", quietly = TRUE)) {
  rec <- recipes::recipe(outcome ~ ., data = df[, c("outcome", predictors)]) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::prep(training = df[, c("outcome", predictors)])

  withr::with_options(list(bioLeak.strict = TRUE), {
    tryCatch(
      fit_resample(
        df,
        outcome = "outcome",
        splits = safe_splits,
        learner = "glmnet",
        preprocess = rec
      ),
      error = function(e) cat("Strict mode error:", conditionMessage(e), "\n")
    )
  })
}
```
Strict mode is off by default. Use withr::with_options() for temporary
activation or set options(bioLeak.strict = TRUE) in your .Rprofile for
project-wide enforcement.
Guarded preprocessing

bioLeak uses guarded preprocessing to prevent leakage from global imputation,
scaling, and feature selection. The low-level API is:
- .guard_fit() to fit preprocessing on training data only
- predict_guard() to apply the trained guard to new data
- .guard_ensure_levels() to align factor levels across train/test
- impute_guarded() as a convenience wrapper for leakage-safe imputation

.guard_fit() fits a pipeline that can winsorize, impute, normalize, filter, and select features. It one-hot encodes non-numeric columns and carries factor levels forward to new data. Assumptions and misuse to avoid:

- fs = "ttest" requires a binary outcome with enough samples per class
- fs = "lasso" requires glmnet
- impute = "mice" is not supported for guarded prediction
- impute = "none" with missing values will trigger a median fallback and missingness indicators to avoid errors

Supported preprocessing options include imputation (median, knn,
missForest, none), normalization (zscore, robust, none), filtering by
variance or IQR (optionally min_keep), and feature selection (ttest,
lasso, pca). Winsorization is controlled via impute$winsor and
impute$winsor_k in guarded steps; in impute_guarded(), use winsor and
winsor_thresh. impute_guarded() returns a LeakImpute object with guarded
data, diagnostics, and the fitted guard state.
Leaky example: global scaling before CV
```r
df_leaky_scaled <- df
df_leaky_scaled[predictors] <- scale(df_leaky_scaled[predictors])

scaled_summary <- data.frame(
  feature = predictors,
  mean = colMeans(df_leaky_scaled[predictors]),
  sd = apply(df_leaky_scaled[predictors], 2, stats::sd)
)
scaled_summary$mean <- round(scaled_summary$mean, 3)
scaled_summary$sd <- round(scaled_summary$sd, 3)

# Leaky global scaling: means ~0 and SDs ~1 computed on all samples
scaled_summary
```
The summary shows that scaling used the full dataset, so test-fold statistics influenced the training transformation. This violates the train/test separation and can bias performance estimates.
Leakage-safe alternative: fit preprocessing on training only
```r
fold1 <- safe_splits@indices[[1]]
train_x <- df[fold1$train, predictors]
test_x <- df[fold1$test, predictors]

guard <- .guard_fit(
  X = train_x,
  y = df$outcome[fold1$train],
  steps = list(
    impute = list(method = "median", winsor = TRUE),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0, iqr_thresh = 0),
    fs = list(method = "none")
  ),
  task = "binomial"
)

train_x_guarded <- predict_guard(guard, train_x)
test_x_guarded <- predict_guard(guard, test_x)

cat("GuardFit object:\n")
guard

cat("\nGuardFit summary:\n")
summary(guard)

# Guarded training data (first 6 rows)
head(train_x_guarded)

# Guarded test data (first 6 rows)
head(test_x_guarded)
```
The GuardFit object records the preprocessing steps and the number of
features retained after filtering. The summary reports how many features were
removed and the preprocessing audit trail. The guarded train/test previews show
that missing values were imputed and (if requested) scaled using training-only
statistics; the test data never influences these values.
Factor level guard (advanced)
```r
raw_levels <- data.frame(
  site = c("A", "B", "B"),
  status = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)
level_state <- .guard_ensure_levels(raw_levels)

# Aligned factor data with consistent levels
level_state$data

# Levels map
level_state$levels
```
The returned data keeps factor levels consistent across folds, while the
levels list records the training-time levels (including any dummy levels added
to preserve one-hot columns). Use these when transforming new data to avoid
misaligned model matrices.
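A minimal sketch of re-applying the stored levels to a new fold, assuming level_state$levels is a named list of level vectors as shown above:

```r
# Coerce new data to the training-time factor levels before model.matrix()
new_rows <- data.frame(site = c("B", "A"), status = c("no", "no"))
new_rows[] <- Map(function(col, lev) factor(col, levels = lev),
                  new_rows, level_state$levels[names(new_rows)])
str(new_rows)
```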
Leaky example: imputation using train and test together
```r
train <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 1, 1, 0))
test <- data.frame(a = c(NA, 5), b = c(1, NA))

all_median <- vapply(rbind(train, test), function(col) median(col, na.rm = TRUE), numeric(1))
train_leaky <- as.data.frame(Map(function(col, m) { col[is.na(col)] <- m; col }, train, all_median))
test_leaky <- as.data.frame(Map(function(col, m) { col[is.na(col)] <- m; col }, test, all_median))

# Leaky medians computed on train + test
data.frame(feature = names(all_median), median = all_median)

# Leaky-imputed training data
train_leaky

# Leaky-imputed test data
test_leaky
```
The medians above were computed using both train and test data, so the test set influences the imputation values. This is a classic leakage pathway because test information directly alters the training features.
Leakage-safe alternative: impute_guarded()
```r
imp <- impute_guarded(
  train = train,
  test = test,
  method = "median",
  winsor = FALSE
)

# Guarded-imputed training data
imp$train

# Guarded-imputed test data
imp$test
```
Here, the imputation statistics are computed from the training set only. The missing value in the test set (column a) is replaced by 2, which is the training median. In contrast, in the leaky example above, it was replaced by 3, the global median.
This confirms that the test set is transformed using fixed values learned from the training data, preserving a clean separation between training and evaluation.
The LeakImpute object also contains missingness diagnostics in
imp$summary$diagnostics, and guarded outputs use the same encoding as
.guard_fit() (categorical variables are one-hot encoded). Use vars to
impute only a subset of columns when needed.
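For example, to impute only column a while leaving b untouched, a sketch using the vars argument mentioned above:

```r
# Guarded imputation restricted to a subset of columns
imp_a <- impute_guarded(train = train, test = test, method = "median", vars = "a")
imp_a$train
```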
Bridging guard steps to recipes: guard_to_recipe()
guard_to_recipe() converts a guard preprocessing specification into a
recipes::recipe, enabling a smooth transition from guard-based to
tidymodels-based workflows. It maps supported guard steps to their closest
recipe equivalents:
| Guard step | Recipe equivalent |
|---|---|
| impute$method = "median" | step_impute_median() |
| impute$method = "knn" | step_impute_knn() |
| normalize$method = "zscore" | step_normalize() |
| filter$var_thresh > 0 | step_nzv() |
| fs$method = "pca" | step_pca() |
Steps with no direct recipe equivalent (missForest, mice, robust
normalization, ttest, lasso feature selection) produce a
bioLeak_fallback_warning and either fall back to the closest available step or
are skipped.
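If you want to log rather than silently accept fallbacks, you can trap the classed warning. A sketch, assuming robust normalization triggers the fallback as described above:

```r
# Surface bioLeak_fallback_warning conditions during conversion
rec_fb <- withCallingHandlers(
  guard_to_recipe(
    steps = list(normalize = list(method = "robust")),
    formula = outcome ~ .,
    training_data = df[, c("outcome", predictors)]
  ),
  bioLeak_fallback_warning = function(w) {
    message("Fallback applied: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```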
if (requireNamespace("recipes", quietly = TRUE)) { guard_steps <- list( impute = list(method = "median"), normalize = list(method = "zscore") ) rec <- guard_to_recipe( steps = guard_steps, formula = outcome ~ ., training_data = df[, c("outcome", predictors)] ) cat("Converted recipe:\n") rec }
The returned recipe is untrained and can be passed directly to fit_resample()
or tune_resample() as the preprocess argument.
fit_resample()

fit_resample() combines leakage-aware splits with guarded preprocessing and model fitting. It supports:

- built-in learners (glmnet, ranger)
- parsnip model specifications or workflows::workflow objects (recommended)
- recipes::recipe preprocessing (prepped on training folds only)
- rsample resamples (rset/rsplit) via splits, with split_cols to map metadata
- custom_learners (legacy/advanced)
- fold-level diagnostics in fit@info$fold_status (success/skipped/failed)

Assumptions and misuse to avoid:

- survival outcomes must be supplied as a Surv column or time/event columns
- positive_class must be a single value that exists in the outcome levels
- class_weights only applies to binomial/multiclass tasks; workflows must handle weights internally

Use learner_args to pass model-specific arguments. For custom learners
(advanced), you can supply separate fit and predict argument lists. Set
parallel = TRUE to use future.apply for fold-level parallelism when available.
When learner is a parsnip specification, learner_args and
custom_learners are ignored. When learner is a workflow, preprocess and
learner_args are ignored.
When a recipe or workflow is used, the built-in guarded preprocessing list is
bypassed, so ensure the recipe/workflow itself is leakage-safe.
Metrics used in this vignette:

- auc: area under the ROC curve (0.5 = random, 1.0 = perfect)
- accuracy: fraction of correct predictions
- macro_f1 and log_loss (multiclass examples) and cindex (survival examples)
Always report the mean and variability across folds, not a single fold value.
When using yardstick metrics, the positive class is the second factor level;
set positive_class to control this.
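Because df$outcome was built with levels = c("control", "case"), the second level is "case", matching the positive class used throughout these examples; making it explicit avoids surprises if the level order ever changes:

```r
levels(df$outcome)  # "control" "case" -> second level is treated as positive
# fit_resample(..., positive_class = "case")  # explicit and order-independent
```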
Parsnip model specification (recommended)
```r
spec <- parsnip::logistic_reg(mode = "classification") |>
  parsnip::set_engine("glm")
```
This spec uses base R glm under the hood and does not require extra model
packages. Use it in the examples below; custom learners are covered in the
Advanced section.
Leaky example: leaky features and row-wise splits
```r
fit_leaky <- fit_resample(
  df_leaky,
  outcome = "outcome",
  splits = leaky_splits,
  learner = spec,
  metrics = c("auc", "accuracy"),
  preprocess = list(
    impute = list(method = "median"),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0),
    fs = list(method = "none")
  )
)
cat("Leaky fit summary:\n")
summary(fit_leaky)

metrics_leaky <- as.data.frame(fit_leaky@metric_summary)
num_cols <- vapply(metrics_leaky, is.numeric, logical(1))
metrics_leaky[num_cols] <- lapply(metrics_leaky[num_cols], round, digits = 3)

# Leaky fit: mean and SD of metrics across folds
metrics_leaky
```
The summary reports cross-validated performance. AUC ranges from 0.5 (random) to 1.0 (perfect), while accuracy is the fraction of correct predictions. Because the splits are leaky, these metrics can be artificially high and should not be used for reporting.
Leakage-safe alternative: grouped splits and clean predictors
```r
fit_safe <- fit_resample(
  df,
  outcome = "outcome",
  splits = safe_splits,
  learner = spec,
  metrics = c("auc", "accuracy"),
  preprocess = list(
    impute = list(method = "median"),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0),
    fs = list(method = "none")
  ),
  positive_class = "case",
  class_weights = c(control = 1, case = 1),
  refit = TRUE
)
cat("Leakage-safe fit summary:\n")
summary(fit_safe)

metrics_safe <- as.data.frame(fit_safe@metric_summary)
num_cols <- vapply(metrics_safe, is.numeric, logical(1))
metrics_safe[num_cols] <- lapply(metrics_safe[num_cols], round, digits = 3)

# Leakage-safe fit: mean and SD of metrics across folds
metrics_safe

# Per-fold metrics (first 6 rows)
head(fit_safe@metrics)
```
fit_resample() returns a LeakFit object. Use summary(fit_safe) for a
compact report and inspect fit_safe@metrics and fit_safe@predictions for
details. When refit = TRUE, the final model and preprocessing state are stored
in fit_safe@info$final_model and fit_safe@info$final_preprocess.
Fold-level status diagnostics are available in fit_safe@info$fold_status.
The mean +/- SD table is the primary performance summary to report, while the per-fold metrics reveal variability and potential instability across folds.
By default, fit_resample() stores refit inputs in fit_safe@info$perm_refit_spec
to enable audit_leakage(perm_refit = "auto"). Set store_refit_data = FALSE
to reduce memory and provide perm_refit_spec manually when you want refit-based
permutations. When multiple learners are passed, refit = TRUE refits only the
first learner; set refit = FALSE if you do not need a final model.
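A sketch of the memory-lean configuration described above, for when no refit-based permutation test is planned:

```r
# Skip storing refit inputs when audits will use fixed-prediction shuffles
fit_lean <- fit_resample(
  df,
  outcome = "outcome",
  splits = safe_splits,
  learner = spec,
  metrics = "auc",
  refit = FALSE,
  store_refit_data = FALSE
)
```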
```r
# Fold-level diagnostics for reproducible troubleshooting
head(fit_safe@info$fold_status, 5)
```
Interpretation of fold_status:

- status = "success" means at least one learner produced valid metrics on that fold.
- status = "skipped" typically means a data-structure issue in that fold (for example, single-class training data in classification).
- status = "failed" indicates a runtime failure (for example, a model fitting or preprocessing error).

Use the reason and notes fields to decide whether to adjust the splits, the preprocessing, or the learner configuration.
If many folds are skipped/failed, do not trust aggregate metric means until the instability is resolved.
The examples above use a parsnip model specification. To swap in other models,
replace spec with another parsnip spec (see the gradient boosting example
below).
Multiclass classification (optional)
if (requireNamespace("ranger", quietly = TRUE)) { set.seed(11) df_multi <- df df_multi$outcome3 <- factor(sample(c("A", "B", "C"), nrow(df_multi), replace = TRUE)) multi_splits <- make_split_plan( df_multi, outcome = "outcome3", mode = "subject_grouped", group = "subject", v = 4, stratify = TRUE, seed = 11 ) fit_multi <- fit_resample( df_multi, outcome = "outcome3", splits = multi_splits, learner = "ranger", metrics = c("accuracy", "macro_f1", "log_loss"), refit = FALSE ) cat("Multiclass fit summary:\n") summary(fit_multi) } else { cat("ranger not installed; skipping multiclass example.\n") }
Multiclass fits compute accuracy, macro-F1, and log loss when class
probabilities are available. positive_class is ignored for multiclass tasks.
Survival outcomes (optional)
if (requireNamespace("survival", quietly = TRUE) && requireNamespace("glmnet", quietly = TRUE)) { set.seed(12) df_surv <- df df_surv$time_to_event <- rexp(nrow(df_surv), rate = 0.1) df_surv$event <- rbinom(nrow(df_surv), 1, 0.7) surv_splits <- make_split_plan( df_surv, outcome = c("time_to_event", "event"), mode = "subject_grouped", group = "subject", v = 4, stratify = FALSE, seed = 12 ) fit_surv <- fit_resample( df_surv, outcome = c("time_to_event", "event"), splits = surv_splits, learner = "glmnet", metrics = "cindex", refit = FALSE ) cat("Survival fit summary:\n") summary(fit_surv) } else { cat("survival or glmnet not installed; skipping survival example.\n") }
For survival tasks, supply outcome = c(time_col, event_col) or a Surv
column; stratify is ignored for time/event outcomes and class_weights are
not used.
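A sketch of the Surv-column alternative (it reuses df_surv from the chunk above, so it assumes that chunk ran):

```r
# Equivalent survival outcome supplied as a single Surv column
df_surv$surv_obj <- survival::Surv(df_surv$time_to_event, df_surv$event)
# make_split_plan(df_surv, outcome = "surv_obj", mode = "subject_grouped",
#                 group = "subject", v = 4, seed = 12)
```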
Passing learner-specific arguments (optional)
if (requireNamespace("glmnet", quietly = TRUE)) { fit_glmnet <- fit_resample( df, outcome = "outcome", splits = safe_splits, learner = "glmnet", metrics = "auc", learner_args = list(glmnet = list(alpha = 0.5)), preprocess = list( impute = list(method = "median"), normalize = list(method = "zscore"), filter = list(var_thresh = 0), fs = list(method = "none") ) ) cat("GLMNET summary with learner-specific arguments:\n") summary(fit_glmnet) } else { cat("glmnet not installed; skipping learner_args example.\n") }
This summary reflects the same guarded preprocessing but a different model
configuration (here, alpha = 0.5). Use the mean +/- SD metrics to compare
learner settings under identical splits.
SummarizedExperiment inputs (optional)
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { se <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = t(as.matrix(df[, predictors]))), colData = df[, c("subject", "batch", "study", "time", "outcome"), drop = FALSE] ) se_splits <- make_split_plan( se, outcome = "outcome", mode = "subject_grouped", group = "subject", v = 3 ) se_fit <- fit_resample( se, outcome = "outcome", splits = se_splits, learner = spec, metrics = "auc" ) cat("SummarizedExperiment fit summary:\n") summary(se_fit) } else { cat("SummarizedExperiment not installed; skipping SE example.\n") }
The summary is identical in structure to the data.frame case because fit_resample() extracts predictors and metadata from the SummarizedExperiment object.
Note that features_final here reflects only the assay predictors (x1, x2,
x3), because the assay was constructed without metadata columns.
Tidymodels integration

bioLeak integrates with rsample, recipes, workflows, and yardstick:
- splits can be an rsample rset/rsplit; use split_cols (default "auto") or the bioLeak_mode / bioLeak_perm_mode attributes to map group/batch/study/time metadata.
- preprocess can be a recipes::recipe (prepped on training folds only).
- learner can be a workflows::workflow.
- metrics can be a yardstick::metric_set.
- as_rsample() converts LeakSplits to an rsample object.

```r
library(bioLeak)
library(parsnip)
library(recipes)
library(yardstick)

set.seed(123)
N <- 60
df <- data.frame(
  subject = factor(rep(paste0("S", 1:20), length.out = N)),
  outcome = factor(rep(c("ClassA", "ClassB"), length.out = N)),
  x1 = rnorm(N),
  x2 = rnorm(N),
  x3 = rnorm(N)
)

spec <- logistic_reg() |> set_engine("glm")

# Use bioLeak's native split planner to avoid conversion errors:
# make_split_plan() (instead of rsample::group_vfold_cv()) creates a
# subject-grouped CV directly compatible with fit_resample().
set.seed(13)
splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 3
)

rec <- recipes::recipe(outcome ~ x1 + x2 + x3, data = df) |>
  recipes::step_impute_median(recipes::all_numeric_predictors()) |>
  recipes::step_normalize(recipes::all_numeric_predictors())

metrics_set <- yardstick::metric_set(yardstick::roc_auc, yardstick::accuracy)

fit_rs <- fit_resample(
  df,
  outcome = "outcome",
  splits = splits,
  learner = spec,
  preprocess = rec,
  metrics = metrics_set,
  refit = FALSE
)

if (exists("as_rsample", where = asNamespace("bioLeak"), mode = "function")) {
  rs_export <- as_rsample(fit_rs@splits, data = df)
  print(rs_export)
}
```
Use split_cols to ensure split-defining metadata are dropped from predictors
when using rsample inputs.
if (requireNamespace("workflows", quietly = TRUE)) { wf <- workflows::workflow() |> workflows::add_model(spec) |> workflows::add_formula(outcome ~ x1 + x2 + x3) fit_wf <- fit_resample( df, outcome = "outcome", splits = safe_splits, learner = wf, metrics = "auc", refit = FALSE ) cat("Workflow fit summary:\n") summary(fit_wf) } else { cat("workflows not installed; skipping workflow example.\n") }
When auditing rsample-based fits, pass perm_mode = "subject_grouped" (or set
attr(rs, "bioLeak_perm_mode") <- "subject_grouped") so restricted permutations
respect the intended split design.
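A minimal sketch of tagging an existing rsample object with the attributes named above so bioLeak maps the design correctly:

```r
if (requireNamespace("rsample", quietly = TRUE)) {
  # Build a subject-grouped rset and label its intended permutation design
  rs <- rsample::group_vfold_cv(df, group = subject, v = 3)
  attr(rs, "bioLeak_mode") <- "subject_grouped"
  attr(rs, "bioLeak_perm_mode") <- "subject_grouped"
  # fit_resample(df, outcome = "outcome", splits = rs, learner = spec, metrics = "auc")
}
```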
tune_resample()

tune_resample() runs leakage-aware nested tuning with tidymodels. It accepts a parsnip model specification or workflow, and supports two inner-CV modes:

- pre-attached inner folds (make_split_plan(..., nested = TRUE))
- on-the-fly inner LeakSplits using inner_v, inner_repeats, and inner_seed

For rsample inputs, inner folds must already be present. Survival tasks are not yet supported.

New in the development version (planned for v0.2.0):

- selection = "one_std_err" with an explicit simplicity tie-break
- fold-level diagnostics in tuned$fold_status
- a final refit model when refit = TRUE

```r
if (requireNamespace("parsnip", quietly = TRUE) &&
    requireNamespace("recipes", quietly = TRUE) &&
    requireNamespace("tune", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  # --- 1. Create Data ---
  set.seed(123)
  N <- 60
  df <- data.frame(
    subject = factor(rep(paste0("S", 1:20), length.out = N)),
    outcome = factor(sample(c("ClassA", "ClassB"), N, replace = TRUE)),
    x1 = rnorm(N),
    x2 = rnorm(N),
    x3 = rnorm(N)
  )

  # --- 2. Generate Nested Splits ---
  set.seed(1)
  nested_splits <- make_split_plan(
    df,
    outcome = "outcome",
    mode = "subject_grouped",
    group = "subject",
    v = 3,
    nested = TRUE
  )

  # --- 3. Define Recipe & Model ---
  rec <- recipes::recipe(outcome ~ x1 + x2 + x3, data = df) |>
    recipes::step_impute_median(recipes::all_numeric_predictors()) |>
    recipes::step_normalize(recipes::all_numeric_predictors())

  spec_tune <- parsnip::logistic_reg(
    penalty = tune::tune(),
    mixture = 1,
    mode = "classification"
  ) |>
    parsnip::set_engine("glmnet")

  # --- 4. Run Tuning ---
  tuned <- tune_resample(
    df,
    outcome = "outcome",
    splits = nested_splits,
    learner = spec_tune,
    preprocess = rec,
    inner_v = 2,
    grid = 3,
    metrics = c("auc", "accuracy"),
    selection = "one_std_err",
    refit = TRUE,
    seed = 14
  )
  summary(tuned)
} else {
  cat("parsnip/recipes/tune/glmnet not installed; skipping nested tuning example.\n")
}
```
if (exists("tuned")) { # Fold-level status from the outer loop head(tuned$fold_status, 5) # Final tuned model is available when refit = TRUE is.null(tuned$final_model) } else { cat("Nested tuning object not available (dependencies missing).\n") }
How to interpret this tuning output:

- selection = "one_std_err" chooses a model within one standard error of the best inner-CV score, then applies a deterministic simplicity tie-break. This usually favors more regularized/simpler models with similar expected performance.
- tuned$metrics and tuned$metric_summary remain the primary unbiased performance estimates (outer folds only).
- tuned$final_model is the deployable model fit on full data using the selected hyperparameters; it is for downstream prediction, not for estimating generalization.
- tuned$fold_status should be reviewed before reporting results. A large number of skipped/failed folds indicates unstable tuning and should trigger split/parameter redesign.

bioLeak natively supports tidymodels specifications. You can pass a parsnip
model specification directly to fit_resample(). This allows you to use
state-of-the-art algorithms like XGBoost, LightGBM, or SVMs while ensuring all
preprocessing remains guarded.
if (requireNamespace("parsnip", quietly = TRUE) && requireNamespace("xgboost", quietly = TRUE) && requireNamespace("recipes", quietly = TRUE)) { # 1. Define the model spec xgb_spec <- parsnip::boost_tree( mode = "classification", trees = 100, tree_depth = 6, learn_rate = 0.01 ) |> parsnip::set_engine("xgboost") # 2. Define a recipe on data without 'subject' because split metadata # columns are excluded from predictors by fit_resample(). df_for_rec <- df[, !names(df) %in% "subject"] rec_xgb <- recipes::recipe(outcome ~ ., data = df_for_rec) |> recipes::step_dummy(recipes::all_nominal_predictors()) |> recipes::step_impute_median(recipes::all_numeric_predictors()) # Note: No need for step_rm(subject) because it's already gone! # 3. Fit fit_xgb <- fit_resample( df, outcome = "outcome", splits = nested_splits, learner = xgb_spec, metrics = "auc", preprocess = rec_xgb ) cat("XGBoost parsnip fit summary:\n") print(summary(fit_xgb)) } else { cat("parsnip/xgboost/recipes not installed.\n") }
The summary reports cross-validated AUC for a non-linear gradient boosting model. Use the mean $\pm$ SD table to compare against baseline models, and confirm that any gains do not coincide with leakage signals in the audit diagnostics.
Custom learners (advanced)

Custom learners are used when a model is not available as a parsnip specification or when a lightweight wrapper around base R is needed.
Each custom learner must define fit and predict components. The fit function must accept x, y, task, and weights as inputs, while the predict function must accept object and newdata.
```r
custom_learners <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      df_fit <- data.frame(y = y, x, check.names = FALSE)
      stats::glm(y ~ ., data = df_fit, family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata), type = "response"))
    }
  )
)
cat("Custom learner names:\n")
names(custom_learners)
cat("Custom learner components (fit/predict):\n")
lapply(custom_learners, names)
```
This output confirms that each custom learner provides both a fit and a predict function.
Custom learners can be used with fit_resample() as follows:
```r
fit_resample(
  ...,
  learner = "glm",
  custom_learners = custom_learners
)
```
Visualization helpers

bioLeak includes plotting helpers for quick diagnostic checks:
- plot_fold_balance() shows class balance per fold.
- plot_overlap_checks() highlights train/test overlap for a metadata column.
- plot_time_acf() checks autocorrelation in time-series predictions.
- plot_calibration() shows reliability of binomial probabilities.
- plot_confounder_sensitivity() summarizes performance by batch/study strata.

Tabular helpers are also available:

- calibration_summary() returns calibration curve data and ECE/MCE/Brier metrics.
- confounder_sensitivity() returns per-stratum performance summaries.

Misuse to avoid:

- plot_overlap_checks() when the column is not present in the split metadata.
- plot_time_acf() without a time column or for non-temporal data.
- plot_calibration() outside binomial tasks.
- plot_confounder_sensitivity() with a metric unsupported by the task.

For classification, plot_fold_balance() uses the positive class recorded in
fit@info$positive_class, or defaults to the second factor level if this is not
set. For multiclass tasks, it shows per-class counts without a proportion line.
if (requireNamespace("ggplot2", quietly = TRUE)) { plot_fold_balance(fit_safe) } else { cat("ggplot2 not installed; skipping fold balance plot.\n") }
The bar chart shows the counts of Positives (blue) and Negatives (tan) in each fold. The dashed blue line represents the proportion of the positive class.
In a well-stratified fit, this line remains relatively stable across folds. Large fluctuations indicate poor stratification, which can lead to unstable fold-level performance estimates.
if (requireNamespace("ggplot2", quietly = TRUE)) { plot_calibration(fit_safe, bins = 10) } else { cat("ggplot2 not installed; skipping calibration plot.\n") }
Calibration curves compare predicted probabilities to observed event rates. Large deviations from the diagonal indicate miscalibration.
Use min_bin_n to suppress tiny bins and learner = to select a model when
multiple learners are stored in the fit.
if (requireNamespace("ggplot2", quietly = TRUE)) { plot_confounder_sensitivity(fit_safe, confounders = c("batch", "study"), metric = "auc", min_n = 3) } else { cat("ggplot2 not installed; skipping confounder sensitivity plot.\n") }
Confounder sensitivity plots highlight whether performance varies substantially across batch or study strata.
```r
cal <- calibration_summary(fit_safe, bins = 10, min_bin_n = 5)
conf_tbl <- confounder_sensitivity(fit_safe, confounders = c("batch", "study"), metric = "auc", min_n = 3)

# Calibration metrics
cal$metrics

# Confounder sensitivity table (first 6 rows)
head(conf_tbl)
```
if (requireNamespace("ggplot2", quietly = TRUE)) { plot_overlap_checks(fit_leaky, column = "subject") plot_overlap_checks(fit_safe, column = "subject") } else { cat("ggplot2 not installed; skipping overlap plots.\n") }
The overlap plots compare the number of unique subjects appearing in the training and test sets for each fold.
Top plot (fit_leaky)
The red line shows high overlap counts, accompanied by the explicit “WARNING: Overlaps detected!” annotation. This confirms that the same subjects appear in both the training and test sets, indicating information leakage.
Bottom plot (fit_safe)
The overlap line remains flat at zero across all folds. This confirms that subject_grouped splitting successfully keeps subjects isolated, ensuring that no subject appears in both sets simultaneously.
audit_leakage()

audit_leakage() combines four diagnostics in one object: a label-permutation test, batch/study association, a target leakage scan, and duplicate detection.

Assumptions and misuse to avoid:

- coldata (or stored split metadata) must align with prediction IDs; rownames or a row_id column help alignment
- X_ref must align to prediction IDs (rownames, row_id, or matching order)
- plot_perm_distribution() requires return_perm = TRUE
- target_threshold should be set high enough to flag only strong proxies
- the target scan is configured via its options (target_scan_multivariate, *_components, *_interactions, *_B); proxies outside X_ref can still pass undetected
- for rsample inputs, set perm_mode (or bioLeak_perm_mode) so restricted permutations respect the intended design
- for time-series fits, choose time_block / block_len as needed
refit inputs are available (perm_refit = "auto" with store_refit_data = TRUE)
and B <= perm_refit_auto_max. Otherwise it keeps predictions fixed, so the
p-value quantifies prediction-label association rather than a full refit null.
A large gap indicates a non-random association. To determine if that signal is
real or leaked, check the Batch Association (confounding),
Target Leakage Scan (proxy features), and Duplicate Detection
(memorization) tables. Use perm_refit = FALSE to force fixed-prediction
shuffles or perm_refit = TRUE with perm_refit_spec to always refit.
Use feature_space = "rank" to compare samples by rank profiles, and
sim_method (cosine or pearson) to control similarity. For large n, nn_k
and max_pairs limit duplicate searches. Use duplicate_scope = "all" to
include within-fold duplicates (data-quality checks). Use ci_method = "if"
(influence-function) or ci_method = "bootstrap" with boot_B to obtain a
confidence interval for the permutation gap; include_z controls whether a
z-score is reported. Set perm_stratify = "auto" to stratify permutations only
when outcomes support it.
Leakage-aware audit
```r
X_ref <- df[, predictors]
X_ref[c(1, 5), ] <- X_ref[1, ]

audit <- audit_leakage(
  fit_safe,
  metric = "auc",
  B = 20,
  perm_stratify = TRUE,
  batch_cols = c("batch", "study"),
  X_ref = X_ref,
  sim_method = "cosine",
  sim_threshold = 0.995,
  return_perm = TRUE
)
cat("Leakage audit summary:\n")
summary(audit)

if (!is.null(audit@permutation_gap) && nrow(audit@permutation_gap) > 0) {
  # Permutation significance results
  audit@permutation_gap
}
if (!is.null(audit@batch_assoc) && nrow(audit@batch_assoc) > 0) {
  # Batch/study association with folds (Cramer's V)
  audit@batch_assoc
} else {
  cat("No batch or study associations detected.\n")
}
if (!is.null(audit@target_assoc) && nrow(audit@target_assoc) > 0) {
  # Top features by target association score
  head(audit@target_assoc)
} else {
  cat("No target leakage scan results available.\n")
}
if (!is.null(audit@duplicates) && nrow(audit@duplicates) > 0) {
  # Top duplicate/near-duplicate pairs by similarity
  head(audit@duplicates)
} else {
  cat("No near-duplicates detected.\n")
}
```
The permutation table reports the observed metric, the mean under random label permutation, the gap (difference), and a permutation p-value. For metrics where higher values indicate better performance, larger gaps reflect stronger non-random signal.
The batch association table reports chi-square statistics and Cramer's V. Large p-values and small V values indicate that folds are not aligned with batch or study labels (which is the desired outcome when these should be independent).
The target scan table lists features with the strongest associations with the outcome. For numeric features, the score is $|\mathrm{AUC} - 0.5| \times 2$ for classification or the absolute correlation for regression. For categorical features, the score is Cramér's V or eta-squared. Scores closer to 1 indicate stronger outcome association.
The duplicate table lists pairs of samples with near-identical profiles that
cross train/test folds (by default). In this setup, the artificially duplicated
pair (X_ref[c(1, 5), ] <- X_ref[1, ]) should appear near the top of the list.
Use duplicate_scope = "all" to include within-fold duplicates.
Mechanism risk assessment
summary(audit) now includes a Mechanism Risk Assessment section that
classifies leakage evidence into four mechanism classes:
| Mechanism class | Evidence slot | Flagging rule |
|---|---|---|
| non_random_signal | permutation gap | p ≤ 0.05 and gap > 0 |
| confounding_alignment | batch association | p ≤ 0.05 and Cramér's V ≥ 0.1 |
| proxy_target_leakage | target scan | any feature flagged (FDR or raw) |
| duplicate_overlap | duplicates | any cross-fold duplicates present |
The summary is stored in audit@info$mechanism_summary and can be accessed
directly for programmatic triage:
```r
mech <- audit@info$mechanism_summary
if (is.data.frame(mech) && nrow(mech) > 0) {
  mech
} else {
  cat("No mechanism summary available.\n")
}
```
Each row reports whether the mechanism was flagged, the evidence type, and the associated test statistic and p-value. This provides a quick triage overview before drilling into individual audit slots.
if (requireNamespace("ggplot2", quietly = TRUE)) { plot_perm_distribution(audit) } else { cat("ggplot2 not installed; skipping permutation plot.\n") }
The histogram shows the null distribution (gray bars) of the performance metric under random label permutation.
The blue dashed line represents the average performance of a random model (the permuted mean). The red solid line represents the observed performance of the fitted model.
A genuine signal is indicated when the red line lies well to the right of the gray distribution; overlap suggests weak or unstable signal.
Use the mechanism risk assessment (audit@info$mechanism_summary) as a quick
triage tool: flagged mechanisms point you to the specific audit slots that
warrant deeper investigation.
Audit per learner
if (requireNamespace("ranger", quietly = TRUE) && requireNamespace("parsnip", quietly = TRUE)) { # 1. Define specs # Standard GLM (no tuning) spec_glm <- parsnip::logistic_reg() |> parsnip::set_engine("glm") # Random Forest spec_rf <- parsnip::rand_forest( mode = "classification", trees = 100 ) |> parsnip::set_engine("ranger") # 2. Fit using the current split object fit_multi <- fit_resample( df, outcome = "outcome", splits = nested_splits, learner = list(glm = spec_glm, rf = spec_rf), metrics = "auc" ) # 3. Run the audit audits <- audit_leakage_by_learner(fit_multi, metric = "auc", B = 20) cat("Per-learner audit summary:\n") print(audits) } else { cat("ranger/parsnip not installed.\n") }
This example uses ranger for the random forest specification; if it is not installed, the code chunk is skipped.
Use parallel_learners = TRUE to audit learners concurrently when future.apply is available.
The printed table summarizes each learner’s observed metric, permutation gap,
p-value, and key batch/duplicate summaries. Use it to compare signal strength
across models while checking for leakage risks. Pass learners = to audit a
subset, or mc.cores = to control parallel workers.
HTML audit report
audit_report() accepts either a LeakAudit or a LeakFit object. When a
LeakFit is provided, it first runs audit_leakage() and then forwards any
additional arguments to the audit step. If multiple learners were fit, pass
learner = via ... to select one. Use open = TRUE to open the report in a
browser after rendering.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) { report_path <- audit_report(audit, output_dir = ".") cat("HTML report written to:\n", report_path, "\n") } else { cat("rmarkdown or pandoc not available; skipping audit report rendering.\n") }
The report path points to a standalone HTML file containing the same audit tables and plots, suitable for sharing with collaborators or archiving as a quality control record.
Time-series workflows

Time-series data require special handling. Random splits can leak information
from the future into the past. Use mode = "time_series" with a prediction
horizon, and audit with block permutations. Choose time_block = "circular"
or "stationary"; when block_len is NULL, the audit uses a default block
length (~10% of the test block size, minimum 5).
Leaky example: lookahead feature
```r
time_splits <- make_split_plan(
  df_time,
  outcome = "outcome",
  mode = "time_series",
  time = "time",
  v = 4,
  horizon = 1
)
cat("Time-series splits summary:\n")
time_splits

fit_time_leaky <- fit_resample(
  df_time,
  outcome = "outcome",
  splits = time_splits,
  learner = spec,
  metrics = "auc"
)
cat("Time-series leaky fit summary:\n")
summary(fit_time_leaky)
```
Time-series splitting trains on growing windows, so performance can differ from
standard CV simply because early folds have smaller training sets. Regardless of
the score, the presence of the leak_future feature makes this estimate
methodologically invalid.
Leakage-safe alternative: remove lookahead and audit with blocks
```r
df_time_safe <- df_time
df_time_safe$leak_future <- NULL

fit_time_safe <- fit_resample(
  df_time_safe,
  outcome = "outcome",
  splits = time_splits,
  learner = spec,
  metrics = "auc"
)
cat("Time-series safe fit summary:\n")
summary(fit_time_safe)

audit_time <- audit_leakage(
  fit_time_safe,
  metric = "auc",
  B = 20,
  time_block = "stationary",
  block_len = 5
)
cat("Time-series leakage audit summary:\n")
summary(audit_time)

if (!is.null(audit_time@permutation_gap) && nrow(audit_time@permutation_gap) > 0) {
  # Time-series permutation significance results
  audit_time@permutation_gap
}

if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot_time_acf(fit_time_safe, lag.max = 20)
} else {
  cat("ggplot2 not installed; skipping ACF plot.\n")
}
```
The safe fit summary provides the leakage-resistant performance estimate. Compare
leaky vs. safe fits to gauge inflation risk; features_final should drop when
the lookahead feature is removed.
The ACF plot shows the autocorrelation of out-of-fold predictions by lag. The
dashed red lines represent the 95% confidence interval for white noise; large
bars outside the bands indicate residual temporal dependence. plot_time_acf()
requires numeric predictions and time metadata aligned to the fit.
Parallel processing

bioLeak uses the future framework for parallelism (Windows, macOS, Linux).
fit_resample(), audit_leakage(), audit_leakage_by_learner(), and
simulate_leakage_suite() honor the active plan when parallel = TRUE (or
parallel_learners = TRUE).
```r
library(future)

# Use multiple cores (works on all OS)
plan(multisession, workers = 4)

# Run a heavy simulation
sim <- simulate_leakage_suite(..., parallel = TRUE)

# Parallel folds or audits
# fit_resample(..., parallel = TRUE)
# audit_leakage(..., parallel = TRUE)
# audit_leakage_by_learner(..., parallel_learners = TRUE)

# Return to sequential processing
plan(sequential)
```
This chunk only configures the parallel plan, so it does not produce a result object. Use it as a template before running compute-heavy functions.
Simulation suite

simulate_leakage_suite() runs Monte Carlo simulations that inject specific
leakage mechanisms and then evaluates detection with the audit pipeline. It is
useful for validating your leakage checks before applying them to real data.
Available leakage types include none, subject_overlap, batch_confounded,
peek_norm, and lookahead. Use signal_strength, prevalence, rho, K,
repeats, horizon, and preprocess to control the simulation. The output is
a LeakSimResults data frame with one row per seed. Metrics are selected
automatically based on outcome type (AUC for binary classification).
if (requireNamespace("glmnet", quietly = TRUE)) { sim <- simulate_leakage_suite( n = 80, p = 6, mode = "subject_grouped", learner = "glmnet", leakage = "subject_overlap", seeds = 1:2, B = 20, parallel = FALSE, signal_strength = 1 ) # Simulation results (first 6 rows) head(sim) } else { cat("glmnet not installed; skipping simulation suite example.\n") }
Each row corresponds to one simulation seed.
metric_obs is the observed performance (AUC for this simulation). Higher
values suggest stronger apparent signal, which can be inflated by leakage.
gap and p_value indicate that the leaked signal is statistically distinguishable from random noise.
Use metric_obs to gauge the magnitude of the leakage effect, and gap to assess detection sensitivity.
Result objects

bioLeak uses S4 classes and list-like results to capture provenance and diagnostics:
- LeakSplits: returned by make_split_plan(), printed with show(). Use check_split_overlap() to verify no-overlap invariants on grouping columns.
- LeakFit: returned by fit_resample(), summarized with summary(); includes fold diagnostics in fit@info$fold_status.
- LeakAudit: returned by audit_leakage(), summarized with summary(). Now includes audit@info$mechanism_summary, a data.frame classifying leakage evidence into mechanism classes (non_random_signal, confounding_alignment, proxy_target_leakage, duplicate_overlap).
- LeakAuditList: returned by audit_leakage_by_learner(), printed directly.
- GuardFit: returned by .guard_fit(), printed and summarized. Use guard_to_recipe() to convert guard preprocessing specs to a recipes::recipe for tidymodels workflows.
- LeakImpute: returned by impute_guarded() (guarded data + diagnostics).
- LeakTune: returned by tune_resample() (nested tuning summaries); includes fold_status, selected parameters, and optional final_model / final_workflow when refit = TRUE.
- LeakSimResults: returned by simulate_leakage_suite() (per-seed results).

Set options(bioLeak.strict = TRUE) to activate strict leakage mode, which
escalates trained-recipe/workflow warnings to errors, warns on missing seeds,
and auto-runs overlap checks after splitting.
These objects store hashes and metadata that help reproduce and audit the
workflow. Inspect their slots (@metrics, @predictions, @permutation_gap,
@duplicates) to drill into specific leakage signals.
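A short sketch of programmatic inspection (slotNames() comes from the methods package; the objects assume the earlier chunks ran):

```r
# Enumerate and drill into S4 evidence slots
slotNames(audit)
head(audit@duplicates)
head(fit_safe@metrics)
```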