View source: R/simulate_suite.R
| simulate_leakage_suite | R Documentation |
Simulates synthetic binary classification datasets with optional leakage mechanisms, fits a model using a leakage-aware cross-validation scheme, and summarizes the permutation-gap audit for each Monte Carlo seed. The suite is designed to surface validation failures such as subject overlap across folds, batch-confounded outcomes, global normalization/summary leakage, and time-series look-ahead. The output is a per-seed summary of observed CV performance and its gap versus a label-permutation null; it does not return fitted models or the full audit object. Results are limited to the built-in data generator and leakage types implemented here, and should be interpreted as a simulation-based sanity check rather than a comprehensive leakage detector for real data.
simulate_leakage_suite(
  n = 500,
  p = 20,
  prevalence = 0.5,
  mode = c("subject_grouped", "batch_blocked", "study_loocv", "time_series"),
  learner = c("glmnet", "ranger"),
  leakage = c("none", "subject_overlap", "batch_confounded", "peek_norm", "lookahead"),
  preprocess = NULL,
  rho = 0,
  K = 5,
  repeats = 1,
  horizon = 0,
  B = 200,
  seeds = 1:10,
  parallel = FALSE,
  signal_strength = 1,
  verbose = FALSE
)
n |
Integer scalar. Number of samples to simulate (default 500). Larger values stabilize the Monte Carlo summary but increase runtime. |
p |
Integer scalar. Number of baseline predictors before any leakage
feature is added (default 20). Increasing |
prevalence |
Numeric scalar in (0, 1). Target prevalence of class 1 in the simulated outcome (default 0.5). Changing this alters class imbalance and can affect AUC and the permutation gap. |
mode |
Character scalar. Cross-validation scheme passed to
|
learner |
Character scalar. Base learner, |
leakage |
Character scalar. Leakage mechanism to inject; one of
|
preprocess |
Optional preprocessing list or recipe passed to
[fit_resample()]. When NULL (default), the simulator uses the
fit_resample defaults; for |
rho |
Numeric scalar in [-1, 1]. AR(1)-style autocorrelation applied to each predictor across row order (default 0). Higher absolute values increase serial correlation and make time-ordered leakage more pronounced. |
K |
Integer scalar. Number of folds/partitions (default 5). Used as the
fold count for |
repeats |
Integer scalar >= 1. Number of repeated CV runs for
|
horizon |
Numeric scalar >= 0. Minimum time gap enforced between train
and test for |
B |
Integer scalar >= 1. Number of permutations used by
|
seeds |
Integer vector. Monte Carlo seeds (default |
parallel |
Logical scalar. If |
signal_strength |
Numeric scalar. Scales the linear predictor before sampling outcomes (default 1). Larger values increase class separation and tend to increase AUC; smaller values make the task harder. |
verbose |
Logical scalar. If |
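As a rough illustration of what the rho argument describes, the sketch below imposes AR(1)-style correlation on a single predictor along row order. The helper name ar1_filter and the variance-preserving scaling are assumptions made for this example; the package's internal implementation may differ.

```r
# Hypothetical helper (not part of the package): AR(1)-style filter over row order.
ar1_filter <- function(z, rho) {
  stopifnot(is.numeric(z), length(z) >= 1, abs(rho) <= 1)
  x <- numeric(length(z))
  x[1] <- z[1]
  for (i in seq_along(z)[-1]) {
    # Assumption: innovations are rescaled so the marginal variance stays near 1.
    x[i] <- rho * x[i - 1] + sqrt(1 - rho^2) * z[i]
  }
  x
}

# Lag-1 sample autocorrelation, used here only to inspect the effect of rho.
lag1 <- function(x) cor(x[-1], x[-length(x)])

set.seed(1)
z  <- rnorm(2000)
x0 <- ar1_filter(z, rho = 0)    # rho = 0 leaves the independent draws unchanged
x8 <- ar1_filter(z, rho = 0.8)  # neighbouring rows become strongly correlated

lag1(x0)
lag1(x8)
```

With rho near 0 the rows stay roughly independent, while larger absolute values make adjacent rows resemble each other, which is why time-ordered leakage becomes easier to exploit.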
The generator draws p standard normal predictors, builds a linear
predictor from the first min(5, p) features, scales it by
signal_strength, and samples a binary outcome to achieve the requested
prevalence. Outcomes are returned as a two-level factor, so the audited
metric is AUC. Simulated metadata include subject, batch, study, and time
fields used by mode to create leakage-aware splits. Leakage mechanisms
are injected by adding a single extra predictor as described in
leakage. Parallel execution uses future.apply when installed and
does not change results.
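The generator steps described above can be sketched in base R as follows. This is a minimal illustration, not the package's code: the function name simulate_binary, the unit coefficients on the signal features, the logistic link, the intercept calibration via uniroot(), and the factor labels are all assumptions.

```r
# Hypothetical sketch of the documented generator steps.
simulate_binary <- function(n, p, prevalence = 0.5, signal_strength = 1) {
  # p standard normal predictors
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)

  # Linear predictor from the first min(5, p) features, scaled by signal_strength.
  # Assumption: every signal feature has coefficient 1.
  k <- min(5, p)
  eta <- signal_strength * rowSums(X[, seq_len(k), drop = FALSE])

  # Assumption: pick an intercept so the mean class-1 probability
  # matches the requested prevalence.
  f <- function(b0) mean(plogis(b0 + eta)) - prevalence
  b0 <- uniroot(f, interval = c(-50, 50), extendInt = "yes")$root

  # Sample the outcome and return it as a two-level factor.
  y <- rbinom(n, size = 1, prob = plogis(b0 + eta))
  list(
    X = X,
    y = factor(y, levels = c(0, 1), labels = c("class0", "class1"))
  )
}

set.seed(42)
sim <- simulate_binary(n = 2000, p = 20, prevalence = 0.3)
mean(sim$y == "class1")  # should land close to the requested prevalence
```

Calibrating the intercept rather than thresholding the linear predictor keeps the outcome stochastic, so signal_strength controls class separation without fixing the labels deterministically.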
A LeakSimResults data frame with one row per seed and columns:
seed: seed used for data generation, splitting, and auditing.
metric_obs: observed CV performance (AUC for this simulation).
gap: permutation-gap statistic (observed minus permutation mean).
p_value: permutation p-value for the gap.
leakage: leakage scenario used.
mode: CV mode used.
Only the permutation-gap summary is returned; fitted models, predictions, and other audit components are not included.
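To make the gap and p_value columns concrete, the sketch below computes both from an observed metric and a vector of metrics obtained under permuted labels. The helper perm_gap is hypothetical, and the one-sided p-value with a +1 correction is an assumption about a common convention, not a statement of how the package computes it. The input numbers are placeholders, not simulation results.

```r
# Hypothetical helper (not part of the package).
perm_gap <- function(metric_obs, metric_perm) {
  B <- length(metric_perm)
  stopifnot(B >= 1)

  # Gap as documented: observed metric minus the mean of the permutation null.
  gap <- metric_obs - mean(metric_perm)

  # Assumption: one-sided p-value, counting permutations at least as good as
  # the observed metric, with +1 in numerator and denominator.
  p_value <- (1 + sum(metric_perm >= metric_obs)) / (B + 1)

  c(gap = gap, p_value = p_value)
}

# Placeholder values for illustration only.
out <- perm_gap(metric_obs = 0.80, metric_perm = c(0.45, 0.50, 0.55, 0.50))
out
```

Here the permutation mean is 0.50, so the gap is 0.30, and no permuted value reaches 0.80, giving a p-value of 1/5 = 0.2 under the assumed formula. With only four permutations the smallest attainable p-value is 0.2, which is why B needs to be reasonably large in practice.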
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  res <- simulate_leakage_suite(
    n = 120, p = 6, prevalence = 0.4,
    mode = "subject_grouped",
    learner = "glmnet",
    leakage = "subject_overlap",
    K = 3, repeats = 1,
    B = 50, seeds = 1,
    parallel = FALSE
  )
  # One row per seed with observed AUC, permutation gap, and p-value
  res
}