cv_boost_imputed: Cross-validation for boosting after multiple imputation (pre-imputed inputs)

View source: R/cv_boost.R

cv_boost_imputed    R Documentation

Cross-validation for boosting after multiple imputation (pre-imputed inputs)

Description

Performs k-fold cross-validation for boosting on data that have already been multiply imputed, and fits a final model on the full-data imputations. To avoid data leakage, the data for each CV fold should first be split into training and validation subsets, with imputation performed afterwards on each part separately. For the final model, the full data should be imputed independently.

Usage

cv_boost_imputed(
  X_train_list,
  y_train_list,
  X_val_list,
  y_val_list,
  X_full,
  y_full,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  show_progress = TRUE,
  center = c("auto", "off", "force")
)

Arguments

X_train_list

A list of length k. Element i is itself a list of length M containing the n_train x p numeric design matrices, one per imputed dataset, for CV fold i.

y_train_list

A list of length k. Element i is a list of length M, where each element is a length-n_train numeric response vector aligned with X_train_list[[i]][[m]].

X_val_list

A list of length k. Element i is a list of length M containing the n_val x p numeric validation matrices matched to the corresponding imputed training dataset in fold i.

y_val_list

A list of length k. Element i is a list of length M, where each element is a length-n_val numeric response vector aligned with X_val_list[[i]][[m]].

X_full

A list of length M containing the n x p numeric full-data design matrices (one per imputed dataset) used to fit the final model.

y_full

A list of length M, where each element is a length-n numeric response vector corresponding to the imputed dataset in X_full.

ny

Learning rate (boosting step size). Defaults to 0.1.

mstop

Maximum number of boosting iterations to evaluate during cross-validation. The selected mstop is the value that minimizes the mean CV error over 1:mstop. Default is 250.

type

Type of loss function. One of: "gaussian" (mean squared error) for continuous responses, or "logistic" (binomial deviance) for binary responses.

MIBoost

Logical. If TRUE, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If FALSE, variables are selected independently within each imputed dataset, and pooling is governed by pool_threshold.

pool

Logical. If TRUE, models across the M imputed datasets are aggregated into a single final model. If FALSE, M separate models are returned.

pool_threshold

Only used when MIBoost = FALSE and pool = TRUE. Controls how the M models obtained from the imputed datasets are aggregated into a single final model. A value of 0 (the default) corresponds to estimate-averaging. For values in (0, 1], a candidate variable is retained only if it is selected in at least a pool_threshold proportion of the imputed datasets; the coefficients of all other variables are set to zero (selection-frequency thresholding). See the sketch below.
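As a simplified illustration of these two pooling rules (not the package's internal code), given an M x p matrix B holding one coefficient vector per imputed dataset:

  # Illustrative pooling of an M x p coefficient matrix B (one row per imputation)
  pool_coefs <- function(B, pool_threshold = 0) {
    beta <- colMeans(B)                       # estimate-averaging
    if (pool_threshold > 0) {
      sel_freq <- colMeans(B != 0)            # per-variable selection frequency
      beta[sel_freq < pool_threshold] <- 0    # selection-frequency thresholding
    }
    beta
  }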

show_progress

Logical; print fold-level progress and summary timings. Default TRUE.

center

One of c("auto", "off", "force"). Controls centering of X. With "auto" (recommended), centering is applied only if the training data are not already centered (checked across imputations). With "force", centering is always applied. With "off", centering is skipped.

If centering is applied, a single grand mean vector μ* is computed from the training imputations in the corresponding fold and then subtracted from all imputed training and validation matrices in that fold (and analogously for the final model fit on X_full), as sketched below.
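A minimal sketch of this grand-mean centering, assuming X_imps is a list of the M imputed training matrices for one fold (X_imps and mu_star are placeholder names, and averaging the per-imputation column means is one natural definition of the grand mean, assumed here):

  # Grand mean over imputations, then subtract it from every matrix in the fold
  mu_star <- Reduce(`+`, lapply(X_imps, colMeans)) / length(X_imps)
  X_imps_centered <- lapply(X_imps, function(X) sweep(X, 2, mu_star))
  # Validation matrices in the same fold are shifted by the same mu_star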

Details

The recommended workflow is illustrated in the examples.

Centering affects only X; y is left unchanged. For type = "logistic", responses are treated as numeric 0/1 via the logistic link. Validation loss is averaged over imputations and then over folds.
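For intuition, a sketch of this loss aggregation and of the two losses (illustrative only; loss_im is a placeholder k x M matrix of validation losses at one boosting iteration, and the internal scaling may differ):

  # Average over the M imputations first, then over the k folds
  cv_error_at_m <- mean(rowMeans(loss_im))

  # Per-imputation losses for the two response types (eta = linear predictor)
  mse_loss <- function(y, eta) mean((y - eta)^2)                  # type = "gaussian"
  dev_loss <- function(y, eta) mean(log(1 + exp(eta)) - y * eta)  # type = "logistic":
                                          # mean negative log-likelihood (half deviance)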

Value

A list with:

  • CV_error: numeric vector of length mstop with the mean cross-validated loss across folds (and imputations).

  • best_mstop: integer index of the minimizing entry in CV_error.

  • final_model: numeric vector of length 1 + p containing the intercept followed by p coefficients of the final pooled model fitted at best_mstop on X_full/y_full.

  • center_means: (optional) numeric vector of length p containing the centering means used for X (when available).
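A brief sketch of how the returned pieces fit together (Xnew is a placeholder for new data on the original scale; the centering step applies only when center_means is returned, and the prediction mechanics shown are an assumption based on the layout above):

  stopifnot(res$best_mstop == which.min(res$CV_error))
  beta0 <- res$final_model[1]                # intercept
  beta  <- res$final_model[-1]               # p slope coefficients
  Xc    <- sweep(Xnew, 2, res$center_means)  # shift new data if centering was used
  eta   <- beta0 + as.numeric(Xc %*% beta)   # linear predictor for Xnew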

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807

See Also

impu_boost, cv_boost_raw

Examples



  # --- Example 1: minimal, fast setup (k = 2 folds, M = 2 imputations) ---
  set.seed(123)
  utils::data(booami_sim)
  k <- 2; M <- 2

  # Separate X and y; drop missing y (policy)
  X_all <- booami_sim[, 1:25, drop = FALSE]
  y_all <- booami_sim[, 26]
  keep <- !is.na(y_all)
  X_all <- X_all[keep, , drop = FALSE]
  y_all <- y_all[keep]

  n <- nrow(X_all); p <- ncol(X_all)
  folds <- sample(rep(seq_len(k), length.out = n))

  X_train_list <- vector("list", k)
  y_train_list <- vector("list", k)
  X_val_list   <- vector("list", k)
  y_val_list   <- vector("list", k)

  for (cv in seq_len(k)) {
    tr <- folds != cv
    va <- !tr
    Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
    Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]

    # Impute X only (y is never used for imputation)
    pm_tr  <- mice::quickpred(Xtr, method = "spearman", mincor = 0.30, minpuc = 0.60)
    imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 1, printFlag = FALSE)
    imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)

    X_train_list[[cv]] <- vector("list", M)
    y_train_list[[cv]] <- vector("list", M)
    X_val_list[[cv]]   <- vector("list", M)
    y_val_list[[cv]]   <- vector("list", M)

    for (m in seq_len(M)) {
      tr_m <- mice::complete(imp_tr, m)
      va_m <- mice::complete(imp_va, m)
      X_train_list[[cv]][[m]] <- data.matrix(tr_m)
      y_train_list[[cv]][[m]] <- ytr
      X_val_list[[cv]][[m]]   <- data.matrix(va_m)
      y_val_list[[cv]][[m]]   <- yva
    }
  }

  # Full-data imputations (X only)
  pm_full  <- mice::quickpred(X_all, method = "spearman", mincor = 0.30, minpuc = 0.60)
  imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 1, printFlag = FALSE)
  X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
  y_full <- lapply(seq_len(M), function(m) y_all)

  res <- cv_boost_imputed(
    X_train_list, y_train_list,
    X_val_list,   y_val_list,
    X_full,       y_full,
    ny = 0.1, mstop = 50, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto",
    show_progress = FALSE
  )
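  # Inspect the tuning result: CV-loss curve with the selected mstop marked
  plot(res$CV_error, type = "l",
       xlab = "Boosting iteration", ylab = "Mean CV loss")
  abline(v = res$best_mstop, lty = 2)
  head(res$final_model)  # intercept followed by the p pooled coefficients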
  



  # --- Example 2: heavier setup (k = 5 folds, M = 10 imputations); slower ---
  set.seed(2025)
  utils::data(booami_sim)
  k <- 5; M <- 10

  X_all <- booami_sim[, 1:25, drop = FALSE]
  y_all <- booami_sim[, 26]
  keep <- !is.na(y_all)
  X_all <- X_all[keep, , drop = FALSE]
  y_all <- y_all[keep]

  n <- nrow(X_all); p <- ncol(X_all)
  folds <- sample(rep(seq_len(k), length.out = n))

  X_train_list <- vector("list", k)
  y_train_list <- vector("list", k)
  X_val_list   <- vector("list", k)
  y_val_list   <- vector("list", k)

  for (cv in seq_len(k)) {
    tr <- folds != cv; va <- !tr
    Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
    Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]

    pm_tr  <- mice::quickpred(Xtr, method = "spearman", mincor = 0.20, minpuc = 0.40)
    imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 5, printFlag = TRUE)
    imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)

    X_train_list[[cv]] <- vector("list", M)
    y_train_list[[cv]] <- vector("list", M)
    X_val_list[[cv]]   <- vector("list", M)
    y_val_list[[cv]]   <- vector("list", M)

    for (m in seq_len(M)) {
      tr_m <- mice::complete(imp_tr, m)
      va_m <- mice::complete(imp_va, m)
      X_train_list[[cv]][[m]] <- data.matrix(tr_m)
      y_train_list[[cv]][[m]] <- ytr
      X_val_list[[cv]][[m]]   <- data.matrix(va_m)
      y_val_list[[cv]][[m]]   <- yva
    }
  }

  pm_full  <- mice::quickpred(X_all, method = "spearman", mincor = 0.20, minpuc = 0.40)
  imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 5, printFlag = TRUE)
  X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
  y_full <- lapply(seq_len(M), function(m) y_all)

  res_heavy <- cv_boost_imputed(
    X_train_list, y_train_list,
    X_val_list,   y_val_list,
    X_full,       y_full,
    ny = 0.1, mstop = 250, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto",
    show_progress = TRUE
  )
  str(res_heavy)


