cv_boost_imputed: Cross-validation for boosting after multiple imputation (pre-imputed inputs)

View source: R/cv_boost.R

cv_boost_imputed    R Documentation

Cross-validation for boosting after multiple imputation (pre-imputed inputs)

Description

Performs k-fold cross-validation for boosting on data that have already been multiply imputed, and fits a final model on the full-data imputations. To avoid data leakage, the data for each CV fold should first be split into training and validation subsets, with imputation performed afterwards on each part separately. For the final model, the full data should be imputed independently.

Usage

cv_boost_imputed(
  X_train_list,
  y_train_list,
  X_val_list,
  y_val_list,
  X_full,
  y_full,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  show_progress = TRUE,
  center = c("auto", "off", "force")
)

Arguments

X_train_list

A list of length k. Element i is itself a list of length M containing the n_train x p numeric design matrices, one per imputed dataset, for CV fold i.

y_train_list

A list of length k. Element i is a list of length M, where each element is a length-n_train numeric response vector aligned with X_train_list[[i]][[m]].

X_val_list

A list of length k. Element i is a list of length M containing the n_val x p numeric validation matrices matched to the corresponding imputed training dataset in fold i.

y_val_list

A list of length k. Element i is a list of length M, where each element is a length-n_val numeric response vector aligned with X_val_list[[i]][[m]].

X_full

A list of length M containing the n x p numeric full-data design matrices (one per imputed dataset) used to fit the final model.

y_full

A list of length M, where each element is a length-n numeric response vector corresponding to the imputed dataset in X_full.

ny

Learning rate (boosting step size). Defaults to 0.1.

mstop

Maximum number of boosting iterations to evaluate during cross-validation. The selected mstop is the value that minimizes the mean CV error over 1:mstop. Default is 250.

type

Type of loss function. One of: "gaussian" (mean squared error) for continuous responses, or "logistic" (binomial deviance) for binary responses.

MIBoost

Logical. If TRUE, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If FALSE, variables are selected independently within each imputed dataset, and pooling is governed by pool_threshold.

pool

Logical. If TRUE, models across the M imputed datasets are aggregated into a single final model. If FALSE, M separate models are returned.

pool_threshold

Only used when MIBoost = FALSE and pool = TRUE. Controls how the M models obtained from the imputed datasets are aggregated into a single final model. A value of 0 (the default) corresponds to estimate-averaging. For values in (0, 1], a candidate variable is retained only if it is selected in at least a pool_threshold proportion of the imputed datasets; the coefficients of all other variables are set to zero (selection-frequency thresholding). See the sketch below.
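As a simplified illustration of these two pooling rules (not the package's internal code), given an M x p matrix B holding one coefficient vector per imputed dataset:

  # Illustrative pooling of an M x p coefficient matrix B (one row per imputation)
  pool_coefs <- function(B, pool_threshold = 0) {
    beta <- colMeans(B)                       # estimate-averaging
    if (pool_threshold > 0) {
      sel_freq <- colMeans(B != 0)            # per-variable selection frequency
      beta[sel_freq < pool_threshold] <- 0    # selection-frequency thresholding
    }
    beta
  }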

show_progress

Logical; print fold-level progress and summary timings. Default TRUE.

center

One of c("auto", "off", "force"). Controls centering of X. With "auto" (recommended), centering is applied only if the training data are not already centered (checked across imputations). With "force", centering is always applied. With "off", centering is skipped.

If centering is applied, a single grand mean vector μ* is computed from the training imputations in the corresponding fold and then subtracted from all imputed training and validation matrices in that fold (and analogously for the final model fit on X_full), as sketched below.
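A minimal sketch of this grand-mean centering, assuming X_imps is a list of the M imputed training matrices for one fold (X_imps and mu_star are placeholder names, and averaging the per-imputation column means is one natural definition of the grand mean, assumed here):

  # Grand mean over imputations, then subtract it from every matrix in the fold
  mu_star <- Reduce(`+`, lapply(X_imps, colMeans)) / length(X_imps)
  X_imps_centered <- lapply(X_imps, function(X) sweep(X, 2, mu_star))
  # Validation matrices in the same fold are shifted by the same mu_star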

Details

The recommended workflow is illustrated in the examples.

Centering affects only X; y is left unchanged. For type = "logistic", responses are treated as numeric 0/1 via the logistic link. Validation loss is averaged over imputations and then over folds.
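For intuition, a sketch of this loss aggregation and of the two losses (illustrative only; loss_im is a placeholder k x M matrix of validation losses at one boosting iteration, and the internal scaling may differ):

  # Average over the M imputations first, then over the k folds
  cv_error_at_m <- mean(rowMeans(loss_im))

  # Per-imputation losses for the two response types (eta = linear predictor)
  mse_loss <- function(y, eta) mean((y - eta)^2)                  # type = "gaussian"
  dev_loss <- function(y, eta) mean(log(1 + exp(eta)) - y * eta)  # type = "logistic":
                                          # mean negative log-likelihood (half deviance)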

Value

A list with:

  • CV_error: numeric vector of length mstop with the mean cross-validated loss across folds (and imputations).

  • best_mstop: integer index of the minimizing entry in CV_error.

  • final_model: numeric vector of length 1 + p containing the intercept followed by p coefficients of the final pooled model fitted at best_mstop on X_full/y_full.

  • center_means: (optional) numeric vector of length p containing the centering means used for X (when available).
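A brief sketch of how the returned pieces fit together (Xnew is a placeholder for new data on the original scale; the centering step applies only when center_means is returned, and the prediction mechanics shown are an assumption based on the layout above):

  stopifnot(res$best_mstop == which.min(res$CV_error))
  beta0 <- res$final_model[1]                # intercept
  beta  <- res$final_model[-1]               # p slope coefficients
  Xc    <- sweep(Xnew, 2, res$center_means)  # shift new data if centering was used
  eta   <- beta0 + as.numeric(Xc %*% beta)   # linear predictor for Xnew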

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807

See Also

impu_boost, cv_boost_raw

Examples



  # --- Example 1: minimal, fast setup (k = 2 folds, M = 2 imputations) ---
  set.seed(123)
  utils::data(booami_sim)
  k <- 2; M <- 2

  # Separate X and y; drop missing y (policy)
  X_all <- booami_sim[, 1:25, drop = FALSE]
  y_all <- booami_sim[, 26]
  keep <- !is.na(y_all)
  X_all <- X_all[keep, , drop = FALSE]
  y_all <- y_all[keep]

  n <- nrow(X_all); p <- ncol(X_all)
  folds <- sample(rep(seq_len(k), length.out = n))

  X_train_list <- vector("list", k)
  y_train_list <- vector("list", k)
  X_val_list   <- vector("list", k)
  y_val_list   <- vector("list", k)

  for (cv in seq_len(k)) {
    tr <- folds != cv
    va <- !tr
    Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
    Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]

    # Impute X only (y is never used for imputation)
    pm_tr  <- mice::quickpred(Xtr, method = "spearman", mincor = 0.30, minpuc = 0.60)
    imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 1, printFlag = FALSE)
    imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)

    X_train_list[[cv]] <- vector("list", M)
    y_train_list[[cv]] <- vector("list", M)
    X_val_list[[cv]]   <- vector("list", M)
    y_val_list[[cv]]   <- vector("list", M)

    for (m in seq_len(M)) {
      tr_m <- mice::complete(imp_tr, m)
      va_m <- mice::complete(imp_va, m)
      X_train_list[[cv]][[m]] <- data.matrix(tr_m)
      y_train_list[[cv]][[m]] <- ytr
      X_val_list[[cv]][[m]]   <- data.matrix(va_m)
      y_val_list[[cv]][[m]]   <- yva
    }
  }

  # Full-data imputations (X only)
  pm_full  <- mice::quickpred(X_all, method = "spearman", mincor = 0.30, minpuc = 0.60)
  imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 1, printFlag = FALSE)
  X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
  y_full <- lapply(seq_len(M), function(m) y_all)

  res <- cv_boost_imputed(
    X_train_list, y_train_list,
    X_val_list,   y_val_list,
    X_full,       y_full,
    ny = 0.1, mstop = 50, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto",
    show_progress = FALSE
  )
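  # Inspect the tuning result: CV-loss curve with the selected mstop marked
  plot(res$CV_error, type = "l",
       xlab = "Boosting iteration", ylab = "Mean CV loss")
  abline(v = res$best_mstop, lty = 2)
  head(res$final_model)  # intercept followed by the p pooled coefficients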
  



  # --- Example 2: heavier setup (k = 5 folds, M = 10 imputations); slower ---
  set.seed(2025)
  utils::data(booami_sim)
  k <- 5; M <- 10

  X_all <- booami_sim[, 1:25, drop = FALSE]
  y_all <- booami_sim[, 26]
  keep <- !is.na(y_all)
  X_all <- X_all[keep, , drop = FALSE]
  y_all <- y_all[keep]

  n <- nrow(X_all); p <- ncol(X_all)
  folds <- sample(rep(seq_len(k), length.out = n))

  X_train_list <- vector("list", k)
  y_train_list <- vector("list", k)
  X_val_list   <- vector("list", k)
  y_val_list   <- vector("list", k)

  for (cv in seq_len(k)) {
    tr <- folds != cv; va <- !tr
    Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
    Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]

    pm_tr  <- mice::quickpred(Xtr, method = "spearman", mincor = 0.20, minpuc = 0.40)
    imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 5, printFlag = TRUE)
    imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)

    X_train_list[[cv]] <- vector("list", M)
    y_train_list[[cv]] <- vector("list", M)
    X_val_list[[cv]]   <- vector("list", M)
    y_val_list[[cv]]   <- vector("list", M)

    for (m in seq_len(M)) {
      tr_m <- mice::complete(imp_tr, m)
      va_m <- mice::complete(imp_va, m)
      X_train_list[[cv]][[m]] <- data.matrix(tr_m)
      y_train_list[[cv]][[m]] <- ytr
      X_val_list[[cv]][[m]]   <- data.matrix(va_m)
      y_val_list[[cv]][[m]]   <- yva
    }
  }

  pm_full  <- mice::quickpred(X_all, method = "spearman", mincor = 0.20, minpuc = 0.40)
  imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 5, printFlag = TRUE)
  X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
  y_full <- lapply(seq_len(M), function(m) y_all)

  res_heavy <- cv_boost_imputed(
    X_train_list, y_train_list,
    X_val_list,   y_val_list,
    X_full,       y_full,
    ny = 0.1, mstop = 250, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto",
    show_progress = TRUE
  )
  str(res_heavy)


