impu_boost: Component-Wise Gradient Boosting Across Multiply Imputed Datasets

View source: R/impu_boost.R

impu_boost R Documentation

Component-Wise Gradient Boosting Across Multiply Imputed Datasets

Description

Applies component-wise gradient boosting to multiply imputed datasets. Depending on the settings, either a separate model is returned for each of the M imputed datasets, or the M models are pooled to yield a single final model. For pooling, one can choose the novel MIBoost algorithm, which enforces a uniform variable-selection scheme across all imputed datasets, or the more conventional ad hoc approaches of estimate-averaging and selection-frequency thresholding.

Usage

impu_boost(
  X_list,
  y_list,
  X_list_val = NULL,
  y_list_val = NULL,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  center = c("auto", "force", "off")
)

Arguments

X_list

List of length M; each element is an n \times p numeric predictor matrix from one imputed dataset.

y_list

List of length M; each element is a length-n numeric response vector from one imputed dataset.

X_list_val

Optional validation list (same structure as X_list).

y_list_val

Optional validation list (same structure as y_list).

ny

Learning rate (boosting step size). Defaults to 0.1.

mstop

Number of boosting iterations (default 250).

type

Type of loss function. One of: "gaussian" (mean squared error) for continuous responses, or "logistic" (binomial deviance) for binary responses.

MIBoost

Logical. If TRUE, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If FALSE, variables are selected independently within each imputed dataset, and pooling is governed by pool_threshold.
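To make the contrast concrete, here is a minimal sketch of the uniform-selection idea, not booami's implementation: at each boosting step, every candidate predictor is scored jointly across all M imputed datasets, and a single winning component is updated in every dataset. The correlation-based score and all object names below are illustrative assumptions.

```r
## Sketch only: joint scoring of candidates across M imputed datasets.
set.seed(1)
M <- 2; n <- 20; p <- 3
X_list <- replicate(M, matrix(rnorm(n * p), n, p), simplify = FALSE)
r_list <- replicate(M, rnorm(n), simplify = FALSE)  # current residuals

score <- numeric(p)
for (j in seq_len(p)) {
  for (m in seq_len(M)) {
    # accumulate each variable's fit to the residuals over ALL datasets
    score[j] <- score[j] + cor(X_list[[m]][, j], r_list[[m]])^2
  }
}
winner <- which.max(score)  # the SAME component is updated in every dataset
```

With MIBoost = FALSE, by contrast, each dataset would pick its own `which.max` from its own score vector.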

pool

Logical. If TRUE, models across the M imputed datasets are aggregated into a single final model. If FALSE, M separate models are returned.

pool_threshold

Only used when MIBoost = FALSE and pool = TRUE. Controls the pooling rule when aggregating the M models obtained from the imputed datasets into a single final model. A candidate variable is included only if it is selected in at least a pool_threshold proportion (a value in [0, 1)) of the imputed datasets; coefficients of all other variables are set to zero. A value of 0 corresponds to estimate-averaging, while values in (0, 1) correspond to selection-frequency thresholding.
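A toy illustration of the thresholding rule, with hypothetical selection indicators (all values below are made up):

```r
## Selection indicators for p = 4 variables across M = 5 imputed
## datasets (1 = variable entered the model fitted to that dataset).
sel <- matrix(c(1, 1, 1, 1, 1,    # X1: selected in all 5
                1, 1, 1, 0, 0,    # X2: selected in 3 of 5
                0, 1, 0, 0, 0,    # X3: selected in 1 of 5
                0, 0, 0, 0, 0),   # X4: never selected
              nrow = 4, byrow = TRUE)
freq <- rowMeans(sel)             # selection frequency per variable
pool_threshold <- 0.5
keep <- freq >= pool_threshold    # X1 and X2 survive; X3 and X4 are zeroed
```

Coefficients of the retained variables are then averaged across the M per-dataset models.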

center

One of "auto", "force", or "off". Controls centering of X within each imputed dataset. With "auto" (recommended), centering is applied only if the training matrix is not already centered. With "force", centering is always applied. With "off", centering is skipped. If X_list_val is provided, each validation set is centered using the column means of the corresponding training set.
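A minimal sketch of this centering convention, assuming column-mean centering; the helper center_pair is hypothetical and not part of booami:

```r
## Each training matrix is centered by its own column means, and the
## matching validation matrix reuses THOSE means (it is never centered
## by its own means).
center_pair <- function(X_tr, X_va = NULL) {
  mu <- colMeans(X_tr)
  list(train = sweep(X_tr, 2, mu),
       val   = if (!is.null(X_va)) sweep(X_va, 2, mu) else NULL)
}
```

Reusing the training means keeps validation predictions on the same scale as the fitted model.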

Details

This function supports MIBoost, which enforces uniform variable selection across multiply imputed datasets. For full methodology, see Kuchen (2025).

Value

A list with elements:

  • INT: intercept(s). A scalar if pool = TRUE, otherwise a length-M vector.

  • BETA: coefficient estimates. A length-p vector if pool = TRUE, otherwise an M \times p matrix.

  • CV_error: vector of validation errors, one per boosting iteration (if validation data were provided), otherwise NULL.
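As a usage sketch, a pooled fit can be turned into predictions as follows; the fit values here are made up, but the shapes match the pooled output described above:

```r
## With pool = TRUE, INT is a scalar and BETA a length-p vector.
fit <- list(INT = 0.5, BETA = c(1, -2, 0))      # hypothetical pooled output
X_new <- matrix(rnorm(6), nrow = 2)             # 2 new rows, p = 3,
                                                # centered as in training
eta  <- fit$INT + as.vector(X_new %*% fit$BETA) # linear predictor
prob <- plogis(eta)                             # for type = "logistic"
```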

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807.

See Also

simulate_booami_data, cv_boost_raw, cv_boost_imputed

Examples




  set.seed(123)
  utils::data(booami_sim)

  M <- 2
  n <- nrow(booami_sim)
  x_cols <- grepl("^X\\d+$", names(booami_sim))

  tr_idx <- sample(seq_len(n), floor(0.8 * n))
  dat_tr <- booami_sim[tr_idx, , drop = FALSE]
  dat_va <- booami_sim[-tr_idx, , drop = FALSE]

  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.30, minpuc = 0.60)

  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 1, printFlag = FALSE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)

  X_list      <- vector("list", M)
  y_list      <- vector("list", M)
  X_list_val  <- vector("list", M)
  y_list_val  <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m)
    va_m <- mice::complete(imp_va, m)
    X_list[[m]]     <- data.matrix(tr_m[, x_cols, drop = FALSE])
    y_list[[m]]     <- tr_m$y
    X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
    y_list_val[[m]] <- va_m$y
  }

  fit <- impu_boost(
    X_list, y_list,
    X_list_val = X_list_val, y_list_val = y_list_val,
    ny = 0.1, mstop = 50, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto"
  )

  which.min(fit$CV_error)
  head(fit$BETA)
  fit$INT


## Not run: 
# Heavier demo (more imputed datasets and iterations; for local runs)

  set.seed(2025)
  utils::data(booami_sim)

  M <- 10
  n <- nrow(booami_sim)
  x_cols <- grepl("^X\\d+$", names(booami_sim))

  tr_idx <- sample(seq_len(n), floor(0.8 * n))
  dat_tr <- booami_sim[tr_idx, , drop = FALSE]
  dat_va <- booami_sim[-tr_idx, , drop = FALSE]

  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.20, minpuc = 0.40)

  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 5, printFlag = TRUE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)

  X_list      <- vector("list", M)
  y_list      <- vector("list", M)
  X_list_val  <- vector("list", M)
  y_list_val  <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m)
    va_m <- mice::complete(imp_va, m)
    X_list[[m]]     <- data.matrix(tr_m[, x_cols, drop = FALSE])
    y_list[[m]]     <- tr_m$y
    X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
    y_list_val[[m]] <- va_m$y
  }

  fit_heavy <- impu_boost(
    X_list, y_list,
    X_list_val = X_list_val, y_list_val = y_list_val,
    ny = 0.1, mstop = 250, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto"
  )
  str(fit_heavy)

## End(Not run)


booami documentation built on Feb. 19, 2026, 5:07 p.m.