impu_boost: Component-Wise Gradient Boosting Across Multiply Imputed Datasets

View source: R/impu_boost.R

impu_boost R Documentation

Component-Wise Gradient Boosting Across Multiply Imputed Datasets

Description

Applies component-wise gradient boosting to multiply imputed datasets. Depending on the settings, either a separate model is returned for each of the M imputed datasets, or the M models are pooled to yield a single final model. For pooling, one can choose the novel MIBoost algorithm, which enforces a uniform variable-selection scheme across all imputed datasets, or the more conventional ad hoc approaches of estimate-averaging and selection-frequency thresholding.

Usage

impu_boost(
  X_list,
  y_list,
  X_list_val = NULL,
  y_list_val = NULL,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  center = c("auto", "force", "off")
)

Arguments

X_list

List of length M; each element is an n \times p numeric predictor matrix from one imputed dataset.

y_list

List of length M; each element is a length-n numeric response vector from one imputed dataset.

X_list_val

Optional validation list (same structure as X_list).

y_list_val

Optional validation list (same structure as y_list).

ny

Learning rate (boosting step size). Defaults to 0.1.

mstop

Number of boosting iterations (default 250).

type

Type of loss function. One of: "gaussian" (mean squared error) for continuous responses, or "logistic" (binomial deviance) for binary responses.

MIBoost

Logical. If TRUE, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If FALSE, variables are selected independently within each imputed dataset, and pooling is governed by pool_threshold.
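To make the contrast concrete, here is a minimal sketch of the uniform-selection idea, not booami's implementation: at each boosting step, every candidate predictor is scored jointly across all M imputed datasets, and a single winning component is updated in every dataset. The correlation-based score and all object names below are illustrative assumptions.

```r
## Sketch only: joint scoring of candidates across M imputed datasets.
set.seed(1)
M <- 2; n <- 20; p <- 3
X_list <- replicate(M, matrix(rnorm(n * p), n, p), simplify = FALSE)
r_list <- replicate(M, rnorm(n), simplify = FALSE)  # current residuals

score <- numeric(p)
for (j in seq_len(p)) {
  for (m in seq_len(M)) {
    # accumulate each variable's fit to the residuals over ALL datasets
    score[j] <- score[j] + cor(X_list[[m]][, j], r_list[[m]])^2
  }
}
winner <- which.max(score)  # the SAME component is updated in every dataset
```

With MIBoost = FALSE, by contrast, each dataset would pick its own `which.max` from its own score vector.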

pool

Logical. If TRUE, models across the M imputed datasets are aggregated into a single final model. If FALSE, M separate models are returned.

pool_threshold

Only used when MIBoost = FALSE and pool = TRUE. Controls the pooling rule when aggregating the M models obtained from the imputed datasets into a single final model. A candidate variable is included only if it is selected in at least a pool_threshold proportion (a value in [0, 1)) of the imputed datasets; coefficients of all other variables are set to zero. A value of 0 corresponds to estimate-averaging, while values in (0, 1) correspond to selection-frequency thresholding.
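A toy illustration of the thresholding rule, with hypothetical selection indicators (all values below are made up):

```r
## Selection indicators for p = 4 variables across M = 5 imputed
## datasets (1 = variable entered the model fitted to that dataset).
sel <- matrix(c(1, 1, 1, 1, 1,    # X1: selected in all 5
                1, 1, 1, 0, 0,    # X2: selected in 3 of 5
                0, 1, 0, 0, 0,    # X3: selected in 1 of 5
                0, 0, 0, 0, 0),   # X4: never selected
              nrow = 4, byrow = TRUE)
freq <- rowMeans(sel)             # selection frequency per variable
pool_threshold <- 0.5
keep <- freq >= pool_threshold    # X1 and X2 survive; X3 and X4 are zeroed
```

Coefficients of the retained variables are then averaged across the M per-dataset models.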

center

One of "auto", "force", or "off". Controls centering of X within each imputed dataset. With "auto" (recommended), centering is applied only if the training matrix is not already centered. With "force", centering is always applied. With "off", centering is skipped. If X_list_val is provided, each validation set is centered using the column means of the corresponding training set.
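A minimal sketch of this centering convention, assuming column-mean centering; the helper center_pair is hypothetical and not part of booami:

```r
## Each training matrix is centered by its own column means, and the
## matching validation matrix reuses THOSE means (it is never centered
## by its own means).
center_pair <- function(X_tr, X_va = NULL) {
  mu <- colMeans(X_tr)
  list(train = sweep(X_tr, 2, mu),
       val   = if (!is.null(X_va)) sweep(X_va, 2, mu) else NULL)
}
```

Reusing the training means keeps validation predictions on the same scale as the fitted model.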

Details

This function supports MIBoost, which enforces uniform variable selection across multiply imputed datasets. For full methodology, see Kuchen (2025).

Value

A list with elements:

  • INT: intercept(s). A scalar if pool = TRUE, otherwise a length-M vector.

  • BETA: coefficient estimates. A length-p vector if pool = TRUE, otherwise an M \times p matrix.

  • CV_error: vector of validation errors, one per boosting iteration (if validation data were provided), otherwise NULL.
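As a usage sketch, a pooled fit can be turned into predictions as follows; the fit values here are made up, but the shapes match the pooled output described above:

```r
## With pool = TRUE, INT is a scalar and BETA a length-p vector.
fit <- list(INT = 0.5, BETA = c(1, -2, 0))      # hypothetical pooled output
X_new <- matrix(rnorm(6), nrow = 2)             # 2 new rows, p = 3,
                                                # centered as in training
eta  <- fit$INT + as.vector(X_new %*% fit$BETA) # linear predictor
prob <- plogis(eta)                             # for type = "logistic"
```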

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807.

See Also

simulate_booami_data, cv_boost_raw, cv_boost_imputed

Examples




  set.seed(123)
  utils::data(booami_sim)

  M <- 2
  n <- nrow(booami_sim)
  x_cols <- grepl("^X\\d+$", names(booami_sim))

  tr_idx <- sample(seq_len(n), floor(0.8 * n))
  dat_tr <- booami_sim[tr_idx, , drop = FALSE]
  dat_va <- booami_sim[-tr_idx, , drop = FALSE]

  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.30, minpuc = 0.60)

  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 1, printFlag = FALSE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)

  X_list      <- vector("list", M)
  y_list      <- vector("list", M)
  X_list_val  <- vector("list", M)
  y_list_val  <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m)
    va_m <- mice::complete(imp_va, m)
    X_list[[m]]     <- data.matrix(tr_m[, x_cols, drop = FALSE])
    y_list[[m]]     <- tr_m$y
    X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
    y_list_val[[m]] <- va_m$y
  }

  fit <- impu_boost(
    X_list, y_list,
    X_list_val = X_list_val, y_list_val = y_list_val,
    ny = 0.1, mstop = 50, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto"
  )

  which.min(fit$CV_error)
  head(fit$BETA)
  fit$INT


## Not run: 
# Heavier demo (more imputed datasets and iterations; for local runs)

  set.seed(2025)
  utils::data(booami_sim)

  M <- 10
  n <- nrow(booami_sim)
  x_cols <- grepl("^X\\d+$", names(booami_sim))

  tr_idx <- sample(seq_len(n), floor(0.8 * n))
  dat_tr <- booami_sim[tr_idx, , drop = FALSE]
  dat_va <- booami_sim[-tr_idx, , drop = FALSE]

  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.20, minpuc = 0.40)

  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 5, printFlag = TRUE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)

  X_list      <- vector("list", M)
  y_list      <- vector("list", M)
  X_list_val  <- vector("list", M)
  y_list_val  <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m)
    va_m <- mice::complete(imp_va, m)
    X_list[[m]]     <- data.matrix(tr_m[, x_cols, drop = FALSE])
    y_list[[m]]     <- tr_m$y
    X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
    y_list_val[[m]] <- va_m$y
  }

  fit_heavy <- impu_boost(
    X_list, y_list,
    X_list_val = X_list_val, y_list_val = y_list_val,
    ny = 0.1, mstop = 250, type = "gaussian",
    MIBoost = TRUE, pool = TRUE, center = "auto"
  )
  str(fit_heavy)

## End(Not run)


booami documentation built on Feb. 19, 2026, 5:07 p.m.