cv_boost_raw: Cross-Validated Component-Wise Gradient Boosting with...
In booami: Component-Wise Gradient Boosting after Multiple Imputation


cv_boost_raw

R Documentation

Cross-Validated Component-Wise Gradient Boosting with Multiple Imputation Performed Inside Each Fold

Description

Performs k-fold cross-validation for impu_boost on data with missing values. Within each fold, multiple imputation, centering, model fitting, and validation are performed in a leakage-avoiding manner to select the optimal number of boosting iterations (mstop). The final model is then fitted on multiple imputations of the full dataset at the selected stopping iteration.

Usage

cv_boost_raw(
  X,
  y,
  k = 5,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  impute_args = list(m = 10, maxit = 5, printFlag = FALSE),
  impute_method = NULL,
  use_quickpred = TRUE,
  quickpred_args = list(mincor = 0.1, minpuc = 0.5, method = NULL, include = NULL,
    exclude = NULL),
  seed = 123,
  show_progress = TRUE,
  return_full_imputations = FALSE,
  center = "auto"
)

Arguments

`X`	A data.frame or matrix of predictors of size `n \times p` containing missing values. Column names are preserved. Rows with missing outcomes `y` are removed before CV. If no missing values are present in `X` after removing missing-`y` rows, use `cv_boost_imputed` instead.
`y`	A vector of length `n` with the outcome (numeric for `type = "gaussian"`; numeric `0/1` or a 2-level factor for `type = "logistic"`). Must align with `X` rows. Rows with missing `y` are removed before CV. The outcome is never imputed and is not used as a predictor in imputation models.
`k`	Number of cross-validation folds. Default is `5`.
`ny`	Learning rate. Defaults to `0.1`.
`mstop`	Maximum number of boosting iterations to evaluate during cross-validation. The selected `mstop` is the value minimizing the mean CV error over `1:mstop`. Default is `250`.
`type`	Type of loss function. One of: `"gaussian"` (mean squared error) for continuous responses, or `"logistic"` (binomial deviance) for binary responses.
`MIBoost`	Logical. If `TRUE`, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If `FALSE`, variables are selected independently within each imputed dataset, and pooling is governed by `pool_threshold`.
`pool`	Logical. If `TRUE`, models across the `M` imputed datasets are aggregated into a single final model. If `FALSE`, `M` separate models are returned.
`pool_threshold`	Only used when `MIBoost = FALSE` and `pool = TRUE`. Controls the pooling rule when aggregating the `M` models obtained from the imputed datasets into a single final model. A candidate variable is included only if it is selected in at least a proportion `pool_threshold` (a value between 0 and 1) of the imputed datasets; coefficients of all other variables are set to zero. A value of `0` corresponds to estimate-averaging, while values `> 0` correspond to selection-frequency thresholding.
`impute_args`	A named list of arguments forwarded to `mice::mice()` both inside CV and on the full dataset (e.g., `m`, `maxit`, `seed`, `printFlag`, etc.). Internally, `data`, `predictorMatrix`, and `ignore` are set by the function and will override any values supplied here. If `m` is missing, a default of `10` is used.
`impute_method`	Optional named character vector passed to `mice::mice(method = ...)` to control per-variable methods for covariates `X` (e.g., `"pmm"`, `"logreg"`). This may be a partial vector: entries are merged by name into a full default method vector derived from `X`; unmatched names are ignored with a warning. If `NULL` (default), numeric covariates use `"pmm"`.
`use_quickpred`	Logical. If `TRUE` (default), build the `predictorMatrix` via `mice::quickpred()` on the training covariates within each fold; otherwise use `mice::make.predictorMatrix()`.
`quickpred_args`	A named list of arguments forwarded to `mice::quickpred()` (e.g., `mincor`, `minpuc`, `method`, `include`, `exclude`). Ignored when `use_quickpred = FALSE`.
`seed`	Base random seed for fold assignment. If `impute_args$seed` is not supplied, this value also seeds imputation; otherwise the user-specified `impute_args$seed` is respected and deterministically offset per fold. RNG state is restored on exit. Default `123`.
`show_progress`	Logical. If `TRUE` (default), print progress for the imputation and boosting phases, plus a summary at completion.
`return_full_imputations`	Logical. If `TRUE`, attach the list of full-data imputations used for the final fit as `$full_imputations = list(X = <list length m>, y = <list length m>)`. Default is `FALSE`.
`center`	One of `c("auto", "off", "force")`. Controls centering of `X`. With `"auto"` (recommended), centering is applied only if the training data are not already centered (checked across imputations). With `"force"`, centering is always applied. With `"off"`, centering is skipped. If centering is applied, a single grand mean vector `\mu_\star` is computed from the imputed training covariates in the corresponding fold and subtracted from all imputed training and validation matrices in that fold (and analogously for the final model fit on the full-data imputations).
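
The pooling rule governed by `pool_threshold` can be illustrated with a minimal base-R sketch. The coefficient matrix `B` and the helper `pool_coefs` are hypothetical and not part of booami. The sketch assumes that "selected" means a nonzero coefficient and that pooled estimates are plain averages over all `M` fits; the package's internal rule may differ in detail.

```r
# Hypothetical helper, not part of booami: pool M coefficient vectors.
# B is an M x p matrix with one row of coefficients per imputed dataset.
pool_coefs <- function(B, pool_threshold = 0) {
  sel_freq <- colMeans(B != 0)             # share of imputations selecting each variable
  pooled   <- colMeans(B)                  # assumption: average estimates over all M fits
  pooled[sel_freq < pool_threshold] <- 0   # zero out variables below the threshold
  pooled
}

# Toy coefficients from M = 3 imputed datasets and p = 3 covariates
B <- rbind(c(0.5, 0.0, 0.2),
           c(0.7, 0.0, 0.0),
           c(0.6, 0.3, 0.0))
colnames(B) <- c("X1", "X2", "X3")

avg <- pool_coefs(B, pool_threshold = 0)    # estimate-averaging: nothing is dropped
thr <- pool_coefs(B, pool_threshold = 0.5)  # only X1 is selected in at least half the fits
```

With a threshold of `0` no variable can fall below it, so every averaged coefficient is kept; raising the threshold trades sparsity against the risk of dropping weakly supported variables.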

Details

Rows with missing outcomes y are removed before fold assignment. Within each CV fold, the remaining data are first split into a training subset and a validation subset. Multiple imputation is then performed on the covariates X only (the outcome is never imputed and is not used as a predictor in the imputation models). The training covariates are multiply imputed M times using mice, producing M imputed training datasets. The corresponding validation covariates are then imputed M times using the imputation models learned from the training data (leakage-avoiding).
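
The train-only imputation step can be sketched with the `ignore` argument of `mice::mice()`, which the function is documented to set internally. The toy data, the single train/validation split, and all variable names below are illustrative assumptions; this is not the package's internal code.

```r
# Sketch only: requires the mice package; toy data and names are illustrative.
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1)
  n <- 60
  X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  X$x1[sample(n, 10)] <- NA               # introduce missing values
  X$x2[sample(n, 10)] <- NA

  val_rows <- seq_len(n) > 40             # last 20 rows act as the validation fold

  # Rows flagged by `ignore` are still imputed, but they do not
  # contribute to fitting the imputation models.
  imp <- mice::mice(X, m = 2, maxit = 1, ignore = val_rows,
                    printFlag = FALSE, seed = 1)

  # Split each completed dataset back into training and validation parts
  completed <- lapply(seq_len(imp$m), function(i) mice::complete(imp, i))
  X_train   <- lapply(completed, function(d) d[!val_rows, ])
  X_val     <- lapply(completed, function(d) d[val_rows, ])
}
```

Because the outcome is not part of `X`, it cannot enter the imputation models in this setup.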

If centering is applied, a single grand mean vector \mu_\star is computed from the imputed training covariates in the corresponding fold and subtracted from all imputed training and validation covariate matrices in that fold.
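
A minimal base-R sketch of this centering step follows. The lists `train_imps` and `val_imps` are hypothetical stand-ins for the M imputed matrices of one fold, and the grand mean is assumed here to be the average of the column means over the M training imputations.

```r
# Hypothetical toy inputs: lists of M imputed training / validation matrices.
set.seed(1)
M <- 3
train_imps <- lapply(seq_len(M), function(i) matrix(rnorm(20, mean = 5), nrow = 10, ncol = 2))
val_imps   <- lapply(seq_len(M), function(i) matrix(rnorm(8,  mean = 5), nrow = 4,  ncol = 2))

# Grand mean vector (assumption: average of column means across the M training imputations)
mu_star <- Reduce(`+`, lapply(train_imps, colMeans)) / M

# Subtract the same training-based vector from every training and validation matrix
center_with <- function(X, mu) sweep(X, 2, mu, FUN = "-")
train_c <- lapply(train_imps, center_with, mu = mu_star)
val_c   <- lapply(val_imps,   center_with, mu = mu_star)
```

Using training-derived means for the validation matrices keeps the validation data from influencing the preprocessing.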

impu_boost is run on the imputed training datasets for up to mstop boosting iterations. At each iteration, prediction errors are computed on the corresponding validation datasets and averaged across imputations. This yields an aggregated error curve per fold, which is then averaged across folds. The optimal stopping iteration is chosen as the mstop value minimizing the mean CV error.
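
The aggregation of validation errors can be sketched in base R as follows. The array `err` holds made-up validation losses indexed by iteration, imputation, and fold; its layout and the unweighted average over folds are assumptions made for illustration.

```r
k <- 5; M <- 3; mstop <- 50
iters <- seq_len(mstop)
base_curve <- (iters - 20)^2 / 400 + 1     # toy U-shaped error curve, minimum at iteration 20

# Hypothetical validation errors: err[iteration, imputation, fold]
err <- array(NA_real_, dim = c(mstop, M, k))
for (f in seq_len(k)) {
  for (m in seq_len(M)) {
    err[, m, f] <- base_curve + 0.1 * f + 0.01 * m   # fold- and imputation-specific offsets
  }
}

fold_curves <- apply(err, c(1, 3), mean)   # mstop x k: average over imputations within each fold
CV_error    <- rowMeans(fold_curves)       # length mstop: average over folds
best_mstop  <- which.min(CV_error)         # iteration with the smallest mean CV error
```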

Finally, the full covariate matrix X is multiply imputed M times. If centering is applied, it uses a grand mean \mu_\star computed across the M full-data imputations. impu_boost is applied to these datasets for the selected number of boosting iterations to obtain the final model.

Imputation control. All key mice settings can be passed via impute_args (a named list forwarded to mice::mice()) and/or impute_method (a named character vector of per-variable methods). Internally, the function builds a full default method vector from the actual data given to mice(), then merges any user-supplied entries by name. The names in impute_method must exactly match the column names in X (i.e., the data passed to mice()). Partial vectors are allowed; variables not listed fall back to defaults; unknown names are ignored with a warning. The function sets and may override data, method (after merging overrides), predictorMatrix, and ignore (to enforce train-only learning). Predictor matrices can be built with mice::quickpred() (see use_quickpred, quickpred_args) or with mice::make.predictorMatrix().
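
The by-name merge of a partial `impute_method` vector can be sketched as below. `merge_methods` is a hypothetical helper written for illustration, not the package's internal function; in practice the default vector could come from `mice::make.method()`.

```r
# Hypothetical helper, not the package's internal code.
merge_methods <- function(default_method, impute_method = NULL) {
  if (is.null(impute_method)) return(default_method)
  unknown <- setdiff(names(impute_method), names(default_method))
  if (length(unknown) > 0) {
    warning("Ignoring unknown variables in impute_method: ",
            paste(unknown, collapse = ", "))
  }
  # Overwrite defaults only for names that match a column exactly
  known <- intersect(names(impute_method), names(default_method))
  default_method[known] <- impute_method[known]
  default_method
}

defaults <- c(X1 = "pmm", X2 = "pmm", X3 = "logreg")

# X2 is overridden; Z9 matches no column and is dropped (the warning is suppressed here)
merged <- suppressWarnings(merge_methods(defaults, c(X2 = "norm", Z9 = "pmm")))
```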

Value

A list with:

CV_error: numeric vector (length mstop) of mean CV loss.
best_mstop: integer index minimizing CV_error.
final_model: numeric vector of length 1 + p with the intercept and pooled coefficients of the final fit on full-data imputations at best_mstop.
full_imputations: (optional) returned only when return_full_imputations = TRUE; a list of the form list(X = <list length m>, y = <list length m>) containing the full-data imputations used for the final model.
folds: integer vector of length n giving the CV fold id for each observation (1..k).
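
As a hedged illustration of how these components might be inspected, the sketch below uses a mock result list with made-up values rather than real output. It assumes pool = TRUE, so that final_model is a single vector, and it treats a nonzero coefficient as a selected variable.

```r
# Mock result with the documented fields; all values are invented for illustration.
res <- list(
  CV_error    = c(1.30, 1.12, 1.05, 1.08),
  best_mstop  = 3L,
  final_model = c(0.25, 0.8, 0, -0.4),   # intercept followed by p = 3 coefficients
  folds       = c(1L, 2L, 1L, 2L)
)

intercept <- res$final_model[1]
coefs     <- res$final_model[-1]
selected  <- which(coefs != 0)            # positions of covariates with nonzero coefficients
min_error <- res$CV_error[res$best_mstop] # mean CV loss at the selected iteration
```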

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807

Examples



  utils::data(booami_sim)
  X <- booami_sim[, 1:25]
  y <- booami_sim[, 26]

  res <- cv_boost_raw(
    X = X, y = y,
    k = 2, seed = 123,
    impute_args    = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
    quickpred_args = list(mincor = 0.30, minpuc = 0.60),
    mstop = 50,
    show_progress = FALSE
  )

  # Partial custom imputation method override (X variables only)
  meth <- c(X1 = "pmm")
  res2 <- cv_boost_raw(
    X = X, y = y,
    k = 2, seed = 123,
    impute_args    = list(m = 2, maxit = 1, printFlag = FALSE, seed = 456),
    quickpred_args = list(mincor = 0.30, minpuc = 0.60),
    mstop = 50,
    impute_method  = meth,
    show_progress = FALSE
  )

booami documentation built on Feb. 19, 2026, 5:07 p.m.