cv_boost_raw    R Documentation

Cross-Validated Component-Wise Gradient Boosting with Multiple Imputation Performed Inside Each Fold

Description

Performs k-fold cross-validation for impu_boost on data with missing values. Within each fold, multiple imputation, centering, model fitting, and validation are performed in a leakage-avoiding manner to select the optimal number of boosting iterations (mstop). The final model is then fitted on multiple imputations of the full dataset at the selected stopping iteration.

Usage

cv_boost_raw(
  X,
  y,
  k = 5,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  impute_args = list(m = 10, maxit = 5, printFlag = FALSE),
  impute_method = NULL,
  use_quickpred = TRUE,
  quickpred_args = list(mincor = 0.1, minpuc = 0.5, method = NULL, include = NULL,
    exclude = NULL),
  seed = 123,
  show_progress = TRUE,
  return_full_imputations = FALSE,
  center = "auto"
)

Arguments

X

A data.frame or matrix of predictors of size n \times p containing missing values. Column names are preserved. Rows with missing outcomes y are removed before CV. If no missing values are present in X after removing missing-y rows, use cv_boost_imputed instead.

y

A vector of length n with the outcome (numeric for type = "gaussian"; numeric 0/1 or a 2-level factor for type = "logistic"). Must align with X rows. Rows with missing y are removed before CV. The outcome is never imputed and is not used as a predictor in imputation models.

k

Number of cross-validation folds. Default is 5.

ny

Learning rate. Defaults to 0.1.

mstop

Maximum number of boosting iterations to evaluate during cross-validation. The selected mstop is the value minimizing the mean CV error over 1:mstop. Default is 250.

type

Type of loss function. One of: "gaussian" (mean squared error) for continuous responses, or "logistic" (binomial deviance) for binary responses.

MIBoost

Logical. If TRUE, applies the MIBoost algorithm, which enforces uniform variable selection across all imputed datasets. If FALSE, variables are selected independently within each imputed dataset, and pooling is governed by pool_threshold.

pool

Logical. If TRUE, models across the M imputed datasets are aggregated into a single final model. If FALSE, M separate models are returned.

pool_threshold

Only used when MIBoost = FALSE and pool = TRUE. Controls the pooling rule when aggregating the M models obtained from the imputed datasets into a single final model. A candidate variable is included only if it is selected in at least a proportion pool_threshold (a value between 0 and 1) of the imputed datasets; coefficients of all other variables are set to zero. A value of 0 corresponds to estimate-averaging, while values > 0 correspond to selection-frequency thresholding.

impute_args

A named list of arguments forwarded to mice::mice() both inside CV and on the full dataset (e.g., m, maxit, seed, printFlag). Internally, data, predictorMatrix, and ignore are set by the function and override any values supplied here. If m is not supplied, a default of 10 is used.

impute_method

Optional named character vector passed to mice::mice(method = ...) to control per-variable methods for covariates X (e.g., "pmm", "logreg"). This may be a partial vector: entries are merged by name into a full default method vector derived from X; unmatched names are ignored with a warning. If NULL (default), numeric covariates use "pmm".

use_quickpred

Logical. If TRUE (default), build the predictorMatrix via mice::quickpred() on the training covariates within each fold; otherwise use mice::make.predictorMatrix().

quickpred_args

A named list of arguments forwarded to mice::quickpred() (e.g., mincor, minpuc, method, include, exclude). Ignored when use_quickpred = FALSE.

seed

Base random seed for fold assignment. If impute_args$seed is not supplied, this value also seeds imputation; otherwise the user-specified impute_args$seed is respected and deterministically offset per fold. The RNG state is restored on exit. Default is 123.

show_progress

Logical. If TRUE (default), print progress for the imputation and boosting phases, plus a summary at completion.

return_full_imputations

Logical. If TRUE, attach the list of full-data imputations used for the final fit as $full_imputations = list(X = <list length m>, y = <list length m>). Default is FALSE.

center

One of c("auto", "off", "force"). Controls centering of X. With "auto" (recommended), centering is applied only if the training data are not already centered (checked across imputations). With "force", centering is always applied. With "off", centering is skipped.

If centering is applied, a single grand mean vector \mu_\star is computed from the imputed training covariates in the corresponding fold and subtracted from all imputed training and validation matrices in that fold (and analogously for the final model fit on the full-data imputations).

Details

Rows with missing outcomes y are removed before fold assignment. Within each CV fold, the remaining data are first split into a training subset and a validation subset. Multiple imputation is then performed on the covariates X only (the outcome is never imputed and is not used as a predictor in the imputation models). The training covariates are multiply imputed M times using mice, producing M imputed training datasets. The corresponding validation covariates are then imputed M times using the imputation models learned from the training data (leakage-avoiding).
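The train-only imputation step for a single fold can be sketched with the ignore argument of mice::mice(), which excludes the flagged rows from model estimation while still imputing them. This is an illustration on hypothetical toy data, not the package's internal code; the object names (is_val, X_imp1, and so on) are assumptions made for the example.

```r
# Sketch only: leakage-avoiding imputation for ONE fold via mice's `ignore`.
# Toy data and object names are hypothetical, not taken from booami.
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1)
  n <- 60
  X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  X$x2 <- X$x2 + 0.5 * X$x1
  X$x1[sample(n, 8)] <- NA
  X$x2[sample(n, 8)] <- NA

  # Rows 41..60 play the role of the validation subset of this fold
  is_val <- seq_len(n) > 40

  # Rows with ignore = TRUE do not inform the imputation models,
  # but their missing values are still filled in
  imp <- mice::mice(X, m = 2, maxit = 2, ignore = is_val,
                    printFlag = FALSE, seed = 1)

  # First completed dataset, split back into training and validation parts
  X_imp1   <- mice::complete(imp, 1)
  X_train1 <- X_imp1[!is_val, ]
  X_val1   <- X_imp1[is_val, ]
}
```

Repeating the extraction for each of the m completed datasets would give the M imputed training and validation sets described above.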

If centering is applied, a single grand mean vector \mu_\star is computed from the imputed training covariates in the corresponding fold and subtracted from all imputed training and validation covariate matrices in that fold.
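A minimal sketch of this centering step, assuming the grand mean is the column-wise mean over all stacked imputed training matrices (the exact computation used by the package is not shown in this page). The lists train_list and val_list are hypothetical stand-ins for the imputed matrices of one fold.

```r
# Sketch: grand-mean centering across M imputed datasets of one fold.
# train_list / val_list are hypothetical stand-ins for imputed matrices.
set.seed(1)
M <- 3
make_mat <- function(n) {
  matrix(rnorm(n * 2, mean = 5), nrow = n, ncol = 2,
         dimnames = list(NULL, c("x1", "x2")))
}
train_list <- lapply(seq_len(M), function(m) make_mat(20))
val_list   <- lapply(seq_len(M), function(m) make_mat(10))

# One mean per column, pooled over all imputed TRAINING matrices only
mu_star <- colMeans(do.call(rbind, train_list))

# Subtract the same vector from every training and validation matrix
center_with <- function(X, mu) sweep(X, 2, mu, FUN = "-")
train_c <- lapply(train_list, center_with, mu = mu_star)
val_c   <- lapply(val_list,   center_with, mu = mu_star)
```

Because the mean comes from the training imputations alone, the validation matrices are shifted by the same vector but are not themselves mean-zero in general.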

impu_boost is run on the imputed training datasets for up to mstop boosting iterations. At each iteration, prediction errors are computed on the corresponding validation datasets and averaged across imputations. This yields an aggregated error curve per fold, which is then averaged across folds. The optimal stopping iteration is chosen as the mstop value minimizing the mean CV error.
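The selection of the stopping iteration can be illustrated as follows. The per-fold error curves here are simulated and purely hypothetical; only the averaging and the which.min() step mirror the description above.

```r
# Sketch: choosing the stopping iteration from per-fold error curves.
# fold_err is simulated; in practice each row would already be the
# imputation-averaged validation error of one fold.
set.seed(1)
k     <- 5
mstop <- 250
iter  <- seq_len(mstop)

# Rows = folds, columns = boosting iterations 1..mstop
fold_err <- t(sapply(seq_len(k), function(f) {
  1 / iter + iter / 5000 + rnorm(1, sd = 0.01)
}))

# Mean CV error per iteration, then the iteration with the smallest error
CV_error   <- colMeans(fold_err)
best_mstop <- which.min(CV_error)
```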

Finally, the full covariate matrix X is multiply imputed M times. If centering is applied, it uses a grand mean \mu_\star computed across the M full-data imputations. impu_boost is applied to these datasets for the selected number of boosting iterations to obtain the final model.
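When MIBoost = FALSE and pool = TRUE, the M fitted models are combined according to pool_threshold. The sketch below shows one plausible reading of that rule on a hypothetical coefficient matrix; the helper pool_coefs() is invented for illustration, and averaging the retained coefficients over all M models (rather than only over the models that selected them) is an assumption. The intercept is left out for brevity.

```r
# Sketch: selection-frequency thresholding followed by averaging.
# coefs is hypothetical: rows = M imputed datasets, columns = variables.
coefs <- matrix(
  c(0.5, 0.0, 0.2,
    0.4, 0.1, 0.0,
    0.6, 0.0, 0.0,
    0.5, 0.0, 0.4),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("x1", "x2", "x3"))
)

pool_coefs <- function(coefs, pool_threshold = 0) {
  sel_freq <- colMeans(coefs != 0)        # share of models selecting each variable
  pooled   <- colMeans(coefs)             # plain estimate-averaging
  pooled[sel_freq < pool_threshold] <- 0  # drop rarely selected variables
  pooled
}

pooled_avg <- pool_coefs(coefs, pool_threshold = 0)    # keeps every variable
pooled_thr <- pool_coefs(coefs, pool_threshold = 0.5)  # zeroes x2 (selected in 1 of 4)
```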

Imputation control. All key mice settings can be passed via impute_args (a named list forwarded to mice::mice()) and/or impute_method (a named character vector of per-variable methods). Internally, the function builds a full default method vector from the actual data given to mice(), then merges any user-supplied entries by name. The names in impute_method must exactly match the column names in X (i.e., the data passed to mice()). Partial vectors are allowed; variables not listed fall back to defaults; unknown names are ignored with a warning. The function sets and may override data, method (after merging overrides), predictorMatrix, and ignore (to enforce train-only learning). Predictor matrices can be built with mice::quickpred() (see use_quickpred, quickpred_args) or with mice::make.predictorMatrix().
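The by-name merge of a partial impute_method vector could look roughly like this. merge_methods() is a hypothetical helper written for this illustration and is not part of booami; the default vector is likewise made up.

```r
# Sketch: merging a partial, named method vector into a full default vector.
# merge_methods() is a hypothetical helper, not a booami function.
merge_methods <- function(default_method, user_method = NULL) {
  if (is.null(user_method)) return(default_method)
  unknown <- setdiff(names(user_method), names(default_method))
  if (length(unknown) > 0) {
    warning("Ignoring impute_method entries for unknown columns: ",
            paste(unknown, collapse = ", "))
  }
  known <- intersect(names(user_method), names(default_method))
  default_method[known] <- user_method[known]
  default_method
}

default_method <- c(X1 = "pmm", X2 = "pmm", X3 = "pmm")

# X2 is overridden; Z9 matches no column and is dropped
# (the warning is suppressed here only to keep the example output quiet)
merged <- suppressWarnings(
  merge_methods(default_method, c(X2 = "norm", Z9 = "logreg"))
)
```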

Value

A list with:

  • CV_error: numeric vector (length mstop) of mean CV loss.

  • best_mstop: integer index minimizing CV_error.

  • final_model: numeric vector of length 1 + p with the intercept and pooled coefficients of the final fit on full-data imputations at best_mstop.

  • full_imputations: (optional) returned only when return_full_imputations = TRUE; a list of the form list(X = <list length m>, y = <list length m>) containing the full-data imputations used for the final model.

  • folds: integer vector of length n giving the CV fold id for each observation (1..k).

References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807

See Also

impu_boost, cv_boost_imputed, mice

Examples



  utils::data(booami_sim)
  X <- booami_sim[, 1:25]
  y <- booami_sim[, 26]

  res <- cv_boost_raw(
    X = X, y = y,
    k = 2, seed = 123,
    impute_args    = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
    quickpred_args = list(mincor = 0.30, minpuc = 0.60),
    mstop = 50,
    show_progress = FALSE
  )

  # Partial custom imputation method override (X variables only)
  meth <- c(X1 = "pmm")
  res2 <- cv_boost_raw(
    X = X, y = y,
    k = 2, seed = 123,
    impute_args    = list(m = 2, maxit = 1, printFlag = FALSE, seed = 456),
    quickpred_args = list(mincor = 0.30, minpuc = 0.60),
    mstop = 50,
    impute_method  = meth,
    show_progress = FALSE
  )
  



booami documentation built on Feb. 19, 2026, 5:07 p.m.