cv_boost_raw                                                R Documentation
Description

Performs k-fold cross-validation for impu_boost on data with
missing values. Within each fold, multiple imputation, centering, model
fitting, and validation are performed in a leakage-avoiding manner to select
the optimal number of boosting iterations (mstop). The final model is
then fitted on multiple imputations of the full dataset at the selected
stopping iteration.
Usage

cv_boost_raw(
  X,
  y,
  k = 5,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  impute_args = list(m = 10, maxit = 5, printFlag = FALSE),
  impute_method = NULL,
  use_quickpred = TRUE,
  quickpred_args = list(mincor = 0.1, minpuc = 0.5, method = NULL,
                        include = NULL, exclude = NULL),
  seed = 123,
  show_progress = TRUE,
  return_full_imputations = FALSE,
  center = "auto"
)
Arguments

X
  A data.frame or matrix of predictors of size n x p; may contain
  missing values.

y
  A vector of length n containing the outcome. Rows with missing y are
  removed before fold assignment.

k
  Number of cross-validation folds. Default is 5.

ny
  Learning rate of the boosting algorithm. Default is 0.1.

mstop
  Maximum number of boosting iterations to evaluate during
  cross-validation. The selected stopping iteration is the one
  minimizing the mean CV error.

type
  Type of loss function. One of "gaussian" (continuous outcome) or
  "logistic" (binary outcome).

MIBoost
  Logical. If TRUE, the MIBoost algorithm is used, which enforces a
  common variable-selection path across the imputed datasets. Default
  is TRUE.

pool
  Logical. If TRUE, the coefficients of the final model are pooled
  across the imputed datasets. Default is TRUE.

pool_threshold
  Only used when pool = TRUE; threshold applied when pooling
  coefficients across imputations. Default is 0.

impute_args
  A named list of arguments forwarded to mice::mice(). Default is
  list(m = 10, maxit = 5, printFlag = FALSE).

impute_method
  Optional named character vector of per-variable imputation methods
  passed to mice(); names must match the column names of X. Partial
  vectors are allowed (see Details).

use_quickpred
  Logical. If TRUE, the predictor matrix for mice() is built with
  mice::quickpred(). Default is TRUE.

quickpred_args
  A named list of arguments forwarded to mice::quickpred(). Default is
  list(mincor = 0.1, minpuc = 0.5, method = NULL, include = NULL,
  exclude = NULL).

seed
  Base random seed for fold assignment. Default is 123.

show_progress
  Logical. If TRUE, progress information is printed during
  cross-validation. Default is TRUE.

return_full_imputations
  Logical. If TRUE, the full-data imputations used for the final model
  are returned (see Value). Default is FALSE.

center
  Controls centering of the covariates; default "auto". If centering is
  applied, a single grand mean vector computed from the imputed data is
  subtracted (see Details).
Details

Rows with missing outcomes y are removed before fold assignment.
Within each CV fold, the remaining data are first split into a training subset
and a validation subset. Multiple imputation is then performed on the
covariates X only (the outcome is never imputed and is not used as a
predictor in the imputation models). The training covariates are multiply
imputed M times using mice, producing M imputed training
datasets. The corresponding validation covariates are then imputed M
times using the imputation models learned from the training data (leakage-avoiding).
If centering is applied, a single grand mean vector \mu_\star is
computed from the imputed training covariates in the corresponding fold and
subtracted from all imputed training and validation covariate matrices
in that fold.
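The fold-wise procedure above can be sketched with mice directly. This is an illustrative sketch, not the package's internal code: rows flagged via mice's ignore argument are still imputed, but are excluded when fitting the imputation models, so those models are learned from the training rows only.

```r
library(mice)

## One hypothetical CV fold: stack training and validation covariates and
## flag the validation rows with 'ignore = TRUE' (leakage-avoiding).
set.seed(1)
X <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
X$x1[sample(40, 8)] <- NA                    # inject missingness
val_rows <- 31:40                            # hypothetical validation fold
imp <- mice(X, m = 2, maxit = 1,
            ignore = seq_len(nrow(X)) %in% val_rows,
            printFlag = FALSE)

## Grand mean over the imputed training covariates, pooled across the
## M = 2 imputations, then subtracted from training and validation data.
train_imps  <- lapply(1:2, function(i) complete(imp, i)[-val_rows, ])
mu_star     <- colMeans(do.call(rbind, train_imps))
X_train_ctr <- sweep(train_imps[[1]], 2, mu_star)
X_val_ctr   <- sweep(complete(imp, 1)[val_rows, ], 2, mu_star)
```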
impu_boost is run on the imputed training datasets for up to
mstop boosting iterations. At each iteration, prediction errors are
computed on the corresponding validation datasets and averaged across
imputations. This yields an aggregated error curve per fold, which is then
averaged across folds. The optimal stopping iteration is chosen as the
mstop value minimizing the mean CV error.
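The aggregation step can be illustrated with a toy example. Here err is a synthetic k x mstop matrix in which err[f, m] stands for the validation error of fold f at iteration m, already averaged over the M imputations within that fold; the values are made up for illustration.

```r
## Synthetic U-shaped per-fold error curves.
set.seed(2)
k <- 5; mstop <- 50
err <- matrix(abs(rnorm(k * mstop, sd = 0.05)), nrow = k) +
  rep((seq_len(mstop) - 30)^2 / 500, each = k)

CV_error   <- colMeans(err)         # average the fold curves
best_mstop <- which.min(CV_error)   # iteration minimizing mean CV error
```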
Finally, the full covariate matrix X is multiply imputed M times.
If centering is applied, it uses a grand mean \mu_\star computed across
the M full-data imputations. impu_boost is applied to these
datasets for the selected number of boosting iterations to obtain the final model.
Imputation control. All key mice settings can be passed via
impute_args (a named list forwarded to mice::mice()) and/or
impute_method (a named character vector of per-variable methods).
Internally, the function builds a full default method vector from the actual
data given to mice(), then merges any user-supplied entries
by name. The names in impute_method must exactly match the
column names in X (i.e., the data passed to mice()). Partial
vectors are allowed; variables not listed fall back to defaults; unknown names
are ignored with a warning. The function sets and may override data,
method (after merging overrides), predictorMatrix, and
ignore (to enforce train-only learning). Predictor matrices can be built
with mice::quickpred() (see use_quickpred, quickpred_args)
or with mice::make.predictorMatrix().
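The merging rule can be sketched as follows; merge_methods is a hypothetical helper written for illustration, not an exported function of the package.

```r
## Start from mice's full default method vector, override entries by
## name, and warn on names that do not match any column of X.
merge_methods <- function(X, user_method) {
  meth    <- mice::make.method(X)              # full default method vector
  unknown <- setdiff(names(user_method), names(meth))
  if (length(unknown) > 0)
    warning("ignoring unknown impute_method names: ",
            paste(unknown, collapse = ", "))
  known <- intersect(names(user_method), names(meth))
  meth[known] <- user_method[known]            # merge by name
  meth
}

X <- data.frame(x1 = c(1, NA, 3), x2 = c(NA, 2, 3))
meth <- merge_methods(X, c(x1 = "norm", bogus = "pmm"))  # warns on 'bogus'
```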
Value

A list with:
CV_error: numeric vector (length mstop) of mean CV loss.
best_mstop: integer index minimizing CV_error.
final_model: numeric vector of length 1 + p with the
intercept and pooled coefficients of the final fit on full-data
imputations at best_mstop.
full_imputations: (optional) when return_full_imputations=TRUE,
a list list(X = <list length m>, y = <list length m>) containing
the full-data imputations used for the final model.
folds: integer vector of length n giving the CV fold id
for each observation (1..k).
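Reading results off the returned list can be sketched as below; res here is a small mock object with the documented fields, whereas in practice it comes from cv_boost_raw().

```r
## Mock result with the documented structure (values are illustrative).
res <- list(CV_error    = c(1.9, 1.4, 1.2, 1.3),
            best_mstop  = 3L,
            final_model = c(0.5, 0.0, 1.2, 0.0),
            folds       = c(1L, 2L, 1L, 2L))

intercept <- res$final_model[1]      # first entry is the intercept
beta      <- res$final_model[-1]     # pooled coefficients (length p)
selected  <- which(beta != 0)        # predictors retained at best_mstop
```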
References

Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable
Selection After Multiple Imputation. arXiv:2507.21807.
doi:10.48550/arXiv.2507.21807. https://arxiv.org/abs/2507.21807.
See Also

impu_boost, cv_boost_imputed, mice
Examples

utils::data(booami_sim)
X <- booami_sim[, 1:25]
y <- booami_sim[, 26]
res <- cv_boost_raw(
X = X, y = y,
k = 2, seed = 123,
impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
quickpred_args = list(mincor = 0.30, minpuc = 0.60),
mstop = 50,
show_progress = FALSE
)
# Partial custom imputation method override (X variables only)
meth <- c(X1 = "pmm")
res2 <- cv_boost_raw(
X = X, y = y,
k = 2, seed = 123,
impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 456),
quickpred_args = list(mincor = 0.30, minpuc = 0.60),
mstop = 50,
impute_method = meth,
show_progress = FALSE
)