booami-package: R Documentation
booami provides component-wise gradient boosting tailored for analysis with multiply imputed datasets. Its core contribution is MIBoost, an algorithm that couples base-learner selection across imputed datasets by minimizing an aggregated loss at each iteration, yielding a single, unified regularization path and improved model stability. For comparison, booami also includes per-dataset boosting with post-hoc pooling (estimate averaging or selection-frequency thresholding).
In each boosting iteration, candidate base-learners are fit separately within each imputed dataset, but selection is made jointly via the aggregated loss across datasets. The selected base-learner is then updated in every imputed dataset, and fitted contributions are averaged to form a single combined predictor. This enforces uniform variable selection while preserving dataset-specific gradients and updates.
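To make the iteration concrete, here is a schematic sketch in R of one MIBoost step with component-wise linear base-learners on centered data under squared-error loss. The function miboost_step and its arguments are hypothetical illustrations, not booami's internal code:

miboost_step <- function(Xs, res, nu = 0.1) {
  # Xs:  list of M centered n x p covariate matrices (one per imputation)
  # res: list of M current residual vectors; nu: step length
  M <- length(Xs)
  p <- ncol(Xs[[1]])
  RSS <- matrix(NA_real_, M, p)  # RSS of learner r fitted in dataset m
  b   <- matrix(NA_real_, M, p)  # dataset-specific least-squares coefficients
  for (m in seq_len(M)) {
    for (r in seq_len(p)) {
      x <- Xs[[m]][, r]
      b[m, r]   <- sum(x * res[[m]]) / sum(x * x)
      RSS[m, r] <- sum((res[[m]] - x * b[m, r])^2)
    }
  }
  r_star <- which.min(colSums(RSS))  # joint selection via aggregated loss
  for (m in seq_len(M)) {            # update the winner in every dataset
    res[[m]] <- res[[m]] - nu * Xs[[m]][, r_star] * b[m, r_star]
  }
  list(selected    = r_star,
       coef_update = nu * mean(b[, r_star]),  # averaged fitted contribution
       res         = res)
}

Averaging b[, r_star] across the M datasets is what collapses the dataset-specific updates into one combined predictor, while the residual updates remain dataset-specific.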
booami implements a leakage-safe cross-validation (CV) protocol:
data are first split into training and validation subsets; the training
covariates are multiply imputed; validation covariates are imputed using the
training imputation models; and (if enabled) centering uses a fold-specific
grand mean \mu_\star computed from the training imputations and applied
consistently to all imputed training and validation matrices. Errors are
averaged across imputations and folds to select the optimal number of boosting
iterations (mstop). Use cv_boost_raw for raw data with
missing covariates (imputation inside CV), or cv_boost_imputed
when imputed datasets are already prepared.
Note: In the recommended predictive workflow implemented by
cv_boost_raw(), rows with missing outcomes y are removed before
fold assignment, and the outcome is not used for imputation (covariates X
are imputed without including y as a predictor).
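The following sketch shows this protocol for a single fold. It is illustrative rather than booami's implementation; it assumes numeric covariates and relies on the ignore argument of mice() (mice >= 3.12), which fits the imputation models on training rows only while still imputing the held-out rows:

library(mice)

## toy data frame with missing values (illustrative)
set.seed(1)
n <- 100
dat <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
dat$y[sample(n, 5)]   <- NA
dat$x1[sample(n, 20)] <- NA
dat$x2[sample(n, 15)] <- NA

dat  <- dat[!is.na(dat$y), ]                      # drop rows with missing y
fold <- sample(rep(1:5, length.out = nrow(dat)))  # folds assigned afterwards

k   <- 1                                   # current validation fold
X   <- dat[, setdiff(names(dat), "y")]     # y is not used for imputation
imp <- mice(X, m = 5, ignore = (fold == k), printFlag = FALSE)
train_list <- lapply(1:5, function(i) complete(imp, i)[fold != k, ])
valid_list <- lapply(1:5, function(i) complete(imp, i)[fold == k, ])

Because the imputation models behind imp never see the fold-k rows, the validation covariates are imputed from training information only.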
MIBoost (uniform selection): Joint base-learner selection via aggregated loss across imputed datasets; averaged fitted functions yield a single model.
Per-dataset boosting (with pooling): Independent boosting in each imputed dataset, with pooling by estimate averaging or by selection-frequency thresholding.
Flexible losses and learners: Supports Gaussian and logistic losses with component-wise base-learners; extensible to other learners.
Leakage-safe CV: Training/validation split → train-only imputation of
covariates → fold-wise grand-mean centering (\mu_\star) → error
aggregation across imputations and folds (see the centering sketch after this list).
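Continuing the sketch above, the fold-specific grand mean \mu_\star can be computed by averaging column means over the M imputed training matrices and subtracting it from every imputed training and validation matrix (again assuming numeric covariates):

mu_star <- Reduce(`+`, lapply(train_list, function(Z) colMeans(as.matrix(Z)))) /
  length(train_list)
train_list <- lapply(train_list, function(Z) sweep(as.matrix(Z), 2, mu_star))
valid_list <- lapply(valid_list, function(Z) sweep(as.matrix(Z), 2, mu_star))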
impu_boost — Core routine implementing MIBoost as well as
per-dataset boosting with pooling.
cv_boost_raw — Leakage-safe k-fold CV starting from a single
dataset with missing covariates (imputation performed inside each fold).
cv_boost_imputed — CV when imputed datasets (and splits) are
already available.
Raw data with missing covariates: use cv_boost_raw() to impute
within folds, select mstop, and fit the final model.
Already imputed datasets: use cv_boost_imputed() to select
mstop and fit.
Direct control: call impu_boost() when you want to run
MIBoost (or per-dataset boosting) directly, optionally followed by pooling;
a brief orientation sketch follows this list.
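The actual argument lists are documented on each help page and are not reproduced here; a minimal orientation:

library(booami)

?cv_boost_raw      # raw data with missing covariates: imputation inside CV
?cv_boost_imputed  # imputed datasets (and splits) already prepared
?impu_boost        # run MIBoost or per-dataset boosting directly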
At boosting iteration t, for each candidate base-learner r and
each imputed dataset m = 1,\dots,M, let
RSS_r^{(m)[t]} denote the residual sum of squares.
The aggregated loss is
L_r^{[t]} = \sum_{m=1}^M RSS_r^{(m)[t]}.
The base-learner r^* with minimal aggregated loss is selected jointly,
updated in all imputed datasets, and the fitted contributions are averaged to
form the combined predictor. After t_{\mathrm{stop}} iterations (the tuned
mstop), this yields a single final model.
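As a toy numeric check of the selection rule (values invented for illustration), take M = 3 imputations and four candidate base-learners:

RSS <- matrix(c(10.2, 11.0,  9.8,   # learner 1 across the 3 imputations
                12.1, 12.5, 12.0,   # learner 2
                 9.9, 10.4, 10.1,   # learner 3
                11.5, 11.2, 11.8),  # learner 4
              nrow = 3)
L_agg  <- colSums(RSS)      # aggregated losses: 31.0 36.6 30.4 34.5
r_star <- which.min(L_agg)  # 3: learner 3 is selected in every dataset

Learner 3 does not have the smallest RSS in dataset 3 (learner 1 does), but it wins on the aggregate, which is precisely how MIBoost couples selection across imputations.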
Bühlmann, P. and Hothorn, T. (2007). "Boosting Algorithms: Regularization, Prediction and Model Fitting." Statistical Science, 22(4), 477–505. doi:10.1214/07-STS242
Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software, 45(3), 1–67. doi:10.18637/jss.v045.i03
For details, see: Kuchen, R. (2025). "MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation." arXiv:2507.21807. doi:10.48550/arXiv.2507.21807, https://arxiv.org/abs/2507.21807
mboost: General framework for component-wise gradient boosting in R.
miselect: Extensions of the LASSO and elastic net for variable selection after multiple imputation.
mice: Standard tool for multiple imputation of missing data.
Maintainer: Robert Kuchen <rokuchen@uni-mainz.de>