booami-package: Boosting with Multiple Imputation (booami)


Boosting with Multiple Imputation (booami)

Description

booami provides component-wise gradient boosting tailored for analysis with multiply imputed datasets. Its core contribution is MIBoost, an algorithm that couples base-learner selection across imputed datasets by minimizing an aggregated loss at each iteration, yielding a single, unified regularization path and improved model stability. For comparison, booami also includes per-dataset boosting with post-hoc pooling (estimate averaging or selection-frequency thresholding).

Details

What is MIBoost?

In each boosting iteration, candidate base-learners are fit separately within each imputed dataset, but selection is made jointly via the aggregated loss across datasets. The selected base-learner is then updated in every imputed dataset, and fitted contributions are averaged to form a single combined predictor. This enforces uniform variable selection while preserving dataset-specific gradients and updates.
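To make the coupling concrete, here is a minimal, self-contained R sketch of one such iteration for Gaussian loss with simple linear base-learners. It illustrates the scheme described above; it is not the package's internal code, and all names (miboost_step, X_list, f_list, nu) are ours.

  ## One MIBoost iteration: joint selection, per-dataset updates.
  ## X_list: list of M imputed n x p covariate matrices; y: complete outcome;
  ## f_list: list of current per-dataset fitted values; nu: learning rate.
  miboost_step <- function(X_list, y, f_list, nu = 0.1) {
    M <- length(X_list); p <- ncol(X_list[[1]])
    u_list <- lapply(seq_len(M), function(m) y - f_list[[m]])  # residuals
    ## Aggregated loss of candidate r: sum of its per-dataset RSS
    agg <- sapply(seq_len(p), function(r)
      sum(sapply(seq_len(M), function(m) {
        fit <- lm.fit(cbind(1, X_list[[m]][, r]), u_list[[m]])
        sum(fit$residuals^2)
      })))
    r_star <- which.min(agg)            # one learner, selected jointly
    for (m in seq_len(M)) {             # ... then updated in every dataset
      fit <- lm.fit(cbind(1, X_list[[m]][, r_star]), u_list[[m]])
      f_list[[m]] <- f_list[[m]] + nu * fit$fitted.values
    }
    list(f_list = f_list, selected = r_star)
  }

Iterating this step and finally averaging the entries of f_list gives the single combined predictor.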

Cross-validation without leakage

booami implements a leakage-avoiding CV protocol: data are first split into training and validation subsets; the training covariates are multiply imputed; validation covariates are imputed using the training imputation models; and (if enabled) centering uses a fold-specific grand mean \mu_\star computed from the training imputations and applied consistently to all imputed training and validation matrices. Errors are averaged across imputations and folds to select the optimal number of boosting iterations (mstop). Use cv_boost_raw for raw data with missing covariates (imputation inside CV), or cv_boost_imputed when imputed datasets are already prepared.

Note: In the recommended predictive workflow implemented by cv_boost_raw(), rows with missing outcomes y are removed before fold assignment, and the outcome is not used for imputation (covariates X are imputed without including y as a predictor).
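The train-only imputation step can be sketched with mice. This is a minimal sketch, assuming the ignore argument of mice() (available in recent versions of mice), which fits the imputation models on the training rows only while still filling in the held-out rows; fold construction and centering are simplified, and this is not what cv_boost_raw() does verbatim.

  library(mice)

  set.seed(1)
  dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))  # covariates only:
  dat$x1[sample(100, 15)] <- NA                        # y stays out of the imputation
  val <- 1:20                                          # validation rows of one fold

  ## Imputation models are estimated on training rows only
  imp <- mice(dat, m = 5, printFlag = FALSE,
              ignore = seq_len(nrow(dat)) %in% val)

  ## Fold-specific grand mean \mu_\star over the training imputations ...
  mu_star <- colMeans(do.call(rbind,
               lapply(1:5, function(m) complete(imp, m)[-val, ])))

  ## ... applied consistently to every completed dataset (training and validation rows)
  X1c <- sweep(complete(imp, 1), 2, mu_star)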

Key features

  • MIBoost (uniform selection): Joint base-learner selection via aggregated loss across imputed datasets; averaged fitted functions yield a single model.

  • Per-dataset boosting (with pooling): Independent boosting in each imputed dataset, with pooling by estimate averaging or by selection-frequency thresholding.

  • Flexible losses and learners: Supports Gaussian and logistic losses with component-wise base-learners; extensible to other learners.

  • Leakage-safe CV: Training/validation split → train-only imputation of covariates → fold-wise grand-mean centering (\mu_\star) → error aggregation across imputations and folds.

Main functions

  • impu_boost — Core routine implementing MIBoost as well as per-dataset boosting with pooling.

  • cv_boost_raw — Leakage-safe k-fold CV starting from a single dataset with missing covariates (imputation performed inside each fold).

  • cv_boost_imputed — CV when imputed datasets (and splits) are already available.

Typical workflow

  1. Raw data with missing covariates: use cv_boost_raw() to impute within folds, select mstop, and fit the final model.

  2. Already imputed datasets: use cv_boost_imputed() to select mstop and fit; a short preparation sketch follows this list.

  3. Direct control: call impu_boost() when you want to run MIBoost (or per-dataset boosting) directly, optionally followed by pooling.
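For route 2, the preparation step can be shown concretely: the mice code below is runnable, while the booami calls are left as commented placeholders because their exact argument names are a matter for the package help pages (?cv_boost_imputed, ?impu_boost), not this sketch.

  library(mice)

  imp    <- mice(nhanes, m = 5, printFlag = FALSE)       # 5 imputed datasets
  X_list <- lapply(1:5, function(m) complete(imp, m))    # one data frame per imputation

  ## library(booami)
  ## cv  <- cv_boost_imputed(...)   # select mstop across the imputed datasets
  ## fit <- impu_boost(...)         # final MIBoost fit at the chosen mstop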

Mathematical sketch

At boosting iteration t, for each candidate base-learner r and each imputed dataset m = 1,\dots,M, let RSS_r^{(m)[t]} denote the residual sum of squares obtained when base-learner r is fit to the current residuals (negative gradients) in dataset m. The aggregated loss is

L_r^{[t]} = \sum_{m=1}^M RSS_r^{(m)[t]}.

The base-learner r^* with minimal aggregated loss is selected jointly, updated in all imputed datasets, and the fitted contributions are averaged to form the combined predictor. After t_{\mathrm{stop}} iterations, this yields a single final model.
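A toy numeric check of the selection rule (all RSS values invented): the two imputed datasets below disagree on the best learner, and the aggregated loss resolves the disagreement.

  ## Rows: candidates r = 1..3; columns: imputed datasets m = 1, 2.
  rss <- rbind(c(10.2, 8.6),   # r = 1: best in dataset m = 2
               c( 8.9, 9.1),   # r = 2: best in dataset m = 1
               c( 9.5, 9.4))   # r = 3
  L <- rowSums(rss)            # L_r = 18.8, 18.0, 18.9
  which.min(L)                 # r* = 2 is selected jointly in all datasets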

References

  • Bühlmann, P. and Hothorn, T. (2007). "Boosting Algorithms: Regularization, Prediction and Model Fitting." Statistical Science, 22(4), 477–505. doi:10.1214/07-STS242

  • Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451

  • van Buuren, S. and Groothuis-Oudshoorn, K. (2011). "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software, 45(3), 1–67. doi:10.18637/jss.v045.i03

Citation

For details, see: Kuchen, R. (2025). "MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation." arXiv preprint, doi:10.48550/arXiv.2507.21807, https://arxiv.org/abs/2507.21807.

See also

  • mboost: General framework for component-wise gradient boosting in R.

  • miselect: Implements MI-extensions of LASSO and elastic nets for variable selection after multiple imputation.

  • mice: Standard tool for multiple imputation of missing data.

Author(s)

Maintainer: Robert Kuchen <rokuchen@uni-mainz.de>
