mixgb: Multiple Imputation Through XGBoost

Introduction

The mixgb package provides a scalable approach to multiple imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost, an efficient implementation of gradient-boosted trees, to capture complex interactions and non-linear relationships automatically. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive mean matching settings.

We also recommend our package vismi (Visualisation Tools for Multiple Imputation), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.

Impute missing values with mixgb

We first load the mixgb package and the newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

library(mixgb)
str(newborn)
colSums(is.na(newborn))

To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to m = 5. The data do not need to be converted to a dgCMatrix or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.

# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
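
For context, the following is a minimal sketch of what one-hot encoding a mixed-type data frame by hand looks like in base R. It is not mixgb's internal code: the toy data frame and its column names are made up for illustration, and model.matrix() returns a dense matrix rather than a dgCMatrix.

```r
# Toy data frame (hypothetical, not the newborn data) with mixed types
toy <- data.frame(
  weight = c(3.1, 4.2, 5.0, 3.8),
  visits = c(1L, 2L, 2L, 3L),
  sex    = factor(c("F", "M", "M", "F"))
)

# One-hot encode the factor column. "- 1" drops the intercept, so every
# level of the factor gets its own 0/1 indicator column.
x <- model.matrix(~ . - 1, data = toy)

print(colnames(x))
print(dim(x))
```

With mixgb none of this is required, since the data frame is passed to mixgb() as is.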

Customise imputation settings

We can also customise imputation settings:

The number of imputed datasets: m.
The number of imputation iterations: maxit.
XGBoost hyperparameters and verbosity settings: xgb.params, nrounds, early_stopping_rounds, print_every_n and verbose.
Subsampling ratio: by default, subsample = 0.7. Users can change this value through the xgb.params argument.
Predictive mean matching settings: pmm.type, pmm.k and pmm.link.
Whether ordinal factors should be converted to integers (which may speed up imputation): ordinalAsInteger.
Initial imputation methods for different variable types: initial.num, initial.int and initial.fac.
Whether to save models for imputing new data: save.models and save.vars.

set.seed(2026)
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200, early_stopping_rounds = 10,
  print_every_n = 10L, verbose = 0
)
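
To give some intuition for the predictive mean matching settings used above, here is a minimal base R sketch of the general idea. It rests on several assumptions: the data are simulated, a linear regression stands in for XGBoost as the predictive model, and the matching rule is the simplest variant (match on predicted values, then draw one of the k nearest donors). The variants selected by pmm.type in mixgb may differ in detail.

```r
set.seed(1)

# Simulated data (hypothetical): y depends on x, and 20 of 100 y values are missing
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
miss <- sample(n, 20)
y_obs <- y
y_obs[miss] <- NA

# Stand-in predictive model: linear regression instead of XGBoost.
# lm() drops the rows with a missing response by default.
fit <- lm(y_obs ~ x)
pred <- predict(fit, newdata = data.frame(x = x))

obs <- which(!is.na(y_obs))
k <- 5  # number of donors, playing the role of pmm.k

pmm_one <- function(i) {
  # Distance between this case's prediction and the predictions of observed cases
  d <- abs(pred[obs] - pred[i])
  # Indices of the k observed cases with the closest predictions
  donors <- obs[order(d)[seq_len(k)]]
  # Impute the observed value of one randomly chosen donor
  y_obs[donors[sample.int(k, 1)]]
}

y_imp <- y_obs
y_imp[miss] <- vapply(miss, pmm_one, numeric(1))
```

Because every imputed value is copied from an observed case, the imputations stay within the range of the observed data, which is what keeps them realistic.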

Tune hyperparameters

Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be substantially reduced because many of them are correlated. In mixgb, the function mixgb_cv() is provided to tune the number of boosting rounds (nrounds). XGBoost itself does not define a default value for nrounds, so a value must always be supplied. mixgb() falls back to nrounds = 100 when none is given; however, we recommend running mixgb_cv() first to obtain an appropriate value.

params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds

By default, mixgb_cv() randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of mixgb_cv() may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the response and select_features arguments, respectively.

cv.results <- mixgb_cv(
  data = newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c("age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"),
  xgb.params = params, verbose = FALSE
)

cv.results$best.nrounds
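
The default behaviour described earlier (a randomly chosen incomplete variable as the response, fitted on complete cases) can be pictured with the base R sketch below. The toy data frame and its column names are hypothetical, and this is not the code used inside mixgb_cv(); it only shows why the chosen response, and hence the result, can change between runs unless the response is fixed.

```r
set.seed(123)

# Toy data (hypothetical): x1 and x2 are incomplete, x3 is fully observed
toy <- data.frame(
  x1 = c(1, NA, 3, 4, 5),
  x2 = c(NA, 2, 3, 4, 5),
  x3 = c(1, 2, 3, 4, 5)
)

# Variables with at least one missing value are candidate responses
incomplete <- names(toy)[colSums(is.na(toy)) > 0]

# Pick one candidate at random; this draw is the source of run-to-run variation
response <- sample(incomplete, 1)
predictors <- setdiff(names(toy), response)

# Keep only rows with no missing values in any variable
cc <- toy[complete.cases(toy), ]
```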

We can then set nrounds = cv.results$best.nrounds in mixgb() to generate five imputed datasets.

imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)

Inspect multiply imputed values

Older versions of the mixgb package included a few visual diagnostic functions. These have since been removed from mixgb.

We recommend our standalone package vismi (Visualisation Tools for Multiple Imputation), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.

For more details, please visit:

https://agnesdeng.github.io/vismi/

https://github.com/agnesdeng/vismi