mixgb: Multiple Imputation Through XGBoost

Introduction

The mixgb package provides a scalable approach to multiple imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost, an efficient implementation of gradient-boosted trees, to capture complex interactions and non-linear relationships automatically. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive mean matching settings.

We also recommend our package vismi (Visualisation Tools for Multiple Imputation), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.


Impute missing values with mixgb

We first load the mixgb package and the newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

library(mixgb)
str(newborn)
colSums(is.na(newborn))
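
Raw missing counts can be easier to read as a sorted table with percentages. The following is a minimal base R sketch; missing_summary() and the toy data frame are hypothetical names introduced here for illustration and are not part of mixgb.

```r
# Hypothetical helper (not part of mixgb): per-variable missing counts
# and percentages, restricted to incomplete variables.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  out <- data.frame(
    variable = names(n_missing),
    n_missing = as.integer(n_missing),
    pct_missing = unname(round(100 * n_missing / nrow(df), 1)),
    row.names = NULL
  )
  # keep only variables with missing values, most incomplete first
  out <- out[out$n_missing > 0, ]
  out[order(-out$n_missing), ]
}

# Toy data standing in for a real dataset such as newborn
toy <- data.frame(
  age = c(1, NA, 3, 4),
  sex = factor(c("F", "M", NA, NA)),
  weight = c(3.2, 4.1, 5.0, 5.6)
)
miss <- missing_summary(toy)
print(miss)
```

The same helper could be applied to the real data with missing_summary(newborn).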

To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to m = 5. The data do not need to be converted to a dgCMatrix or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.

# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
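
Assuming that mixgb() returns a list of m completed datasets when models are not saved, a quick sanity check is to confirm the length of the list and that no missing values remain. The sketch below uses a hypothetical helper, all_complete(), and a toy list standing in for the real output so that it runs without the package.

```r
# Hypothetical helper (not part of mixgb): TRUE if every dataset in the
# list contains no missing values.
all_complete <- function(imp_list) {
  all(vapply(imp_list, function(d) !any(is.na(d)), logical(1)))
}

# Toy stand-in for the output of mixgb(): a list of m = 2 completed datasets
toy_imp <- list(
  data.frame(x = c(1, 2.0, 3), y = c("a", "b", "a")),
  data.frame(x = c(1, 2.4, 3), y = c("a", "b", "a"))
)

n_datasets <- length(toy_imp)    # number of imputed datasets
complete <- all_complete(toy_imp)

# With the real object one would call, for example:
# length(imp_list); all_complete(imp_list); head(imp_list[[1]])
```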

Customise imputation settings

We can also customise imputation settings:

set.seed(2026)
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200,
  early_stopping_rounds = 10, print_every_n = 10L, verbose = 0
)
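
To give some intuition for the pmm.k argument, here is a generic sketch of predictive mean matching in base R. It is not mixgb's internal implementation, and pmm_impute() and the toy vectors are hypothetical: for each missing case, the k observed cases with the closest predicted values are treated as donors, and one donor's observed value is borrowed at random.

```r
# Generic predictive mean matching sketch (illustrative only; this is
# NOT the code mixgb uses internally).
pmm_impute <- function(pred_obs, y_obs, pred_mis, k = 5) {
  vapply(pred_mis, function(p) {
    # indices of the k observed cases whose predictions are closest to p
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(y_obs)))]
    # pick one donor at random and borrow its observed value
    y_obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}

set.seed(1)
y_obs    <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 13.4)  # observed values
pred_obs <- c(10.0, 11.2, 10.1, 12.4, 11.0, 10.8, 13.0) # model predictions for observed cases
pred_mis <- c(10.5, 12.8)                               # model predictions for missing cases

imputed <- pmm_impute(pred_obs, y_obs, pred_mis, k = 3)
imputed
```

Because every imputed value is an actually observed value, the imputations stay within the range of plausible data. A larger k widens the donor pool and adds variability, while a smaller k keeps imputations closer to the model prediction.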

Tune hyperparameters

Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be substantially reduced because many of them are correlated. In mixgb, the function mixgb_cv() is provided to tune the number of boosting rounds (nrounds). As XGBoost does not define a default value for nrounds, users must specify this parameter explicitly. The default setting in mixgb() is nrounds = 100; however, we recommend using mixgb_cv() to get an appropriate value first.

params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds
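
As a rough illustration of how a best number of rounds can be read off a cross-validation log, the sketch below picks the iteration with the lowest mean test error. The eval_log data frame is hypothetical, with column names modelled on the evaluation log of xgboost::xgb.cv(); the exact columns returned by mixgb_cv() and the rule it uses for best.nrounds are assumptions here.

```r
# Hypothetical evaluation log shaped like xgboost::xgb.cv() output
eval_log <- data.frame(
  iter = 1:6,
  train_rmse_mean = c(2.10, 1.40, 1.00, 0.80, 0.65, 0.55),
  test_rmse_mean  = c(2.20, 1.60, 1.30, 1.25, 1.28, 1.33)
)

# Training error keeps falling, but test error starts rising after
# iteration 4, so further rounds would only overfit.
best_iter <- eval_log$iter[which.min(eval_log$test_rmse_mean)]
best_iter
```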

By default, mixgb_cv() randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of mixgb_cv() may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the response and select_features arguments, respectively.

cv.results <- mixgb_cv(
  data = newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c(
    "age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"
  ),
  xgb.params = params, verbose = FALSE
)

cv.results$best.nrounds

We can then set nrounds = cv.results$best.nrounds in mixgb() to generate five imputed datasets.

imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)

Inspect multiply imputed values

Older versions of the mixgb package included a few visual diagnostic functions. These have now been removed from mixgb.

We recommend our standalone package vismi (Visualisation Tools for Multiple Imputation), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.

For more details, please visit:

https://agnesdeng.github.io/vismi/

https://github.com/agnesdeng/vismi




mixgb documentation built on Jan. 17, 2026, 5:07 p.m.