The mixgb package provides a scalable approach to imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost, an efficient implementation of gradient-boosted trees, to automatically capture complex interactions and non-linear relationships. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive mean matching settings.
We also recommend our package vismi (Visualisation Tools for Multiple Imputation), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.
We first load the mixgb package and the newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.
library(mixgb)
str(newborn)
colSums(is.na(newborn))
To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to m = 5. The data do not need to be converted to a dgCMatrix or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.
# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
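A quick sanity check on the result can be useful before moving on. The sketch below assumes that mixgb() returns a plain list with one imputed dataset per element when models are not saved, and that every missing value has been filled in; both are assumptions to verify against the package documentation for your installed version.

```r
library(mixgb)

# Impute with the defaults, as in the call above
imp_list <- mixgb(data = newborn, m = 5)

# Number of imputed datasets returned (assumed to be one list element each)
length(imp_list)

# Missing values remaining in each imputed dataset
sapply(imp_list, function(d) sum(is.na(d)))

# Missing values in the original data, for comparison
sum(is.na(newborn))
```

If any imputed dataset still reports missing values, that is worth investigating before fitting downstream models.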
We can also customise imputation settings:
The number of imputed datasets: m.
The number of imputation iterations: maxit.
XGBoost hyperparameters and verbose settings: xgb.params, nrounds, early_stopping_rounds, print_every_n and verbose.
Subsampling ratio: by default, subsample = 0.7. Users can change this value under the xgb.params argument.
Predictive mean matching settings: pmm.type, pmm.k and pmm.link.
Whether ordinal factors should be converted to integers (the imputation process may be faster): ordinalAsInteger.
Initial imputation methods for different types of variables: initial.num, initial.int and initial.fac.
Whether to save models for imputing new data: save.models and save.vars.
set.seed(2026)

# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn,
  m = 10,
  maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto",
  pmm.k = 5,
  pmm.link = "prob",
  initial.num = "normal",
  initial.int = "mode",
  initial.fac = "mode",
  save.models = FALSE,
  save.vars = NULL,
  xgb.params = params,
  nrounds = 200,
  early_stopping_rounds = 10,
  print_every_n = 10L,
  verbose = 0
)
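To give some intuition for what a donor count such as pmm.k controls, here is a minimal base-R sketch of the general idea behind predictive mean matching. It is an illustration only: pmm_draw() is a hypothetical helper written for this example, the toy numbers are made up, and the code is not how mixgb implements matching internally.

```r
# Illustrative sketch of predictive mean matching (PMM) in base R.
# pmm_draw() is a hypothetical helper, not a function from mixgb.
pmm_draw <- function(pred_obs, y_obs, pred_mis, k = 5) {
  stopifnot(length(pred_obs) == length(y_obs), k >= 1, k <= length(y_obs))
  vapply(pred_mis, function(p) {
    # Indices of the k donors whose predictions are closest to p
    donors <- order(abs(pred_obs - p))[seq_len(k)]
    # Draw one donor at random and borrow its observed value
    y_obs[donors[sample.int(k, 1)]]
  }, numeric(1))
}

set.seed(1)
y_obs    <- c(10, 12, 15, 20, 22, 30)     # observed values (donors)
pred_obs <- c(11, 12.5, 14, 19, 23, 29)   # model predictions for the donors
pred_mis <- c(13, 21)                     # model predictions for missing cases

imputed <- pmm_draw(pred_obs, y_obs, pred_mis, k = 3)
imputed
```

Because each imputed value is borrowed from an observed case, the imputations always lie within the observed range of the variable. A larger k widens the donor pool and adds variability, while a smaller k keeps imputations closer to the model prediction.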
Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be substantially reduced because many of them are correlated. In mixgb, the function mixgb_cv() is provided to tune the number of boosting rounds (nrounds). As XGBoost does not define a default value for nrounds, users must specify this parameter explicitly. The default setting in mixgb() is nrounds = 100; however, we recommend using mixgb_cv() to get an appropriate value first.
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds
By default, mixgb_cv() randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of mixgb_cv() may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the response and select_features arguments, respectively.
cv.results <- mixgb_cv(
  data = newborn,
  nfold = 10,
  nrounds = 100,
  early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c(
    "age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"
  ),
  xgb.params = params,
  verbose = FALSE
)
cv.results$best.nrounds
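To see why the default selection described earlier can vary between runs, the same steps (pick a random incomplete variable, keep the complete cases, use the rest as predictors) can be mimicked in base R. This sketch uses the built-in airquality data rather than newborn so that it runs without mixgb. It does not call or reproduce the internals of mixgb_cv(), and the variable names are chosen for this example.

```r
# Base-R illustration only; this does not call or reproduce mixgb_cv().
data(airquality)

# Variables that contain at least one missing value
incomplete_vars <- names(which(colSums(is.na(airquality)) > 0))

# Rows with no missing values in any column
complete_rows <- airquality[complete.cases(airquality), ]

# A randomly chosen incomplete variable plays the role of the response;
# without a fixed seed, this choice can change from run to run
set.seed(2026)
response <- sample(incomplete_vars, 1)

# All remaining variables act as predictors
predictors <- setdiff(names(airquality), response)

incomplete_vars
nrow(complete_rows)
response
predictors
```

Specifying the response and features explicitly, as in the call above, removes this source of run-to-run variation.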
We can then set nrounds = cv.results$best.nrounds in mixgb() to generate five imputed datasets.
imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)
Older versions of the mixgb package included a few visual diagnostic functions. These have now been removed from mixgb.
We recommend our standalone package vismi (Visualisation Tools for Multiple Imputation), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.
For more details, please visit:
https://agnesdeng.github.io/vismi/
https://github.com/agnesdeng/vismi