Mixgb offers a scalable solution for imputing large datasets using XGBoost, subsampling and predictive mean matching. Our method utilizes the capabilities of XGBoost, a highly efficient implementation of gradient boosted trees, to capture interactions and non-linear relations automatically. Moreover, we have integrated subsampling and predictive mean matching to minimize bias and reflect appropriate imputation variability. Our package supports various types of variables and offers flexible settings for subsampling and predictive mean matching. We also include diagnostic tools for evaluating the quality of the imputed values.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
mixgb
We first load the mixgb
package and the nhanes3_newborn
dataset, which contains 16 variables of various types
(integer/numeric/factor/ordinal factor). There are 9 variables with missing values.
library(mixgb) str(nhanes3_newborn) colSums(is.na(nhanes3_newborn))
To impute this dataset, we can use the default settings. The default
number of imputed datasets is m = 5
. Note that we do not need to convert
our data into dgCMatrix or one-hot coding format. Our package will automatically
convert it for you. Variables should be of the following types:
numeric, integer, factor or ordinal factor.
# use mixgb with default settings imputed.data <- mixgb(data = nhanes3_newborn, m = 5)
We can also customize imputation settings:
The number of imputed datasets
m
The number of imputation iterations
maxit
XGBoost hyperparameters and verbose settings.
xgb.params
, nrounds
, early_stopping_rounds
, print_every_n
and verbose
.
Subsampling ratio. By default, subsample = 0.7
. Users can change this value under the xgb.params
argument.
Predictive mean matching settings
pmm.type
, pmm.k
and pmm.link
.
Whether ordinal factors should be converted to integer (imputation process may be faster)
ordinalAsInteger
Whether or not to use bootstrapping
bootstrap
Initial imputation methods for different types of variables
initial.num
, initial.int
and initial.fac
.
Whether to save models for imputing newdata
save.models
and save.vars
.
# Use mixgb with chosen settings params <- list( max_depth = 5, subsample = 0.9, nthread = 2, tree_method = "hist" ) imputed.data <- mixgb( data = nhanes3_newborn, m = 10, maxit = 2, ordinalAsInteger = FALSE, bootstrap = FALSE, pmm.type = "auto", pmm.k = 5, pmm.link = "prob", initial.num = "normal", initial.int = "mode", initial.fac = "mode", save.models = FALSE, save.vars = NULL, xgb.params = params, nrounds = 200, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0 )
Imputation performance can be affected by the hyperparameter settings. Although
tuning a large set of hyperparameters may appear intimidating, it is often
possible to narrowing down the search space because many hyperparameters are
correlated. In our package, the function mixgb_cv()
can be used to tune
the number of boosting rounds - nrounds
. There is no default nrounds
value
in XGBoost,
so users are required to specify this value themselves. The
default nrounds
in mixgb()
is 100. However, we recommend using
mixgb_cv()
to find the optimal nrounds
first.
params <- list(max_depth = 3, subsample = 0.7, nthread = 2) cv.results <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE) cv.results$evaluation.log cv.results$response cv.results$best.nrounds
By default, mixgb_cv()
will randomly choose an incomplete variable as
the response and build an XGBoost model with other variables as explanatory
variables using the complete cases of the dataset. Therefore, each run of mixgb_cv()
will likely return different results. Users can also specify the response
and covariates in the argument response
and select_features
respectively.
cv.results <- mixgb_cv( data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1, response = "BMPHEAD", select_features = c("HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1", "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"), xgb.params = params, verbose = FALSE ) cv.results$best.nrounds
Let us just try setting nrounds = cv.results$best.nrounds
in mixgb()
to obtain 5 imputed datasets.
imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = cv.results$best.nrounds)
The mixgb
package used to provide a few visual diagnostics functions. However, we have moved these functions to the vismi
package, which provides a wide range of visualisation tools for multiple imputation.
For more details, please check the vismi
package on GitHub Visualisation Tools for Multiple Imputation.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.