impute_multivariate: Multivariate, model-based imputation

impute_multivariate    R Documentation

Multivariate, model-based imputation


Models that simultaneously optimize imputation of multiple variables. Methods include imputation based on EM-estimation of multivariate normal parameters, imputation based on iterative Random Forest estimates, and stochastic imputation based on bootstrapped EM-estimation of multivariate normal parameters.


impute_em(dat, formula, verbose = 0, ...)

impute_mf(dat, formula, ...)



dat: [data.frame] with variables to be imputed.


formula: [formula] imputation model description


verbose: [numeric] Controls the amount of output printed to screen. Higher values mean more output, typically per iteration.

  • 0 or a number ≥ 1 for impute_em

  • 0, 1, or 2 for impute_emb


...: Options passed to

  • norm::em.norm for impute_em

  • missForest::missForest for impute_mf
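As a hedged illustration of how verbose and the extra options might be used, the sketch below assumes that the simputation and norm packages are installed and that maxits (an argument of norm::em.norm) is forwarded unchanged through the dots. The data set and the injected missing values are made up for the example.

```r
library(simputation)

# Toy data: blank out a few values in two numeric columns of iris.
dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[5:7, "Sepal.Width"]  <- NA

# verbose = 1 asks for progress output from the EM iterations;
# maxits is assumed to be passed on to norm::em.norm via '...'.
out <- impute_em(
  dat,
  Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width,
  verbose = 1,
  maxits  = 500
)

# Count the missing values left in the two imputed columns.
colSums(is.na(out[c("Sepal.Length", "Sepal.Width")]))
```

With verbose = 0 the same call should run silently.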

Model specification

Formulas are of the form

  IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

When IMPUTED_VARIABLES is empty, every variable in MODEL_SPECIFICATION will be imputed. When IMPUTED_VARIABLES is specified, all variables in IMPUTED_VARIABLES and MODEL_SPECIFICATION are part of the model, but only the IMPUTED_VARIABLES are imputed in the output.

GROUPING_VARIABLES specify what categorical variables are used to split-impute-combine the data. Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
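The formula variants described in this section can be sketched as follows. This is an illustration under stated assumptions, not an example taken from the package: it assumes simputation and norm are installed, that an empty IMPUTED_VARIABLES part is written as a one-sided formula, and that grouping variables follow a vertical bar. The missing values injected into iris are made up.

```r
library(simputation)

# Toy data: blank out values in two numeric columns, spread over all species
# so that every group has something to impute.
dat <- iris
dat[c(1:3, 51:53, 101:103), "Sepal.Length"] <- NA
dat[c(5:7, 55:57, 105:107), "Petal.Width"]  <- NA

# Empty left-hand side (assumed to be a one-sided formula): every variable
# on the right is part of the model and every one of them is imputed.
out_all <- impute_em(dat, ~ Sepal.Length + Sepal.Width + Petal.Width)

# Explicit left-hand side: all listed variables are part of the model,
# but only Sepal.Length is imputed in the output.
out_one <- impute_em(dat, Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width)

# Grouping: split by Species, impute within each group, then recombine.
out_grp <- impute_em(dat, Sepal.Length ~ Sepal.Width + Petal.Length | Species)

# Compare how many missing values remain under each specification.
sapply(list(all = out_all, one = out_one, grouped = out_grp),
       function(d) colSums(is.na(d[c("Sepal.Length", "Petal.Width")])))
```

Under the rules stated above, Petal.Width should stay incomplete in out_one and out_grp, because it is not among the imputed variables there.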


EM-based imputation with impute_em only works for numerical variables. These variables are assumed to follow a multivariate normal distribution whose mean vector and covariance matrix are estimated with the EM algorithm of Dempster, Laird and Rubin (1977). The imputations are the expected values of the missing values, conditional on the observed values and the estimated parameters.
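To make the conditional expectation concrete, here is a minimal base-R sketch for a single record. It is not the code used by impute_em or norm: impute_conditional_mean is a hypothetical helper, and mu and Sigma are made-up stand-ins for the EM estimates. It relies on the standard conditional mean of a multivariate normal, mu_m + Sigma_mo Sigma_oo^-1 (x_o - mu_o), where o and m index the observed and missing entries.

```r
# Conditional-mean imputation of one record under a multivariate normal model.
# x: numeric vector with NAs; mu: mean vector; Sigma: covariance matrix.
impute_conditional_mean <- function(x, mu, Sigma) {
  miss <- is.na(x)
  if (!any(miss)) return(x)   # nothing to impute
  if (all(miss)) return(mu)   # nothing observed: fall back to the mean
  S_oo <- Sigma[!miss, !miss, drop = FALSE]  # covariance of observed entries
  S_mo <- Sigma[miss, !miss, drop = FALSE]   # missing-by-observed covariance
  # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
  x[miss] <- mu[miss] + drop(S_mo %*% solve(S_oo, x[!miss] - mu[!miss]))
  x
}

# Made-up parameters standing in for EM estimates.
mu    <- c(0, 1)
Sigma <- matrix(c(1, 0.5,
                  0.5, 2), nrow = 2, byrow = TRUE)

impute_conditional_mean(c(2, NA), mu, Sigma)  # 2 2
impute_conditional_mean(c(NA, 3), mu, Sigma)  # 0.5 3.0
```

Because the imputed value is a conditional mean, it is deterministic given the parameters; no random noise is added in this sketch.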

Multivariate Random Forest imputation with impute_mf works for numerical, categorical or mixed data types. It is based on the algorithm of Stekhoven and Buehlmann (2012). Missing values are first imputed using a rough guess, after which a predictive random forest is trained and used to re-impute the missing values. This is iterated until convergence.
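A hedged usage sketch for mixed data follows. It assumes simputation and missForest are installed and that ntree (an argument of missForest::missForest) is forwarded through the dots. The missing values are injected artificially, and the seed is set only because random forests are stochastic.

```r
library(simputation)

set.seed(1)  # random forests are stochastic; fix the seed for repeatability

# Toy data: one numeric and one categorical column with missing values.
dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[5:7, "Species"]      <- NA

# Impute both columns jointly, using the remaining columns as predictors.
# ntree is assumed to be passed on to missForest::missForest via '...'.
out <- impute_mf(
  dat,
  Sepal.Length + Species ~ Sepal.Width + Petal.Length + Petal.Width,
  ntree = 50
)

# Inspect the rows that originally contained missing values.
out[c(1:3, 5:7), c("Sepal.Length", "Species")]
```

Unlike the EM-based method, this call accepts the factor column Species, which is why it appears among the imputed variables here.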


Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society. Series B (Methodological) (1977): 1-38.

Stekhoven, D.J. and Buehlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp.112-118.

simputation documentation built on June 16, 2022, 5:10 p.m.