# impute_multivariate: Multivariate, model-based imputation

In simputation: Simple Imputation

## Description

Models that simultaneously optimize imputation of multiple variables. Methods include imputation based on EM-estimation of multivariate normal parameters, imputation based on iterative Random Forest estimates, and stochastic imputation based on bootstrapped EM-estimation of multivariate normal parameters.

## Usage

```r
impute_em(dat, formula, verbose = 0, ...)

impute_mf(dat, formula, ...)
```

## Arguments

- `dat`: `[data.frame]` with variables to be imputed.
- `formula`: `[formula]` imputation model description.
- `verbose`: `[numeric]` Control amount of output printed to screen. Higher values mean more output, typically per iteration.
  - 0 or a number ≥ 1 for `impute_em`
  - 0, 1, or 2 for `impute_emb`
- `...`: Options passed to
  - `norm::em.norm` for `impute_em`
  - `missForest::missForest` for `impute_mf`

## Model specification

Formulas are of the form

`[IMPUTED_VARIABLES] ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ] `

When `IMPUTED_VARIABLES` is empty, every variable in `MODEL_SPECIFICATION` will be imputed. When `IMPUTED_VARIABLES` is specified, all variables in `IMPUTED_VARIABLES` and `MODEL_SPECIFICATION` are part of the model, but only the `IMPUTED_VARIABLES` are imputed in the output.

`GROUPING_VARIABLES` specify what categorical variables are used to split-impute-combine the data. Grouping using `dplyr::group_by` is also supported. If groups are defined in both the formula and using `dplyr::group_by`, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
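The three formula shapes described above can be sketched as follows. This is an illustration rather than an official example: it assumes the `simputation` and `norm` packages are installed, uses the built-in `iris` data with a few values blanked out by hand, and the object names (`out_all`, `out_one`, `out_grp`) are made up for this sketch.

```r
# Sketch only: runs when simputation and norm are available.
if (requireNamespace("simputation", quietly = TRUE) &&
    requireNamespace("norm", quietly = TRUE)) {
  library(simputation)

  # Blank out a few values in every Species group (hypothetical test data).
  dat <- iris
  dat[c(1, 51, 101), "Sepal.Length"] <- NA
  dat[c(2, 52, 102), "Sepal.Width"]  <- NA

  # Empty left-hand side: every variable in the model is imputed.
  out_all <- impute_em(dat, ~ Sepal.Length + Sepal.Width + Petal.Length)

  # Named left-hand side: all three variables are in the model,
  # but only Sepal.Length is imputed in the output.
  out_one <- impute_em(dat, Sepal.Length ~ Sepal.Width + Petal.Length)

  # Grouping variable after "|": split by Species, impute, recombine.
  out_grp <- impute_em(dat, Sepal.Length ~ Sepal.Width + Petal.Length | Species)
}
```

In the second and third calls, `Sepal.Width` is part of the model but not on the left-hand side, so its missing values are expected to remain missing in the output.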

## Methodology

EM-based imputation with `impute_em` only works for numerical variables. These variables are assumed to follow a multivariate normal distribution, for which the means and covariance matrix are estimated using the EM algorithm of Dempster, Laird and Rubin (1977). The imputations are the expected values of the missing values, conditional on the observed values and the estimated parameters.
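To make the last step concrete, here is a minimal base-R sketch of conditional-mean imputation for a single record under a multivariate normal model. It is not the code used by `impute_em`: the helper name `impute_conditional_mean` is hypothetical, and `mu` and `Sigma` are made-up values standing in for the parameters that the EM algorithm would estimate.

```r
# Conditional-mean imputation for one record x, given a mean vector mu
# and a covariance matrix Sigma (assumed to be already estimated).
impute_conditional_mean <- function(x, mu, Sigma) {
  m <- is.na(x)                  # positions of missing values
  if (!any(m)) return(x)         # nothing to impute
  if (all(m)) return(mu)         # nothing observed: fall back to the mean
  o <- !m                        # positions of observed values
  # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
  adj <- Sigma[m, o, drop = FALSE] %*%
    solve(Sigma[o, o, drop = FALSE], x[o] - mu[o])
  x[m] <- mu[m] + as.vector(adj)
  x
}

# Hypothetical parameters: two standardised variables with correlation 0.5.
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

imputed_row <- impute_conditional_mean(c(2, NA), mu, Sigma)
```

With these parameters the missing second value is imputed as 0 + 0.5 * (2 - 0) = 1, so the imputation moves toward the observed value in proportion to the correlation.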

Multivariate Random Forest imputation with `impute_mf` works for numerical, categorical or mixed data types. It is based on the algorithm of Stekhoven and Buehlmann (2012). Missing values are first imputed with a rough guess, after which a predictive random forest is trained and used to re-impute the missing values. This is iterated until convergence.
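The guess, train, re-impute loop can be illustrated in base R. The sketch below is not the missForest algorithm itself: it swaps the random forest for a plain linear model (`lm`) so that it runs without extra packages, handles numeric columns only, and uses a simplified stopping rule. The function name `iterative_impute` and its arguments are hypothetical.

```r
# Iterative imputation sketch with lm() as a stand-in for a random forest.
iterative_impute <- function(dat, max_iter = 10, tol = 1e-6) {
  miss <- is.na(dat)                         # remember what was missing
  # Rough initial guess: fill each column with its observed mean.
  for (v in names(dat)) {
    dat[miss[, v], v] <- mean(dat[[v]], na.rm = TRUE)
  }
  targets <- names(dat)[colSums(miss) > 0]   # columns that need imputation
  for (iter in seq_len(max_iter)) {
    previous <- dat
    for (v in targets) {
      # Train on rows where v was observed, using all other columns.
      fit <- lm(reformulate(setdiff(names(dat), v), response = v),
                data = dat[!miss[, v], , drop = FALSE])
      # Re-impute the originally missing values of v.
      dat[miss[, v], v] <- predict(fit, newdata = dat[miss[, v], , drop = FALSE])
    }
    # Simplified convergence check: stop when imputations barely change.
    change <- sum((as.matrix(dat) - as.matrix(previous))^2)
    if (change < tol) break
  }
  dat
}

# Hypothetical test data: numeric iris columns with a few values removed.
dat <- iris[, 1:4]
dat[c(5, 20, 80), "Sepal.Length"] <- NA
dat[c(10, 20, 120), "Petal.Width"] <- NA
imputed <- iterative_impute(dat)
```

Only the cells that were originally missing are ever overwritten; observed values pass through unchanged.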

## References

Dempster, A.P., Laird, N.M. and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), pp.1-38.

Stekhoven, D.J. and Buehlmann, P., 2012. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp.112-118.

simputation documentation built on Sept. 16, 2021, 5:11 p.m.