missForest: Nonparametric Missing Value Imputation using Random Forests...

missForest

R Documentation

Nonparametric Missing Value Imputation using Random Forests (ranger or randomForest)

Description

missForest imputes missing values for mixed-type data (numeric and categorical). It models complex interactions and nonlinear relations and returns an out-of-bag (OOB) imputation error estimate. It supports parallel execution and offers two backends: ranger (default) and randomForest (legacy/compatibility).

Usage

missForest(xmis, maxiter = 10, ntree = 100, variablewise = FALSE,
           decreasing = FALSE, verbose = FALSE,
           mtry = floor(sqrt(ncol(xmis))), replace = TRUE,
           classwt = NULL, cutoff = NULL, strata = NULL,
           sampsize = NULL, nodesize = NULL, maxnodes = NULL,
           xtrue = NA, parallelize = c("no", "variables", "forests"),
           num.threads = NULL, backend = c("ranger", "randomForest"))

Arguments

`xmis`	A data frame or matrix with missing values. Columns are variables, rows are observations. All columns must be `numeric` or `factor` (character columns should be converted to factors beforehand).
`maxiter`	Maximum number of iterations unless the stopping criterion is met earlier.
`ntree`	Number of trees to grow in each per-variable forest.
`variablewise`	Logical. If `TRUE`, return an OOB error per variable; otherwise report one error for numeric variables (NRMSE) and one for factors (PFC).
`decreasing`	Logical. If `FALSE` (default), variables are processed in increasing order of missingness; if `TRUE`, in decreasing order.
`verbose`	Logical. If `TRUE`, print iteration-wise diagnostics (estimated error, runtime, and—if `xtrue` is given—the true error).
`mtry`	Number of candidate variables at each split. Passed to the backend (randomForest or ranger). Default is `floor(sqrt(p))`, where `p` is the number of columns of `xmis`.
`replace`	Logical. If `TRUE`, bootstrap sampling (with replacement) is used; otherwise subsampling (without replacement).
`classwt`	List of class priors for the categorical variables. Same list semantics as in randomForest: one element per variable (set `NULL` for numeric variables). With backend `"ranger"`, this maps to `class.weights`.
`cutoff`	List of per-class cutoff vectors for each categorical variable. As in randomForest, one element per factor variable. With backend `"ranger"`, cutoffs are emulated by fitting a probability forest and thresholding predicted class probabilities post-hoc.
`strata`	List of (factor) variables used for stratified sampling (legacy randomForest semantics). Ignored by ranger.
`sampsize`	List of sample sizes per variable (legacy randomForest semantics). With backend `"ranger"`, these are converted to `sample.fraction` (overall or per-class fractions, as appropriate).
`nodesize`	Minimum node size. A numeric vector of length 2: first entry for numeric variables, second for factor variables. Default: `c(5, 1)`. With backend `"ranger"`, this maps to `min.bucket` (no exact 1:1 mapping to randomForest's terminal-node semantics).
`maxnodes`	Maximum number of terminal nodes per tree. Used with backend `"randomForest"`. With `"ranger"`, this argument is ignored (consider `max.depth` at the ranger level if needed).
`xtrue`	Optional complete data matrix for benchmarking. If provided, the iteration log includes the true imputation error, and the return value includes it as `$error`.
`parallelize`	Should `missForest` run in parallel? One of `"no"`, `"variables"`, or `"forests"`. With `"variables"`, forests for different variables are built in parallel using a registered foreach backend. With `"forests"`, the forest for a single variable is built using the backend's threading (for `"ranger"`) or via foreach sub-forests (for `"randomForest"`). Which choice is faster depends on data shape and backend.
`num.threads`	Integer (or `NULL`). Number of threads for ranger. If `parallelize = "variables"`, per-variable ranger calls use `num.threads = 1` internally to avoid nested oversubscription. Otherwise, if `NULL`, ranger's default is used. Ignored by `"randomForest"`.
`backend`	Character. `"ranger"` (default) uses ranger for forest fitting; `"randomForest"` retains legacy behavior for compatibility.
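
The column-type requirement on `xmis` and the list layout of `classwt` can be illustrated in base R. This is a minimal sketch: the toy data frame `df` and the helper name `prepare_xmis` are hypothetical and not part of the package, and the equal class weights are placeholders only.

```r
# Hypothetical toy data with a character column and missing values.
df <- data.frame(
  height = c(1.2, NA, 3.4, 2.2),
  colour = c("red", "blue", NA, "red"),
  stringsAsFactors = FALSE
)

# Convert character columns to factors; leave other columns untouched.
prepare_xmis <- function(x) {
  x[] <- lapply(x, function(col) if (is.character(col)) factor(col) else col)
  x
}

xmis <- prepare_xmis(df)

# classwt layout: one list element per variable, NULL for numeric variables.
# Equal weights are used here purely as placeholders.
classwt <- lapply(xmis, function(col) {
  if (is.factor(col)) rep(1, nlevels(col)) else NULL
})
```

The resulting `xmis` contains only numeric and factor columns, and `classwt` has one entry per column of `xmis`.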

Details

Algorithm. The method iteratively imputes each variable with missing values by fitting a random forest on the observed part of that variable, using the current imputations of all other variables as predictors. After each iteration, the difference between the current and previous imputed matrices is computed separately for numeric and factor columns. The stopping criterion is met as soon as both differences have increased (if only one variable type is present, only that type's difference is considered). In that case, the imputation from the previous iteration (before the increase) is returned. Otherwise, the process stops after maxiter iterations.
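
The stopping rule described above can be sketched in base R. The difference measures below follow the definitions in the method's reference paper (relative squared change for numeric columns, share of changed entries among the originally missing factor entries). This section does not spell them out, so treat the formulas and the helper names `imputation_diffs` and `should_stop` as assumptions rather than the package's internal code.

```r
# Difference between two successive imputations, split by column type.
# n_na_factor: number of originally missing entries in factor columns.
imputation_diffs <- function(ximp_new, ximp_old, n_na_factor) {
  is_num <- vapply(ximp_new, is.numeric, logical(1))
  num_new <- as.matrix(ximp_new[is_num])
  num_old <- as.matrix(ximp_old[is_num])
  d_num <- if (any(is_num)) {
    sum((num_new - num_old)^2) / sum(num_new^2)
  } else NA_real_
  d_fac <- if (any(!is_num)) {
    # Count entries whose imputed category changed, column by column.
    changed <- mapply(function(a, b) sum(as.character(a) != as.character(b)),
                      ximp_new[!is_num], ximp_old[!is_num])
    sum(changed) / n_na_factor
  } else NA_real_
  c(numeric = d_num, factor = d_fac)
}

# Stop once every variable type that is present shows an increase
# relative to the previous iteration (absent types are NA and skipped).
should_stop <- function(diff_new, diff_old) {
  present <- !is.na(diff_new)
  all(diff_new[present] > diff_old[present])
}
```

In an imputation loop, `should_stop` would be evaluated after each iteration, and the previous iteration's imputation kept when it returns `TRUE`.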

Backends. With backend = "ranger", arguments are mapped as:

ntree -> num.trees
nodesize (numeric/factor) -> min.bucket for regression/classification, respectively (defaults used here are c(5, 1)).
sampsize (counts) -> sample.fraction (overall or per-class fractions).
classwt -> class.weights.
cutoff: emulated via probability forests and post-thresholding.
maxnodes: no direct equivalent in ranger (ignored).
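
As a rough illustration of the mapping above, the hypothetical helper `to_ranger_args` below translates missForest-style arguments for one variable into a named list using ranger's argument names (`num.trees`, `min.bucket`, `sample.fraction`, `class.weights`). It is a sketch under stated assumptions, not the package's internal code: in particular, dividing `sampsize` by the number of observations is only one plausible reading of the count-to-fraction conversion.

```r
# Hypothetical helper: build ranger-style arguments for a single variable.
to_ranger_args <- function(ntree, nodesize, is_factor, n_obs,
                           sampsize = NULL, classwt = NULL) {
  args <- list(
    num.trees  = ntree,
    # nodesize = c(numeric, factor): pick the entry matching the variable type.
    min.bucket = if (is_factor) nodesize[2] else nodesize[1]
  )
  # Assumed conversion: counts divided by the number of observations.
  if (!is.null(sampsize)) args$sample.fraction <- sampsize / n_obs
  # Class weights only apply to factor (classification) targets.
  if (is_factor && !is.null(classwt)) args$class.weights <- classwt
  args
}

# Numeric target: 100 trees, min.bucket 5, half of 150 observations sampled.
args_num <- to_ranger_args(ntree = 100, nodesize = c(5, 1), is_factor = FALSE,
                           n_obs = 150, sampsize = 75)

# Factor target with three classes and placeholder weights.
args_fac <- to_ranger_args(ntree = 100, nodesize = c(5, 1), is_factor = TRUE,
                           n_obs = 150, classwt = c(1, 2, 1))
```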

The reported OOB error uses ranger's $prediction.error (MSE for numeric, error rate for factors), except when cutoff is used: in that case, the misclassification rate is computed by applying the cutoffs to OOB class probabilities.
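
The post-hoc thresholding mentioned above can be sketched as follows, assuming randomForest's convention that the predicted class is the one with the largest ratio of class probability to cutoff. The helper name `apply_cutoff` and the toy probabilities are hypothetical; the package's internal implementation may differ.

```r
# Pick, for each row, the class with the largest probability / cutoff ratio.
apply_cutoff <- function(prob, cutoff) {
  ratio <- sweep(prob, 2, cutoff, "/")
  factor(colnames(prob)[max.col(ratio, ties.method = "first")],
         levels = colnames(prob))
}

# Toy OOB class probabilities for two observations and three classes.
prob <- matrix(c(0.5, 0.4, 0.1,
                 0.2, 0.3, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("a", "b", "c")))
cutoff <- c(0.6, 0.3, 0.1)

pred <- apply_cutoff(prob, cutoff)

# Misclassification rate against (hypothetical) observed classes.
oob_truth <- factor(c("a", "c"), levels = c("a", "b", "c"))
pfc <- mean(pred != oob_truth)
```

With these cutoffs the first observation is assigned to class `b` even though `a` has the highest raw probability, which is the effect a cutoff is meant to produce.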

Parallelization. Two modes are available via parallelize:

"variables": different variables are imputed in parallel using foreach; per-variable ranger calls use num.threads = 1.
"forests": a single variable’s forest is built using ranger multithreading (controlled by num.threads) or, for "randomForest", by combining sub-forests via foreach.

Make sure you have registered a parallel backend if you choose a parallel mode.

See the vignette for further examples and discussion.

Value

`ximp`	Imputed data matrix (same classes as `xmis`).
`OOBerror`	Estimated OOB imputation error. For numeric variables, the normalized root mean squared error (NRMSE); for factors, the proportion falsely classified (PFC). If `variablewise = TRUE`, a vector of length `p` with per-variable errors is returned (labeled `"MSE"` for numeric and `"PFC"` for factors).
`error`	True imputation error (NRMSE/PFC), present only if `xtrue` was given.
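
A minimal sketch of the two error measures, evaluated only on the entries that were missing in `xmis`. Normalizing the squared error by the variance of the true values follows the reference paper; the function names `nrmse_sketch` and `pfc_sketch` are hypothetical and the inputs are toy vectors.

```r
# NRMSE over the originally missing numeric entries.
nrmse_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

# PFC: share of originally missing categorical entries imputed incorrectly.
pfc_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp[mis]) != as.character(xtrue[mis]))
}

# Toy numeric example.
xtrue <- c(1, 2, 3, 4)
xmis  <- c(1, NA, 3, NA)
ximp  <- c(1, 3, 3, 4)
err_num <- nrmse_sketch(ximp, xmis, xtrue)

# Toy categorical example.
ftrue <- c("a", "b", "a", "b")
fmis  <- c("a", NA, NA, "b")
fimp  <- c("a", "a", "a", "b")
err_fac <- pfc_sketch(fimp, fmis, ftrue)
```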

Author(s)

Daniel J. Stekhoven [aut, cre]

References

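Stekhoven, D. J. and Bühlmann, P. (2012). MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.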

Examples

## Mixed-type imputation on iris:
data(iris)
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

## Default: ranger backend
imp_rg <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_rg$OOBerror
imp_rg$error  # requires xtrue

## Legacy behavior: randomForest backend
imp_rf <- missForest(iris.mis, backend = "randomForest", verbose = TRUE)

## Parallel examples (register a backend first, e.g., doParallel):
## Not run:
# library(doParallel)
# registerDoParallel(2)
# imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)
# imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
#                        num.threads = 2)  # used by ranger
## End(Not run)

missForest documentation built on Nov. 5, 2025, 6 p.m.