

missForest    R Documentation

Nonparametric Missing Value Imputation using Random Forests (ranger or randomForest)

Description

missForest imputes missing values for mixed-type data (numeric and categorical). It models complex interactions and nonlinear relations and returns an out-of-bag (OOB) imputation error estimate. It supports parallel execution and offers two backends: ranger (default) and randomForest (legacy/compatibility).

Usage

missForest(xmis, maxiter = 10, ntree = 100, variablewise = FALSE,
           decreasing = FALSE, verbose = FALSE,
           mtry = floor(sqrt(ncol(xmis))), replace = TRUE,
           classwt = NULL, cutoff = NULL, strata = NULL,
           sampsize = NULL, nodesize = NULL, maxnodes = NULL,
           xtrue = NA, parallelize = c("no", "variables", "forests"),
           num.threads = NULL, backend = c("ranger", "randomForest"))

Arguments

xmis

A data frame or matrix with missing values. Columns are variables, rows are observations. All columns must be numeric or factor (character columns should be converted to factors beforehand).
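A minimal sketch of the conversion mentioned above, using base R only. The data frame `df` and its column names are hypothetical and serve only to illustrate the step.

```r
# Hypothetical example data with one character column
df <- data.frame(
  height = c(1.62, NA, 1.80, 1.75),
  group  = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; other columns are left untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```

Missing entries in a character column stay NA after conversion, so they are still treated as values to impute.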

maxiter

Maximum number of iterations unless the stopping criterion is met earlier.

ntree

Number of trees to grow in each per-variable forest.

variablewise

Logical. If TRUE, return an OOB error per variable; otherwise report one error for numeric variables (NRMSE) and one for factors (PFC).

decreasing

Logical. If FALSE (default), variables are processed in increasing order of missingness; if TRUE, in decreasing order.

verbose

Logical. If TRUE, print iteration-wise diagnostics (estimated error, runtime, and—if xtrue is given—the true error).

mtry

Number of candidate variables sampled at each split. Passed to the backend (ranger or randomForest). Default is floor(sqrt(p)), where p = ncol(xmis).

replace

Logical. If TRUE, bootstrap sampling (with replacement) is used; otherwise subsampling (without replacement).

classwt

List of class priors for the categorical variables. Same list semantics as in randomForest: one element per variable (set NULL for numeric variables). With backend "ranger", this maps to class.weights.
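A sketch of how such a list could be built for iris, assuming the list is indexed by column position as described above. The weights c(1, 1, 2) are arbitrary illustration values, not a recommendation.

```r
data(iris)

# One list element per column; vector("list", n) starts with all elements NULL,
# which is what the numeric columns need
classwt <- vector("list", ncol(iris))

# Set priors only for the factor column (Species has three levels)
species_col <- which(names(iris) == "Species")
classwt[[species_col]] <- c(1, 1, 2)

# Would then be passed as: missForest(iris.mis, classwt = classwt)
```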

cutoff

List of per-class cutoff vectors for each categorical variable. As in randomForest, one element per factor variable. With backend "ranger", cutoffs are emulated by fitting a probability forest and thresholding predicted class probabilities post-hoc.
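The thresholding idea can be sketched in base R. This follows the rule documented for randomForest (the winning class is the one with the largest ratio of vote proportion to cutoff). The probability matrix and cutoff values are made up, and whether the package applies exactly this rule internally is an assumption.

```r
# Hypothetical class probabilities: rows are observations, columns are classes
probs <- matrix(
  c(0.50, 0.30, 0.20,
    0.40, 0.35, 0.25,
    0.10, 0.20, 0.70),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Hypothetical per-class cutoffs
cutoff <- c(setosa = 0.5, versicolor = 0.25, virginica = 0.25)

# Divide each probability by its class cutoff, then pick the largest ratio per row
ratio <- sweep(probs, 2, cutoff, "/")
pred  <- factor(colnames(probs)[max.col(ratio, ties.method = "first")],
                levels = colnames(probs))
pred
```

With these values the first two observations are assigned to versicolor rather than to setosa, the class with the highest raw probability, because the larger setosa cutoff penalizes that class.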

strata

List of (factor) variables used for stratified sampling (legacy randomForest semantics). Ignored by ranger.

sampsize

List of sample sizes per variable (legacy randomForest semantics). With backend "ranger", these are converted to sample.fraction (overall or per-class fractions, as appropriate).
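One plausible form of this conversion is sketched below, assuming counts are simply divided by the number of rows used for fitting. The exact internal conversion is not specified here, and all numbers are hypothetical.

```r
n <- 150  # hypothetical number of rows available for fitting

# Overall sample size -> a single fraction
sampsize_overall <- 100
frac_overall <- sampsize_overall / n

# Per-class sample sizes -> one fraction per class, relative to n (assumption)
sampsize_class <- c(setosa = 30, versicolor = 30, virginica = 15)
frac_class <- sampsize_class / n

frac_overall
frac_class
```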

nodesize

Minimum node size. A numeric vector of length 2: first entry for numeric variables, second for factor variables. Default: c(5, 1). With backend "ranger", this maps to min.bucket (no exact 1:1 mapping to randomForest's terminal-node semantics).

maxnodes

Maximum number of terminal nodes per tree. Used with backend "randomForest". With "ranger", this argument is ignored (consider max.depth at the ranger level if needed).

xtrue

Optional complete data matrix for benchmarking. If provided, the iteration log includes the true imputation error, and the return value includes it as $error.

parallelize

Should missForest run in parallel? One of "no", "variables", or "forests".

"variables"

Forests for different variables are built in parallel using a registered foreach backend.

"forests"

Within a variable, the forest is built using the backend's threading (for "ranger") or via foreach sub-forests (for "randomForest").

Which choice is faster depends on data shape and backend.

num.threads

Integer (or NULL). Number of threads for ranger. If parallelize = "variables", per-variable ranger calls use num.threads = 1 internally to avoid nested oversubscription. Otherwise, if NULL, ranger's default is used. Ignored by "randomForest".

backend

Character. "ranger" (default) uses ranger for forest fitting; "randomForest" retains legacy behavior for compatibility.

Details

Algorithm. The method iteratively imputes each variable with missing values by fitting a random forest on the observed part of that variable, using the current imputations of all other variables as predictors. After each iteration, the difference between the current and previous imputed matrices is computed separately for numeric and factor columns. The stopping criterion is met as soon as both differences have increased at least once (or, if the data contain only one variable type, as soon as that type's difference has increased). In that case, the imputation from the previous iteration (before the increase) is returned. Otherwise, the process stops after maxiter iterations.
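The difference measures and the stopping check can be sketched as follows. The formulas follow the definitions in the missForest paper (relative squared change for numeric entries, share of changed entries for factors). The function names are hypothetical, and `should_stop` encodes one simplified reading of the rule in which all tracked differences increase in the same iteration.

```r
# Relative squared change between successive imputations (numeric columns)
diff_numeric <- function(new, old) {
  sum((new - old)^2) / sum(new^2)
}

# Share of entries whose category changed (factor columns)
diff_factor <- function(new, old) {
  mean(as.character(new) != as.character(old))
}

# Simplified stopping check: TRUE once every tracked difference has increased
# relative to the previous iteration (one entry per variable type present)
should_stop <- function(diff_new, diff_old) {
  all(diff_new > diff_old)
}

# Tiny illustration with made-up values
diff_numeric(c(1, 2, 3, 5), c(1, 2, 3, 4))
diff_factor(factor(c("a", "b", "a")), factor(c("a", "a", "a")))
should_stop(c(0.20, 0.10), c(0.10, 0.05))
```

When the check returns TRUE, the caller would return the imputation saved from the previous iteration rather than the current one.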

Backends. With backend = "ranger", arguments are mapped as:

  • ntree -> num.trees

  • nodesize (numeric/factor) -> min.bucket for regression/classification, respectively (defaults used here are c(5, 1)).

  • sampsize (counts) -> sample.fraction (overall or per-class fractions).

  • classwt -> class.weights.

  • cutoff: emulated via probability forests and post-thresholding.

  • maxnodes: no direct equivalent in ranger (ignored).

The reported OOB error uses ranger's $prediction.error (MSE for numeric, error rate for factors), except when cutoff is used: in that case, the misclassification rate is computed by applying the cutoffs to OOB class probabilities.

Parallelization. Two modes are available via parallelize:

  • "variables": different variables are imputed in parallel using foreach; per-variable ranger calls use num.threads = 1.

  • "forests": a single variable’s forest is built using ranger multithreading (controlled by num.threads) or, for "randomForest", by combining sub-forests via foreach.

Make sure you have registered a parallel backend if you choose a parallel mode.

See the vignette for further examples and discussion.

Value

ximp

Imputed data matrix (same classes as xmis).

OOBerror

Estimated OOB imputation error. For numeric variables, the normalized root mean squared error (NRMSE); for factors, the proportion falsely classified (PFC). If variablewise = TRUE, a vector of length p with per-variable errors is returned (labeled "MSE" for numeric and "PFC" for factors).

error

True imputation error (NRMSE/PFC), present only if xtrue was given.
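For reference, the two error measures can be written out in a few lines of base R. This is a sketch based on the definitions in the missForest paper, evaluated only on entries that were missing in xmis. The helper names `nrmse_sketch` and `pfc_sketch` are hypothetical; the package's mixError function is the supported way to compute these values.

```r
# NRMSE over the originally missing numeric entries
nrmse_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

# PFC: share of originally missing categorical entries imputed incorrectly
pfc_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp[mis]) != as.character(xtrue[mis]))
}

# Made-up numeric example: three of five values were missing
x_true <- c(1, 2, 3, 4, 5)
x_mis  <- c(1, NA, 3, NA, NA)
x_imp  <- c(1, 2.5, 3, 4, 4)
nrmse_sketch(x_imp, x_mis, x_true)

# Made-up categorical example: two of four values were missing
f_true <- c("a", "b", "a", "b")
f_mis  <- c("a", NA, NA, "b")
f_imp  <- c("a", "b", "b", "b")
pfc_sketch(f_imp, f_mis, f_true)
```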

Author(s)

Daniel J. Stekhoven [aut, cre]

References

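Stekhoven, D. J. and Bühlmann, P. (2012). MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi:10.1093/bioinformatics/btr597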

See Also

mixError, prodNA, randomForest, ranger

Examples

## Mixed-type imputation on iris:
library(missForest)  # provides missForest() and prodNA()
data(iris)
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

## Default: ranger backend
imp_rg <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_rg$OOBerror
imp_rg$error  # requires xtrue

## Legacy behavior: randomForest backend
imp_rf <- missForest(iris.mis, backend = "randomForest", verbose = TRUE)

## Parallel examples (register a backend first, e.g., doParallel):
## Not run:
# library(doParallel)
# registerDoParallel(2)
# imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)
# imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
#                        num.threads = 2)  # used by ranger
## End(Not run)

missForest documentation built on Nov. 5, 2025, 6 p.m.