---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: "margin=2.5cm, top=3cm, bottom=2.5cm"
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{missForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 5.33,
  fig.height = 3,
  fig.align = "center"
)
options(width = 70, prompt = "> ", continue = "+ ")
```
# Introduction

This package vignette is a practical, application-focused user guide for the R package missForest. We'll walk through the workflow on real datasets, discuss argument choices with a keen eye on feasibility and accuracy, and keep an occasional smile. Don't be alarmed by the length; most of it is friendly R output for illustration.
This document is not a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.
# The missForest algorithm (with ranger by default)

missForest is a nonparametric imputation method for basically any kind of tabular data. It handles mixed types (numeric and categorical), nonlinear relations, interactions, and even high dimensionality ($p \gg n$). For each variable with missingness, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or maxiter says "enough").
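To make the iteration concrete, here is a deliberately simplified sketch for numeric-only data, written directly against ranger. The function simple_missforest and all of its implementation details are ours and purely illustrative; the real missForest() additionally handles mixed types, visits variables in order of increasing missingness, and computes the OOB error estimates and options discussed below.

```{r algorithm-sketch}
library(ranger)

# Illustrative mini-implementation for a numeric data frame with NAs.
simple_missforest <- function(X, maxiter = 10, num.trees = 100) {
  X <- as.data.frame(X)
  na.idx <- lapply(X, function(col) which(is.na(col)))
  # 1. Initialize missing entries with the column mean
  for (v in names(X)) X[[v]][na.idx[[v]]] <- mean(X[[v]], na.rm = TRUE)
  diff.old <- Inf
  for (iter in seq_len(maxiter)) {
    X.old <- X
    # 2. Re-impute each variable from a forest trained on its observed part
    for (v in names(X)) {
      mis <- na.idx[[v]]
      if (length(mis) == 0L) next
      fit <- ranger(dependent.variable.name = v,
                    data = X[-mis, , drop = FALSE], num.trees = num.trees)
      X[[v]][mis] <- predict(fit, data = X[mis, , drop = FALSE])$predictions
    }
    # 3. Stop once the change between successive imputations increases,
    #    and return the previous (better) iteration
    diff.new <- sum((as.matrix(X) - as.matrix(X.old))^2) / sum(as.matrix(X)^2)
    if (diff.new > diff.old) return(X.old)
    diff.old <- diff.new
  }
  X
}
# e.g. simple_missforest(prodNA(iris[, 1:4], noNA = 0.1))
```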
missForest() now uses the ranger backend by default, for speed and multithreading; the legacy behaviour remains available via backend = "randomForest". The out-of-bag (OOB) error from the backend is transformed into an imputation error estimate, one for numeric variables (normalized root mean squared error, NRMSE) and one for factors (proportion of falsely classified entries, PFC). This estimate has been shown to be a good proxy of the true error [@stekhoven11].
# Installation

From CRAN:

```{r install, eval=FALSE}
install.packages("missForest", dependencies = TRUE)
```
The ranger backend is used automatically if it is installed; the legacy randomForest backend can be selected explicitly instead.
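For instance (a non-evaluated sketch; your.data is a placeholder for your own data frame):

```{r backend-choice, eval=FALSE}
imp.fast   <- missForest(your.data, backend = "ranger")        # the default
imp.legacy <- missForest(your.data, backend = "randomForest")  # legacy backend
```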
# Using missForest

We'll start with a small walk-through on iris, sprinkle in performance hints, and then get fancy with parallelization for big jobs.
We use the following datasets:

- iris, via data(iris) [@anderson35],
- esoph, via data(esoph) [@breslow80],
- a larger benchmark dataset whose chunks are set to eval = FALSE during the CRAN build; see @UCI10 for details.

## missForest in a nutshell

Load the package:
```{r load}
library(missForest)
```
Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):
```{r make-missing}
set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
```
Impute:
```{r impute}
set.seed(81)
iris.imp <- missForest(iris.mis)  # default backend = "ranger"
```
The result is a list with:
- iris.imp$ximp – the imputed data matrix,
- iris.imp$OOBerror – the estimated imputation error(s).

A common gotcha (we've all done it): use iris.imp$ximp (not iris.imp) in subsequent analyses.
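For instance, extract the completed data before any downstream modeling (iris.comp is just an illustrative name):

```{r extract-ximp}
iris.comp <- iris.imp$ximp  # the completed data frame, ready for analysis
head(iris.comp)
```

The estimated imputation error, in turn, lives in OOBerror: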
```{r oob}
iris.imp$OOBerror
```
Because iris has both numeric and categorical variables, you see two numbers: NRMSE (numeric) and PFC (factors). Both are better when closer to 0.
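For reference, here is a small sketch of the two measures as defined in @stekhoven11. The helper names nrmse and pfc are ours; missForest evaluates these on the imputed entries, and the OOB estimates replace the (unknown) truth with out-of-bag predictions.

```{r error-measures}
# NRMSE: normalized root mean squared error, for numeric variables
nrmse <- function(x.imp, x.true) {
  sqrt(mean((x.true - x.imp)^2) / stats::var(x.true))
}
# PFC: proportion of falsely classified entries, for factor variables
pfc <- function(f.imp, f.true) mean(f.imp != f.true)
```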
If you prefer per-variable diagnostics (for that post-imputation feature selection debate), use variablewise = TRUE; the errors are then reported per variable (MSE for numeric variables, PFC for factors):
```{r variablewise}
imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror
```
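To read the per-variable output more comfortably, you can attach the variable names (this assumes the errors come back in column order, which is how the current implementation reports them):

```{r variablewise-table}
# Pair each per-variable error with its variable name (assumes column order)
setNames(imp_var$OOBerror, names(iris.mis))
```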
## verbose = TRUE

Want to watch it think? Switch on diagnostics:
```{r verbose}
set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
imp_verbose$OOBerror
```
You'll see the estimated error(s), the difference(s) between consecutive iterations, and the time per iteration. When the difference starts to increase for each variable type present, the algorithm stops and returns the previous iteration's imputation.
## maxiter

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can bound the run time with maxiter, or deliberately pick an earlier iteration.
```{r maxiter}
set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
esoph.imp$OOBerror
```
## ntree and mtry

- ntree scales linearly with computation time. It defaults to 100; values in the tens often work well.
- mtry = floor(sqrt(p)) is a robust default, but tuning can pay off on complex data.

Demonstration on a bigger matrix (timings only):
```{r musk, eval=FALSE}
# musk <- ...  # (not fetched during CRAN build)
musk.mis <- prodNA(musk, noNA = 0.05)
missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)
```
As you might guess, fewer trees → fewer minutes, at a modest cost in error.
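If you want to see the trade-off yourself, a quick (and admittedly crude) timing on the small iris example works fine; absolute numbers will vary by machine, but the roughly linear trend in ntree should show:

```{r ntree-timing}
system.time(missForest(iris.mis, ntree = 100))
system.time(missForest(iris.mis, ntree = 20))
```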
## replace = FALSE

If you set replace = FALSE, sampling is done without replacement using about 0.632 * n observations per tree (roughly 1 - exp(-1), the expected fraction of distinct observations in a bootstrap sample; using all n would leave no OOB observations). Sometimes it helps, sometimes not:
```{r no-replace}
set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
imp_sub$OOBerror
```
## classwt, cutoff, strata, and sampsize

These let you focus the classifier for factor variables (and the sampling for both types). Each argument is a list with one entry per variable (use NULL or 1 where not applicable).

A quick note on backends: for the cutoff example we explicitly use the legacy randomForest backend. The default ranger backend handles cutoffs by fitting a probability forest and then post-thresholding, but its predict() method requires passing the training data for non-quantile prediction, and a faithful OOB probability-based estimate is more involved than a vignette should be. The randomForest backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.
```{r classwt-etc}
# Per-variable sample sizes: numeric variables take a single integer;
# factors need one value per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)

# Per-class cutoffs (factors only); with the ranger backend, cutoffs are
# emulated via probability forests, hence the legacy backend here
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")

# Class weights (factors only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)
```
## nodesize and maxnodes

nodesize is a vector of length two: the first entry applies to numeric variables, the second to factors.

- Our package defaults are c(5, 1) (yes: numeric = 5, factor = 1).
- With backend = "ranger", nodesize maps to min.bucket; maxnodes is ignored (consider ranger's max.depth if needed).
- With backend = "randomForest", both behave as in randomForest.

```{r nodesize}
imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
```
## xtrue and mixError

If you have a ground truth (or simulate one), supply xtrue to log the true error per iteration. The return value then includes $error.
```{r xtrue}
set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_bench$error

# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual
```
## parallelize and num.threads

We offer two modes:

- parallelize = "variables": different variables are imputed in parallel using a registered foreach backend. To avoid nested oversubscription, per-variable ranger calls use num.threads = 1 internally.
- parallelize = "forests": a single variable's forest is built with ranger multithreading (set num.threads) or, with randomForest, by combining sub-forests via foreach.
Register a backend first (example with doParallel):
```{r parallel}
library(doParallel)
registerDoParallel(2)

# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)

# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
                       num.threads = 2)
```
Which one is faster? It depends on your data and machine. Try both when in doubt (and coffee is brewing).
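A crude but honest way to decide is to time both modes on your own data; this sketch reuses the backend registered above and frees the workers afterwards:

```{r mode-timing}
t.vars <- system.time(missForest(iris.mis, parallelize = "variables"))
t.fors <- system.time(missForest(iris.mis, parallelize = "forests",
                                 num.threads = 2))
rbind(variables = t.vars, forests = t.fors)[, "elapsed"]

stopImplicitCluster()  # release the workers registered above
```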
# Concluding remarks

Imputation with missForest is straightforward, and the OOB error estimates help you judge quality at a glance. Do remember: imputation does not add information; it helps retain partially observed rows for downstream analyses that prefer complete cases. For broader perspectives, see @schafer97 and @little87.
# Acknowledgements

We thank Steve Weston for contributions regarding parallel computation ideas and tools in the R ecosystem.

# References {-}