Using the missForest Package

---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: margin=2.5cm, top=3cm, bottom=2.5cm
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{missForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE,
  fig.width = 5.33, fig.height = 3, fig.align = "center"
)
options(width = 70, prompt = "> ", continue = "+ ")

Introduction

What is this document? (And what it isn't!)

This package vignette is a practical, application-focused user guide for the R package missForest. We’ll walk through the workflow on real datasets, discuss argument choices with a keen eye on feasibility and accuracy, and keep an occasional smile. Don’t be alarmed by the length — most of it is friendly R output for illustration.

This document is not a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.

The missForest algorithm (with ranger by default)

missForest is a nonparametric imputation method for basically any kind of tabular data. It handles mixed types (numeric + categorical), nonlinear relations, interactions, and even high dimensionality ($p \gg n$). For each variable with missingness, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or maxiter says "enough").
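To make the loop concrete, here is a deliberately simplified, runnable sketch of one missForest-style pass. It assumes the ranger package, skips the stopping rule and the OOB bookkeeping, and the helper name sketch_pass is ours; it is not the package's actual implementation:

library(ranger)

sketch_pass <- function(xmis, num.trees = 100) {
  x <- xmis
  ## 1) crude initial fill: mean for numeric, most frequent level for factors
  for (v in names(x)) {
    miss <- is.na(x[[v]])
    if (!any(miss)) next
    x[[v]][miss] <- if (is.numeric(x[[v]])) {
      mean(x[[v]], na.rm = TRUE)
    } else {
      names(which.max(table(x[[v]])))
    }
  }
  ## 2) one refinement pass: for each variable with missing entries, fit a forest
  ##    on the rows where it was observed and predict the rows where it was not
  for (v in names(x)) {
    miss <- is.na(xmis[[v]])
    if (!any(miss)) next
    fit <- ranger(dependent.variable.name = v,
                  data = x[!miss, , drop = FALSE],
                  num.trees = num.trees)
    x[[v]][miss] <- predict(fit, data = x[miss, , drop = FALSE])$predictions
  }
  x  ## missForest repeats step 2 until the imputations stop changing much
}

In practice you would of course call missForest() itself, which adds the iteration loop, the OOB error estimate, and all the tuning arguments discussed below.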

The out-of-bag (OOB) error from the backend is transformed into an imputation error estimate: NRMSE for numeric variables and PFC for factors. This estimate has been shown to be a good proxy of the true imputation error [@stekhoven11].

Installation

From CRAN:

install.packages("missForest", dependencies = TRUE)

The ranger backend is used by default whenever it is installed; otherwise, or when you request it explicitly, the package falls back to the legacy randomForest backend.
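If you want to control this explicitly, install ranger alongside missForest and switch engines per call via the backend argument (the same argument used in the cutoff example further down). A minimal sketch:

install.packages(c("missForest", "ranger"))

# per-call backend selection; iris.mis is created in the walk-through that follows
# missForest(iris.mis, backend = "randomForest")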


Missing value imputation with missForest

We’ll start with a small walk-through on iris, sprinkle in performance hints, and then get fancy with parallelization for big jobs.

Data for illustrations

Throughout the vignette we rely on small, readily available data sets: iris for the main walk-through, esoph for the maxiter example, and a larger musk matrix (not fetched during the CRAN build) for timing remarks. Missing values are always generated artificially with prodNA().

missForest in a nutshell

Load the package:

library(missForest)

Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):

set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)

Impute:

set.seed(81)
iris.imp <- missForest(iris.mis)  # default backend = "ranger"

The result is a list with:

  ximp: the imputed data set (same dimensions and variable types as the input),
  OOBerror: the estimated out-of-bag imputation error, and
  error: the true imputation error (only returned when xtrue is supplied).

A common gotcha (we’ve all done it): use iris.imp$ximp (not iris.imp) in subsequent analyses.

iris.imp$OOBerror

Because iris has both numeric and categorical variables, you see two numbers: NRMSE (normalized root mean squared error) for the numeric variables and PFC (proportion of falsely classified entries) for the factors. In both cases values close to 0 indicate good performance and values near 1 poor performance.
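For completeness, these measures (which the OOB values estimate) are defined in @stekhoven11 over the imputed, i.e. originally missing, entries only:

$$
\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}\left(\left(\mathbf{X}^{\mathrm{true}} - \mathbf{X}^{\mathrm{imp}}\right)^{2}\right)}{\operatorname{var}\left(\mathbf{X}^{\mathrm{true}}\right)}},
$$

while PFC is simply the proportion of falsely classified entries among the imputed categorical values.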

If you prefer per-variable diagnostics (for that post-imputation feature selection debate), use variablewise = TRUE:

imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror

Additional iteration output with verbose = TRUE

Want to watch it think? Switch on diagnostics:

set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
imp_verbose$OOBerror

You’ll see estimated error(s), difference(s) between iterations, and time per iteration. When differences increase (by type), the algorithm returns the previous iteration’s imputation.
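The "differences" reported per type are those defined in @stekhoven11: with $\mathbf{N}$ the set of continuous and $\mathbf{F}$ the set of categorical variables,

$$
\Delta_{\mathbf{N}} = \frac{\sum_{j \in \mathbf{N}} \left(\mathbf{x}^{\mathrm{imp,new}}_{j} - \mathbf{x}^{\mathrm{imp,old}}_{j}\right)^{2}}{\sum_{j \in \mathbf{N}} \left(\mathbf{x}^{\mathrm{imp,new}}_{j}\right)^{2}},
\qquad
\Delta_{\mathbf{F}} = \frac{\#\left\{\text{entries where } \mathbf{x}^{\mathrm{imp,new}} \neq \mathbf{x}^{\mathrm{imp,old}}\right\}}{\#\mathrm{NA}_{\mathbf{F}}},
$$

where $\#\mathrm{NA}_{\mathbf{F}}$ is the number of missing entries in the categorical variables.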

Changing the number of iterations: maxiter

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can guard time with maxiter, or deliberately pick an earlier iteration.

set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
esoph.imp$OOBerror

Speed/accuracy trade-off: ntree and mtry

Demonstration on a bigger matrix (timings only):

# musk <- ...  # (not fetched during CRAN build)
# musk.mis <- prodNA(musk, 0.05)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)

As you might guess, fewer trees mean fewer minutes, at a modest cost in error. The companion argument mtry (the number of variables tried at each split, floor(sqrt(p)) by default) pulls in the other direction: larger values can reduce the error when many variables are informative, but make every split, and hence every forest, more expensive.
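To get a feel for the trade-off without downloading anything, here is a small, runnable comparison on iris.mis; the particular settings are only illustrative and the exact OOB numbers will vary with the backend and data:

set.seed(81)
imp_fast <- missForest(iris.mis, ntree = 20)             # fewer trees: faster
set.seed(81)
imp_slow <- missForest(iris.mis, ntree = 500, mtry = 3)  # more trees, larger mtry: slower
rbind(fast = imp_fast$OOBerror, slow = imp_slow$OOBerror)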

Subsampling instead of bootstrapping: replace = FALSE

If you set replace = FALSE, the sampler draws about 0.632 * n observations without replacement; this matches the expected share of distinct observations in a bootstrap sample (roughly 1 - 1/e), so an out-of-bag set of comparable size remains instead of vanishing. Sometimes it helps, sometimes not:

set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
imp_sub$OOBerror

Imbalanced data & sampling controls: classwt, cutoff, strata, sampsize

These let you focus the classifier for factor variables (and the sampling for both variable types). Each argument is a list with one entry per variable; use NULL (or 1 for cutoff) where an entry does not apply.

A quick note on backends: for the cutoff example we explicitly use the legacy randomForest backend. The default ranger backend handles cutoffs by fitting a probability forest and post-thresholding, but its predict() method requires passing the training data for non-quantile prediction, and a faithful OOB probability-based estimate is more involved to reproduce in a vignette. The randomForest backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.

# Per-variable samples: numeric use single integers; factors need a vector per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)

# Per-class cutoffs (factor only). With ranger backend, cutoffs are emulated via probability forests.
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")

# Class weights (factor only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)

Tree shape controls: nodesize and maxnodes

nodesize takes a vector of length two: the minimum terminal node size for continuous variables (first entry) and for factor variables (second entry). Larger values yield smaller trees and faster runs, at a possible cost in accuracy; maxnodes instead caps the number of terminal nodes per tree.

imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
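If you also want to bound tree size directly, maxnodes can be set as well; a minimal sketch, assuming a single cap applied to every forest (check ?missForest for the exact form in your installed version):

# cap every tree at a handful of terminal nodes: coarser, faster fits
imp_maxn <- missForest(iris.mis, maxnodes = 12)
imp_maxn$OOBerror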

Benchmarking with a complete matrix: xtrue and mixError

If you have a ground truth (or simulate one), supply xtrue to log the true error per iteration. The return value then includes $error.

set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_bench$error

# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual

Parallelization: parallelize and num.threads

We offer two modes:

  1. parallelize = "variables" Different variables are imputed in parallel using a registered foreach backend. To avoid nested oversubscription, per-variable ranger calls use num.threads = 1 internally.

  2. parallelize = "forests" A single variable’s forest is built with ranger multithreading (set num.threads) or, with randomForest, by combining sub-forests via foreach.

Register a backend first (example with doParallel):

library(doParallel)
registerDoParallel(2)
# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)

# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE, num.threads = 2)
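# When you are done, release the workers registered above (doParallel helper;
# harmless if no implicit cluster was created)
stopImplicitCluster()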

Which one is faster? It depends on your data and machine. Try both when in doubt (and coffee is brewing).


Concluding remarks

Imputation with missForest is straightforward, and OOB errors help you judge quality at a glance. Do remember: imputation does not add information; it helps retain partially observed rows for downstream analyses that prefer complete cases. For broader perspectives, see @schafer97 and @little87.

We thank Steve Weston for contributions regarding parallel computation ideas and tools in the R ecosystem.

References


