---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: "margin=2.5cm, top=3cm, bottom=2.5cm"
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{missForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 5.33,
  fig.height = 3,
  fig.align = "center"
)
options(width = 70, prompt = "> ", continue = "+ ")
```
# Introduction

This package vignette is a practical, application-focused user guide for the R package missForest. We'll walk through the workflow on real datasets, discuss argument choices with a keen eye on feasibility and accuracy, and keep an occasional smile. Don't be alarmed by the length; most of it is friendly R output for illustration.
This document is not a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.
# The missForest algorithm (with ranger by default)

missForest is a nonparametric imputation method for basically any kind of tabular data. It handles mixed types (numeric and categorical), nonlinear relations, interactions, and even high dimensionality ($p \gg n$). For each variable with missingness, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or maxiter says "enough").
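To make the iteration concrete, here is a deliberately simplified sketch for numeric-only data, written directly against ranger. The function simple_missforest and all of its implementation details are ours and purely illustrative; the real missForest() additionally handles mixed types, visits variables in order of increasing missingness, and computes the OOB error estimates and options discussed below.

```{r algorithm-sketch}
library(ranger)

# Illustrative mini-implementation for a numeric data frame with NAs.
simple_missforest <- function(X, maxiter = 10, num.trees = 100) {
  X <- as.data.frame(X)
  na.idx <- lapply(X, function(col) which(is.na(col)))
  # 1. Initialize missing entries with the column mean
  for (v in names(X)) X[[v]][na.idx[[v]]] <- mean(X[[v]], na.rm = TRUE)
  diff.old <- Inf
  for (iter in seq_len(maxiter)) {
    X.old <- X
    # 2. Re-impute each variable from a forest trained on its observed part
    for (v in names(X)) {
      mis <- na.idx[[v]]
      if (length(mis) == 0L) next
      fit <- ranger(dependent.variable.name = v,
                    data = X[-mis, , drop = FALSE], num.trees = num.trees)
      X[[v]][mis] <- predict(fit, data = X[mis, , drop = FALSE])$predictions
    }
    # 3. Stop once the change between successive imputations increases,
    #    and return the previous (better) iteration
    diff.new <- sum((as.matrix(X) - as.matrix(X.old))^2) / sum(as.matrix(X)^2)
    if (diff.new > diff.old) return(X.old)
    diff.old <- diff.new
  }
  X
}
# e.g. simple_missforest(prodNA(iris[, 1:4], noNA = 0.1))
```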
missForest() now uses the ranger backend by default, for speed and multithreading; the legacy behaviour remains available via backend = "randomForest". The out-of-bag (OOB) error from the backend is transformed into an imputation error estimate, one for numeric variables (normalized root mean squared error, NRMSE) and one for factors (proportion of falsely classified entries, PFC). This estimate has been shown to be a good proxy of the true error [@stekhoven11].
# Installation

From CRAN:

```{r install, eval=FALSE}
install.packages("missForest", dependencies = TRUE)
```
The ranger backend is used automatically if it is installed; the legacy randomForest backend can be selected explicitly instead.
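For instance (a non-evaluated sketch; your.data is a placeholder for your own data frame):

```{r backend-choice, eval=FALSE}
imp.fast   <- missForest(your.data, backend = "ranger")        # the default
imp.legacy <- missForest(your.data, backend = "randomForest")  # legacy backend
```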
# Using missForest

We'll start with a small walk-through on iris, sprinkle in performance hints, and then get fancy with parallelization for big jobs.
We use the following datasets:

- iris, via data(iris) [@anderson35],
- esoph, via data(esoph) [@breslow80],
- a larger benchmark dataset whose chunks are set to eval = FALSE during the CRAN build; see @UCI10 for details.

## missForest in a nutshell

Load the package:
```{r load}
library(missForest)
```
Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):
```{r make-missing}
set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
```
Impute:
```{r impute}
set.seed(81)
iris.imp <- missForest(iris.mis)  # default backend = "ranger"
```
The result is a list with:
- iris.imp$ximp – the imputed data matrix,
- iris.imp$OOBerror – the estimated imputation error(s).

A common gotcha (we've all done it): use iris.imp$ximp (not iris.imp) in subsequent analyses.
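For instance, extract the completed data before any downstream modeling (iris.comp is just an illustrative name):

```{r extract-ximp}
iris.comp <- iris.imp$ximp  # the completed data frame, ready for analysis
head(iris.comp)
```

The estimated imputation error, in turn, lives in OOBerror: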
```{r oob}
iris.imp$OOBerror
```
Because iris has both numeric and categorical variables, you see two numbers: NRMSE (numeric) and PFC (factors). Both are better when closer to 0.
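For reference, here is a small sketch of the two measures as defined in @stekhoven11. The helper names nrmse and pfc are ours; missForest evaluates these on the imputed entries, and the OOB estimates replace the (unknown) truth with out-of-bag predictions.

```{r error-measures}
# NRMSE: normalized root mean squared error, for numeric variables
nrmse <- function(x.imp, x.true) {
  sqrt(mean((x.true - x.imp)^2) / stats::var(x.true))
}
# PFC: proportion of falsely classified entries, for factor variables
pfc <- function(f.imp, f.true) mean(f.imp != f.true)
```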
If you prefer per-variable diagnostics (for that post-imputation feature selection debate), use variablewise = TRUE; the errors are then reported per variable (MSE for numeric variables, PFC for factors):
```{r variablewise}
imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror
```
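To read the per-variable output more comfortably, you can attach the variable names (this assumes the errors come back in column order, which is how the current implementation reports them):

```{r variablewise-table}
# Pair each per-variable error with its variable name (assumes column order)
setNames(imp_var$OOBerror, names(iris.mis))
```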
## verbose = TRUE

Want to watch it think? Switch on diagnostics:
```{r verbose}
set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
imp_verbose$OOBerror
```
You'll see the estimated error(s), the difference(s) between consecutive iterations, and the time per iteration. When the difference starts to increase for each variable type present, the algorithm stops and returns the previous iteration's imputation.
## maxiter

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can bound the run time with maxiter, or deliberately pick an earlier iteration.
```{r maxiter}
set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
esoph.imp$OOBerror
```
## ntree and mtry

- ntree scales linearly with computation time. It defaults to 100; values in the tens often work well.
- mtry = floor(sqrt(p)) is a robust default, but tuning can pay off on complex data.

Demonstration on a bigger matrix (timings only):
```{r musk, eval=FALSE}
# musk <- ...  # (not fetched during CRAN build)
musk.mis <- prodNA(musk, noNA = 0.05)
missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)
```
As you might guess, fewer trees → fewer minutes, at a modest cost in error.
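If you want to see the trade-off yourself, a quick (and admittedly crude) timing on the small iris example works fine; absolute numbers will vary by machine, but the roughly linear trend in ntree should show:

```{r ntree-timing}
system.time(missForest(iris.mis, ntree = 100))
system.time(missForest(iris.mis, ntree = 20))
```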
## replace = FALSE

If you set replace = FALSE, sampling is done without replacement using about 0.632 * n observations per tree (roughly 1 - exp(-1), the expected fraction of distinct observations in a bootstrap sample; using all n would leave no OOB observations). Sometimes it helps, sometimes not:
```{r no-replace}
set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
imp_sub$OOBerror
```
## classwt, cutoff, strata, and sampsize

These let you focus the classifier for factor variables (and the sampling for both types). Each argument is a list with one entry per variable (use NULL or 1 where not applicable).

A quick note on backends: for the cutoff example we explicitly use the legacy randomForest backend. The default ranger backend handles cutoffs by fitting a probability forest and then post-thresholding, but its predict() method requires passing the training data for non-quantile prediction, and a faithful OOB probability-based estimate is more involved than a vignette should be. The randomForest backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.
```{r classwt-etc}
# Per-variable sample sizes: numeric variables take a single integer;
# factors need one value per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)

# Per-class cutoffs (factors only); with the ranger backend, cutoffs are
# emulated via probability forests, hence the legacy backend here
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")

# Class weights (factors only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)
```
## nodesize and maxnodes

nodesize is a vector of length two: the first entry applies to numeric variables, the second to factors.

- Our package defaults are c(5, 1) (yes: numeric = 5, factor = 1).
- With backend = "ranger", nodesize maps to min.bucket; maxnodes is ignored (consider ranger's max.depth if needed).
- With backend = "randomForest", both behave as in randomForest.

```{r nodesize}
imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
```
## xtrue and mixError

If you have a ground truth (or simulate one), supply xtrue to log the true error per iteration. The return value then includes $error.
```{r xtrue}
set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_bench$error

# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual
```
## parallelize and num.threads

We offer two modes:

- parallelize = "variables": different variables are imputed in parallel using a registered foreach backend. To avoid nested oversubscription, per-variable ranger calls use num.threads = 1 internally.
- parallelize = "forests": a single variable's forest is built with ranger multithreading (set num.threads) or, with randomForest, by combining sub-forests via foreach.
Register a backend first (example with doParallel):
```{r parallel}
library(doParallel)
registerDoParallel(2)

# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)

# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
                       num.threads = 2)
```
Which one is faster? It depends on your data and machine. Try both when in doubt (and coffee is brewing).
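A crude but honest way to decide is to time both modes on your own data; this sketch reuses the backend registered above and frees the workers afterwards:

```{r mode-timing}
t.vars <- system.time(missForest(iris.mis, parallelize = "variables"))
t.fors <- system.time(missForest(iris.mis, parallelize = "forests",
                                 num.threads = 2))
rbind(variables = t.vars, forests = t.fors)[, "elapsed"]

stopImplicitCluster()  # release the workers registered above
```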
# Concluding remarks

Imputation with missForest is straightforward, and the OOB error estimates help you judge quality at a glance. Do remember: imputation does not add information; it helps retain partially observed rows for downstream analyses that prefer complete cases. For broader perspectives, see @schafer97 and @little87.
# Acknowledgements

We thank Steve Weston for contributions regarding parallel computation ideas and tools in the R ecosystem.

# References {-}