Using missRanger
In missRanger: Fast Imputation of Missing Values

Overview

{missRanger} is a multivariate imputation algorithm based on random forests. It is a fast alternative to the beautiful 'MissForest' algorithm of @stekhoven, and uses the {ranger} package [@wright] to fit the random forests.

The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration.
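The stopping rule described above can be sketched in a few lines of base R. This is only an illustration of the idea, not the package's code: `best_iteration()` is a hypothetical helper, and the error values are made-up numbers standing in for the average OOB error measured after each round of forests.

```r
# Hypothetical helper: given the average OOB error per iteration,
# return the index of the last iteration that still improved.
best_iteration <- function(errors) {
  best <- 1L
  for (i in seq_along(errors)[-1]) {
    # No improvement over the best iteration so far: stop iterating
    if (errors[i] >= errors[best]) break
    best <- i
  }
  best
}

# Made-up error sequence: the fourth round is worse than the third,
# so the loop stops there and the third iteration is kept.
best <- best_iteration(c(0.40, 0.31, 0.29, 0.30, 0.25))
best
```

Note that the sketch never looks at the fifth value: once an iteration fails to improve, later rounds are not fitted at all.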

{missRanger} is fast.
It allows for out-of-sample applications.
It is intuitive: for example, calling missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.
It works for a variety of data types.
It combines random forest imputation with predictive mean matching. This avoids "new" values like 0.3334 in a 0-1 coded variable and helps to raise the variance of the imputations, which is especially important for multiple imputation (see additional vignettes).

Installation

# From CRAN
install.packages("missRanger")

# Development version
devtools::install_github("mayer79/missRanger")

Usage

library(missRanger)

set.seed(3)

iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)

imp <- missRanger(iris_NA, num.trees = 100)
head(imp)
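As a quick sanity check, one might count the missing values per column before imputation and confirm that none remain afterwards. The following is a minimal sketch; the object names `na_before` and `imp_check` are chosen here for illustration only.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Number of missing values per column before imputation
na_before <- colSums(is.na(iris_NA))
na_before

# Impute, then confirm that no missing values remain
imp_check <- missRanger(iris_NA, num.trees = 100, verbose = 0)
anyNA(imp_check)

# Inspect the column classes of the imputed data
sapply(imp_check, class)
```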

Predictive mean matching

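It worked, but the imputed values appear overly exact: being forest predictions, they can carry more decimal places than the observed measurements. To avoid this, we can add predictive mean matching (PMM) to the OOB predictions, so that each imputed value is borrowed from an actually observed one: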

imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)
head(imp)
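To make the idea behind PMM concrete, here is a minimal base R sketch. It is not the package's implementation: `pmm_sketch()` is a hypothetical helper, and the predictions are made-up numbers. For each prediction of a missing value, it looks up the k observed cases with the closest predictions and borrows the observed value of one of them at random.

```r
# Hypothetical helper illustrating predictive mean matching.
# pred_miss: model predictions for the rows with a missing value
# pred_obs:  model predictions for the rows with an observed value
# y_obs:     the observed values themselves
pmm_sketch <- function(pred_miss, pred_obs, y_obs, k = 5) {
  vapply(pred_miss, function(p) {
    # Indices of the k observed rows whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(min(k, length(y_obs)))]
    # Draw one donor at random and borrow its observed value
    y_obs[nearest[sample.int(length(nearest), 1)]]
  }, FUN.VALUE = y_obs[1])
}

set.seed(1)
y_obs <- c(0, 0, 1, 1, 0, 1, 1, 0)                      # a 0-1 coded variable
pred_obs <- c(0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.6, 0.4)   # made-up predictions
pred_miss <- c(0.3334, 0.75)                            # raw predictions are "new" values

imputed <- pmm_sketch(pred_miss, pred_obs, y_obs, k = 3)
imputed
```

Whatever donors are drawn, the result can only contain values that occur in `y_obs`, which is why PMM never produces a value like 0.3334 in a 0-1 coded variable.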

Controlling the random forests

missRanger() offers many options to control the underlying random forests. For example, to use one feature per split (mtry = 1) with 200 trees:

imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 200, mtry = 1, verbose = 0)

Extended output

Setting data_only = FALSE (or keep_forests = TRUE) returns a "missRanger" object. With keep_forests = TRUE, this allows for out-of-sample applications:

imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)
imp

summary(imp)

# Out-of-sample application
# saveRDS(imp, file = "imputation_model.rds")
# imp <- readRDS("imputation_model.rds")
predict(imp, head(iris_NA))
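The out-of-sample step can also be applied to rows that were not part of the training data. The sketch below assumes two made-up rows (`new_rows` is an illustrative name) whose columns and factor levels match the data used for fitting.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Fit once and keep the forests so that the object can impute new rows later
imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)

# Two made-up rows with gaps; the factor uses the same levels as the training data
new_rows <- data.frame(
  Sepal.Length = c(5.1, NA),
  Sepal.Width = c(NA, 3.0),
  Petal.Length = c(1.4, 4.5),
  Petal.Width = c(0.2, NA),
  Species = factor(c(NA, "versicolor"), levels = levels(iris$Species))
)

filled <- predict(imp, new_rows)
filled
```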

Formulas

By default, missRanger() uses all columns to impute all columns with missings.

This can be modified by passing a formula: The left hand side specifies the variables to be imputed, while the right hand side lists the variables used for imputation.

# Impute all variables with all (default)
m <- missRanger(iris_NA, formula = . ~ ., pmm.k = 5, num.trees = 100, verbose = 0)

# Don't use Species for imputation
m <- missRanger(iris_NA, . ~ . - Species, pmm.k = 5, num.trees = 100, verbose = 0)

# Impute Sepal.Length by Species (or not?)
m <- missRanger(iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Only univariate imputation was done! Why? Because Species contains missing values
# itself and needs to appear on the LHS as well:
m <- missRanger(iris_NA, Sepal.Length + Species ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Impute all variables univariately
m <- missRanger(iris_NA, . ~ 1, verbose = 0)

Speeding things up

missRanger() fits a random forest per variable and iteration. Thus, imputation can take a long time. Some tweaks:

Use fewer trees, e.g., num.trees = 100.
Use a smaller tree depth, e.g., max.depth = 6.
Use large leaves, e.g., min.node.size = 100.
Use smaller bootstrap samples, e.g., sample.fraction = 0.2.
Use fewer iterations, e.g., maxiter = 3.

The first three items also help to greatly reduce the size of the models, which might become relevant in out-of-sample applications with keep_forests = TRUE.
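The effect of these tweaks can be checked with `system.time()` and `object.size()`. The sketch below compares a default-sized fit with a lighter one; the specific parameter values are illustrative choices for the small iris data, not recommendations, and timings will vary by machine.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Larger forests, kept in the returned object
system.time(
  big <- missRanger(iris_NA, num.trees = 500, keep_forests = TRUE, verbose = 0)
)

# Fewer, shallower trees with larger leaves and at most three iterations
system.time(
  small <- missRanger(
    iris_NA,
    num.trees = 50,
    max.depth = 6,
    min.node.size = 20,
    maxiter = 3,
    keep_forests = TRUE,
    verbose = 0
  )
)

# Compare the memory footprint of the two fitted objects
format(object.size(big), units = "Kb")
format(object.size(small), units = "Kb")
```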

Trick: Use `case.weights` to reduce impact of rows with many missings

Using the case.weights argument, you can pass case weights to the imputation models. For instance, this allows you to reduce the contribution of rows with many missings:

m <- missRanger(
  iris_NA,
  num.trees = 100,
  pmm.k = 5,
  case.weights = rowSums(!is.na(iris_NA))
)
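To see what such weights look like, the sketch below builds them for a copy of iris with randomly removed cells. It uses only base R; the loop that introduces the missing values is a stand-in for generateNA() and is an assumption of this example.

```r
set.seed(3)

# Knock out roughly 10% of the cells in each column (base R stand-in for generateNA())
iris_NA <- iris
for (v in names(iris_NA)) {
  iris_NA[[v]][runif(nrow(iris_NA)) < 0.1] <- NA
}

# One weight per row: the number of non-missing cells in that row
w <- rowSums(!is.na(iris_NA))

# Distribution of the weights; complete rows get the largest weight (5 columns)
table(w)
```

Rows with fewer observed cells receive smaller weights and therefore contribute less to each imputation model.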

References

missRanger documentation built on April 4, 2025, 2:01 a.m.