Using missRanger"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

Overview

The aim of this vignette is to introduce {missRanger} for imputation of missing values and to explain how to use it for multiple imputation.

{missRanger} uses the {ranger} package [@wright] to do fast missing value imputation by chained random forest. As such, it can be used as an alternative to {missForest}, a beautiful algorithm introduced in [@stekhoven]. Basically, each variable is imputed by predictions from a random forest using all other variables as covariables. The main function missRanger() iterates multiple times over all variables until the average out-of-bag prediction error of the models stops to improve.

Why should you consider {missRanger}?

In the examples below, we will meet two functions from {missRanger}:

Installation

# From CRAN
install.packages("missRanger")

# Development version
devtools::install_github("mayer79/missRanger")

Usage

We first generate a data set with about 20% missing values per column and fill them again by missRanger().

library(missRanger)

set.seed(84553)

head(iris)

# Generate data with missing values in all columns
irisWithNA <- generateNA(iris, p = 0.2)
head(irisWithNA)

# Impute missing values with missRanger
irisImputed <- missRanger(irisWithNA, num.trees = 100, verbose = 0)
head(irisImputed)

Predictive mean matching

It worked! Unfortunately, the new values look somewhat unnatural due to different rounding. If we would like to avoid this, we just set the pmm.k argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:

irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, verbose = 0)
head(irisImputed)

Controlling the random forests

missRanger() offers a ... argument to pass options to ranger(), e.g. num.trees or min.node.size. How would we use its "extremely randomized trees" variant with 50 trees?

irisImputed_et <- missRanger(
  irisWithNA, 
  pmm.k = 3, 
  splitrule = "extratrees", 
  num.trees = 50, 
  verbose = 0
)
head(irisImputed_et)

It is as simple!

Use in Pipe

{missRanger} also plays well together with the pipe:

iris |>
  generateNA() |>
  missRanger(verbose = 0) |>
  head()

Two output options

Since {missRanger} 2.4.0, setting data_only = FALSE allows to not just return the imputed data, but rather a "missRanger" object containing more information.

(imp <- missRanger(irisWithNA, data_only = FALSE, verbose = 0))

# Summary
summary(imp)

Formula interface

By default missRanger() uses all columns in the data set to impute all columns with missings. To override this behaviour, you can use an intuitive formula interface: The left hand side specifies the variables to be imputed (variable names separated by a +), while the right hand side lists the variables used for imputation.

# Impute all variables with all (default behaviour). Note that variables without
# missing values will be skipped from the left hand side of the formula.
m <- missRanger(
  irisWithNA, formula = . ~ ., pmm.k = 3, num.trees = 10, seed = 1, verbose = 0
)
head(m)

# Same
m <- missRanger(irisWithNA, pmm.k = 3, num.trees = 10, seed = 1, verbose = 0)
head(m)

# Impute all variables with all except Species
m <- missRanger(irisWithNA, . ~ . - Species, pmm.k = 3, num.trees = 10, verbose = 0)
head(m)

# Impute Sepal.Width by Species 
m <- missRanger(
  irisWithNA, Sepal.Width ~ Species, pmm.k = 3, num.trees = 10, verbose = 0
)
head(m)

# No success. Why? Species contains missing values and thus can only 
# be used for imputation if it is being imputed as well
m <- missRanger(
  irisWithNA, Sepal.Width + Species ~ Species, pmm.k = 3, num.trees = 10, verbose = 0
)
head(m)

# Impute all variables univariatly
m <- missRanger(irisWithNA, . ~ 1, verbose = 0)
head(m)

Imputation takes too much time. What can I do?

missRanger() is based on iteratively fitting random forests for each variable with missing values. Since the underlying random forest implementation ranger() uses 500 trees per default, a huge number of trees might be calculated. For larger data sets, the overall process can take very long.

Here are tweaks to make things faster:

Evaluated on a normal laptop:

library(ggplot2) # for diamonds data
dim(diamonds) # 53940    10

diamonds_with_NA <- generateNA(diamonds)

# Takes 270 seconds (10 * 500 trees per iteration!)
system.time(
  m <- missRanger(diamonds_with_NA, pmm.k = 3)
)

# Takes 19 seconds
system.time(
  m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50)
)

# Takes 6 seconds
system.time(
  m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 1)
)

# Takes 9 seconds
system.time(
  m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50, sample.fraction = 0.1)
)

Trick: Use case.weights to weight down contribution of rows with many missings

Using the case.weights argument, you can pass case weights to the imputation models. This might be useful to weight down the contribution of rows with many missings.

# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))
table(non_miss)

# No weighting
m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, verbose = 0)
head(m)

# Weighted by number of non-missing values per row. 
m <- missRanger(
  irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, verbose = 0, case.weights = non_miss
)
head(m)

References



Try the missRanger package in your browser

Any scripts or data that you put into this service are public.

missRanger documentation built on Nov. 19, 2023, 5:14 p.m.