
# smirf: Single or multiple imputation of missing data using random forests

Stephen Wade

smirf is an implementation of Stekhoven and Bühlmann's (2012) missForest algorithm: an iterative procedure for imputing missing data of mixed type by fitting random forests to the observed data. It uses the fast implementation of random forests supplied by ranger (Wright and Ziegler, 2017).

## Example

```r
require(smirf)

# Add some missing values completely at random to iris
set.seed(1)

prop_missing <- 0.2
data_ <- iris
n_prod_m <- prod(dim(data_))
data_[arrayInd(sample.int(n_prod_m, size=n_prod_m * prop_missing),
               .dim=dim(data_))] <- NA

# Impute missing data - here using the original missForest stopping criterion
res <- smirf(data_, stop.measure=measure_stekhoven_2012)
```

## Installation

Installation is easy using devtools:

```r
library(devtools)
install_github('stephematician/smirf')
```

The ranger and rlang packages are also required; both are available from CRAN:

```r
install.packages(c('ranger', 'rlang'))
```

## Alternatives

For those looking for an alternative to this package:

1. missRanger - uses the same underlying forest training package (ranger) as this package.
2. missForest - the original missForest package.

## Details

The original missForest trains and predicts its forests via randomForest, whereas this implementation uses ranger; there are a handful of other differences between the two implementations.

In addition, the iterative procedure can optionally be modified, for example to draw imputed values from individual trees, to update the data in a Gibbs-sampling fashion, or to start from a random initial state.
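
For intuition, the shared base iteration looks roughly like the sketch below, written directly against ranger. This is an illustration of the idea only, not the internals of smirf; the helper `impute_sketch` and its arguments are invented for the example.

```r
library(ranger)

# Rough sketch of a missForest-style iteration (illustration only):
# initialise the missing entries, then repeatedly re-fit a forest for each
# incomplete variable and overwrite its missing entries with predictions.
impute_sketch <- function(data, n_iter = 5) {
    miss <- is.na(data)
    # crude initialisation: column mean for numeric, most frequent level otherwise
    for (j in seq_along(data)) {
        if (!any(miss[, j])) next
        if (is.numeric(data[[j]])) {
            fill <- mean(data[[j]], na.rm = TRUE)
        } else {
            fill <- names(which.max(table(data[[j]])))
        }
        data[miss[, j], j] <- fill
    }
    for (iter in seq_len(n_iter)) {
        # visit incomplete variables, least missing first
        for (j in order(colSums(miss))) {
            if (!any(miss[, j])) next
            fit <- ranger(dependent.variable.name = names(data)[j],
                          data = data[!miss[, j], , drop = FALSE])
            pred <- predict(fit, data = data[miss[, j], , drop = FALSE])
            data[miss[, j], j] <- pred$predictions
        }
        # a stopping criterion would normally be checked here rather than
        # running a fixed number of iterations
    }
    data
}
```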

Combining tree-sampled missing values, Gibbs sampling and a random initial state resembles the implementation of Multiple Imputation via Chained Equations (MICE) using random forests proposed by Doove et al. (2014), with one difference: the prediction from each tree is the mean of its terminal node, rather than a value sampled from the training data belonging to that terminal node.
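
To illustrate that distinction (again, not smirf's internals), the snippet below obtains per-tree predictions from ranger, which are terminal-node means, and contrasts them with drawing a donor observation from the same terminal node in the style of Doove et al. (2014); iris and Sepal.Length are stand-ins here.

```r
library(ranger)

fit   <- ranger(Sepal.Length ~ ., data = iris, num.trees = 10)
new_x <- iris[1:3, ]

# (a) per-tree predictions, i.e. each tree's terminal-node mean;
# tree-sampled imputations would be drawn from the columns of this matrix
tree_means <- predict(fit, data = new_x, predict.all = TRUE)$predictions

# (b) donor sampling: locate the terminal node of each row in each tree,
# then draw one training observation from the matching node
nodes_new   <- predict(fit, data = new_x, type = "terminalNodes")$predictions
nodes_train <- predict(fit, data = iris,  type = "terminalNodes")$predictions

tree_id    <- sample.int(fit$num.trees, 1)
donor_pool <- which(nodes_train[, tree_id] == nodes_new[1, tree_id])
donor      <- iris$Sepal.Length[donor_pool[sample.int(length(donor_pool), 1)]]
```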

### Stopping criterion

The original stopping criterion (Stekhoven and Bühlmann, 2012) is not location and scale invariant. As an experiment, it has been replaced by a correlation-based calculation by default. The user may specify their own stopping criterion, and for this purpose Stekhoven and Bühlmann's (2012) criterion is included as an example (`measure_stekhoven_2012`).

By default, at each iteration the rank correlation between the current and the previous imputed values is estimated for ordered/continuous data, and the proportion of stationary (unchanged) values is calculated for unordered categorical data. When both the mean correlation and the proportion of stationary values decrease, the imputation procedure is considered to have converged.
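
A hedged sketch of that measure, assuming `previous` and `current` are two successive completed copies of the data (this mirrors the description above rather than the package's actual code):

```r
convergence_measure <- function(previous, current) {
    ordered_cols <- vapply(current,
                           function(x) is.numeric(x) || is.ordered(x),
                           logical(1))
    # mean rank (Spearman) correlation between successive completed data sets
    mean_cor <- mean(mapply(
        function(a, b) cor(as.numeric(a), as.numeric(b), method = "spearman"),
        previous[ordered_cols], current[ordered_cols]
    ))
    # proportion of categorical values left unchanged between iterations
    prop_stationary <- mean(mapply(
        function(a, b) mean(a == b),
        previous[!ordered_cols], current[!ordered_cols]
    ))
    c(correlation = mean_cor, stationary = prop_stationary)
}
# iteration would stop once both quantities decrease from one iteration
# to the next
```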

Author's note and lament:

I am speculating that this criterion identifies when the unexplained variation of the random forest model of the (completed) data begins to dominate, and that some notion of 'entropy' (possibly an incorrect use of the term) has been optimised. It still seems unpleasant to have incomparable measures for the different types of data; I hope that other mathematicians, more savvy than I, can investigate this.

## To-do

Not exhaustive:

## References

Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in electronic health record data', presented at the 27th International Biometric Conference, Florence, July 6-11.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025

Stekhoven, D.J. and Bühlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597

Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01


