Stephen Wade
smirf is an implementation of Stekhoven and Bühlmann's (2012) missForest algorithm: an iterative procedure for the imputation of missing data of mixed type via fitting random forests to the observed data. It uses a fast implementation of random forests supplied by ranger (Wright and Ziegler, 2017).
require(smirf)
# Add some missing values completely at random to iris
set.seed(1)
prop_missing <- 0.2
data_ <- iris
n_prod_m <- prod(dim(data_))
data_[arrayInd(sample.int(n_prod_m, size=n_prod_m * prop_missing),
               .dim=dim(data_))] <- NA
# Impute missing data - here using the original missForest stopping criterion
res <- smirf(data_, stop.measure=measure_stekhoven_2012)
Installation is easy using devtools:
library(devtools)
install_github('stephematician/smirf')
The ranger and rlang libraries are also required; both are available via CRAN:
install.packages(c('ranger', 'rlang'))
For those looking for an alternative to this package:

- missRanger - uses the same underlying forest training package (ranger) as here.
- missForest - the original missForest package.

The original missForest trains and predicts the forests via randomForest. The key differences between this implementation and the original are that:

- the forests are trained via ranger (Wright and Ziegler, 2017), which is optimised for training on high dimensional data;
- … (mice), and;

In addition, the iterative procedure can be (optionally) modified:
Combining tree-sampled missing values, Gibbs sampling, and a random initial state resembles the implementation of Multiple Imputation via Chained Equations (MICE) using random forests proposed by Doove et al. (2014), with one difference: the prediction from each tree is the mean of the terminal node rather than a sample of the training data belonging to that node.
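The distinction can be seen with a toy example. Given the training responses that fall in a single terminal node, the two prediction rules are (an illustration only, not smirf's internals):

```r
# Responses of the training observations that landed in one terminal node
node_responses <- c(4.9, 5.1, 5.4, 5.0)

# This implementation (and missForest): predict the terminal-node mean
pred_mean <- mean(node_responses)

# Doove et al. (2014): sample one of the node's training responses
set.seed(1)
pred_sample <- sample(node_responses, 1)
```

Sampling a training response preserves the spread of the observed data, while the mean shrinks each prediction toward the centre of its node.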
The original stopping criterion (Stekhoven and Bühlmann, 2012) is not location and scale invariant. As an experiment, it has been replaced by a correlation based calculation by default. The user may specify their own stopping criterion, and for this purpose Stekhoven and Bühlmann's (2012) criterion is included as an example.
By default, at each iteration the (rank) correlation between the current and previous imputed data is estimated for ordered/continuous data, and the proportion of stationary (unchanged) values is calculated for categorical data. When both the mean correlation and the proportion of stationary values decrease, the imputation procedure is deemed to have converged.
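The default measures described above might be computed along the following lines. This is an assumed sketch; smirf's internal code, and the names used here, may differ.

```r
# Assumed sketch of the default convergence measures: mean rank correlation
# between successive iterations for numeric columns, and the mean proportion
# of unchanged values for the remaining (categorical) columns.
convergence_measures <- function(current, previous) {
    is_num <- vapply(current, is.numeric, logical(1))
    mean_cor <- mean(mapply(
        function(x, y) cor(x, y, method='spearman'),
        current[is_num], previous[is_num]
    ))
    prop_stationary <- mean(mapply(
        function(x, y) mean(x == y),
        current[!is_num], previous[!is_num]
    ))
    c(mean_correlation=mean_cor, prop_stationary=prop_stationary)
}
```

Iteration would then stop once both quantities decrease relative to the previous iteration.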
Author's note and lament:
I am speculating that this criterion identifies when the unexplained variation of the random forest model of the (complete) data is dominating, and that 'entropy' (possibly incorrect use of this term) has been optimised. It still seems unpleasant to have incomparable measures for the different types of data. I hope that other mathematicians, more savvy than I, can investigate this.
Not exhaustive:

- … missForest;
- … missForest;
- … missForest, and;
- … missRanger.
.Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in electronic health record data', presented to 27th International Biometric Conference, Florence, July 6-11.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025
Stekhoven, D.J. and Bühlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597
Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(i01), pp. 1-17. doi:10.18637/jss.v077.i01