perform_missforest: Perform missForest iteration

View source: R/perform_missforest.R

Description

Perform the missForest (Stekhoven and Buehlmann, 2012) iterative procedure to impute missing data using random forests. Forests are trained with ranger (Wright and Ziegler, 2017), a fast implementation of the random forest algorithm. Some key alterations to the missForest algorithm may be specified by the user.

Usage

perform_missforest(X_init, model, indicator, ranger_call, gibbs = F,
  tree.imp = F, boot.train = F, obs.only = T,
  stop.measure = measure_correlation, loop.limit = 10L,
  overrides = list(), clean.step = list())

Arguments

X_init

data.frame; a data set including any of numeric, logical, integer, factor and ordered data types, to be used as the initial state of the missForest procedure.

model

matrix; logical matrix indicating inclusion of a predictor (named column) in the model of an imputed variable (named row). The order of imputation is the row order. The default is a matrix of ones with a row for each partially, but not completely, missing variable (ordered from least to most missing) and a column for every partially complete variable.
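The default layout described above can be sketched in base R. This is only an illustration, not the package's own code: the toy data frame `X` and the helper names are hypothetical, and "partially complete" is assumed here to mean "has at least one observed value".

```r
# Toy data with missing values (hypothetical example data).
X <- data.frame(
  a = c(1, NA, 3, 4),
  b = factor(c("x", "y", NA, NA)),
  c = c(TRUE, FALSE, TRUE, TRUE),
  d = c(NA_real_, NA_real_, NA_real_, NA_real_)
)

n_missing <- colSums(is.na(X))

# Rows: variables that are partially, but not completely, missing,
# ordered from least to most missing.
to_impute <- names(sort(n_missing[n_missing > 0 & n_missing < nrow(X)]))

# Columns: every variable with at least one observed value
# (assumed reading of "partially complete").
predictors <- names(n_missing)[n_missing < nrow(X)]

# Logical matrix of TRUE ("ones"); rows are imputed variables in
# imputation order, columns are candidate predictors.
model <- matrix(TRUE,
                nrow = length(to_impute), ncol = length(predictors),
                dimnames = list(to_impute, predictors))
```

Setting an entry of `model` to `FALSE` would then exclude that column as a predictor for that row's variable.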

indicator

named list; an indicator of the missing (=T) or not-missing (=F) status of the columns of X_init.
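As a minimal sketch, a list of this shape can be derived from the data before any initial values are filled in. The data frame `X_raw` is hypothetical; only base R is used.

```r
# Hypothetical raw data, before missing values are given initial values.
X_raw <- data.frame(a = c(1, NA, 3), b = factor(c("x", NA, NA)))

# One logical vector per column: TRUE where the value is missing.
indicator <- lapply(X_raw, is.na)
```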

ranger_call

call; skeleton call to ranger for fitting random forests during the missForest iterative procedure. Arguments can be overridden on a per-variable basis by overrides.

gibbs

logical; use Gibbs sampling in training steps (T) rather than the predictions from the previous iteration (default).

tree.imp

logical; use a prediction of missing data from a single tree in the forest when training (T) rather than the bagged predicted value (default).

boot.train

logical; train each forest on a bootstrap sample of the observed data when T, rather than the observed data (default).

obs.only

logical; train on only observed outcomes (T, the default) or use all data including predicted/sampled values of missing outcomes (F).

stop.measure

function; evaluates the difference or relationship between the two most recently completed data sets during iteration, must accept the following arguments;

X

named list with imputed values (in order of appearance by row) for each column in the data set;

Y

named list with imputed values (in order of appearance by row) for each column in the data set;

X_init

the original (mixed-type) data set with missing values replaced as at the starting point of missForest;

indicator

a list with the missing (=T) or not missing (=F) status of the original data set;

and should return a numeric (vector). The default, measure_correlation, serves as an example; the original measure proposed by Stekhoven and Buehlmann (2012) is available as measure_stekhoven_2012.
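A custom measure only needs to match the interface above. The following is a hedged sketch, not one of the package's measures: `measure_agreement` is a hypothetical name, and it is an assumption that larger values should mean closer agreement between iterations (how the returned vector is compared across iterations is decided by stop_condition).

```r
# Hypothetical stop measure: agreement between two sets of imputed values.
# X and Y are named lists of imputed values, one item per column.
# X_init and indicator are accepted to match the required signature
# but are not used in this simple sketch.
measure_agreement <- function(X, Y, X_init, indicator) {
  agree <- function(x, y) {
    if (is.numeric(x)) {
      -mean(abs(x - y))                          # closer to 0 = more similar
    } else {
      mean(as.character(x) == as.character(y))   # proportion unchanged
    }
  }
  per_column <- mapply(agree, X, Y[names(X)])
  is_num <- vapply(X, is.numeric, logical(1))
  # NaN is returned for a component when no column of that kind exists.
  c(numeric = mean(per_column[is_num]),
    categorical = mean(per_column[!is_num]))
}

# Example with two small hypothetical sets of imputed values.
X <- list(a = c(1, 2), b = factor(c("x", "y")))
Y <- list(a = c(1, 4), b = factor(c("x", "x")))
m <- measure_agreement(X, Y, X_init = NULL, indicator = NULL)
```

Such a function would be supplied as `stop.measure = measure_agreement`.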

loop.limit

numeric; maximum number of iterations within missForest procedure.

overrides

named list; (variable-wise) overrides for arguments passed to ranger when training on the response variable given by the name of the item.
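As an illustration only, such a list might look like the following. The column names `age` and `smoker` are hypothetical, and it is an assumption that each item is itself a named list of ranger arguments; `num.trees` and `mtry` are existing arguments of ranger.

```r
# Hypothetical per-variable overrides of ranger arguments.
# Assumption: each item is a named list of arguments for that response.
overrides <- list(
  age    = list(num.trees = 500L),            # more trees when imputing 'age'
  smoker = list(num.trees = 200L, mtry = 2L)  # fewer trees, fixed mtry
)
```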

clean.step

named list; each item is a function to clean or post-process the named imputed data immediately after it is imputed, taking two arguments;

  • the subset of the data used in the current training step which had missing values of the named data,

  • the most recently imputed values of the named data,

and should return (post-processed) data of the same length and type as the second argument.
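For example, a clean step might clamp imputed values to a plausible range. This is a sketch under assumptions: the column name `age` and the bounds are hypothetical, and the first argument is accepted but unused here.

```r
# Hypothetical clean step for a numeric column named 'age'.
# 'data' is the subset of training data with missing 'age' (unused here);
# 'imputed' holds the most recently imputed values of 'age'.
clean.step <- list(
  age = function(data, imputed) {
    # Clamp to [0, 120]; the result has the same length as 'imputed'
    # and stays double when 'imputed' is double.
    pmin(pmax(imputed, 0), 120)
  }
)

cleaned <- clean.step$age(NULL, c(-1.5, 50, 130))
```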

Details

For a full description of the missForest algorithm, see Stekhoven and Buehlmann (2012). In brief, at each iteration missing values are imputed for each variable (in the order of rownames(model)) by the predictions of a random forest trained on the observed cases of that variable, with the completed data set of the previous iteration supplying the predictor values. This is repeated until some measure of the relationship between iterations indicates convergence, usually by decreasing relative to the measure at the previous iteration.

Numeric data is treated as continuous and predicted by regression forests, while factors are predicted via classification forests. When called from smirf, only numeric (non-integer), factor, and ordered data are present (integer and logical types having been converted to factors).

The key modifications to the procedure are governed by the following arguments:

gibbs

use the most recent predictions for each variable in training and prediction as they become available, like a Gibbs sampler, by setting this to T (default is F);

obs.only

train on all rows in the data set instead of observed only by setting this to F (default is T), and;

tree.imp

predict using a randomly selected tree for each missing value rather than use the whole-of-forest aggregated prediction by setting this to T (default is F).

Collectively, these three changes make the procedure similar to the Multivariate Imputation by Chained Equations of van Buuren and Groothuis-Oudshoorn (2011).

The convergence criterion can be modified by the stop.measure argument. The default is to measure the mean rank correlation between iterations of the ordered data and the stationary rate of the categorical data (see measure_correlation). The procedure halts when both of these values are less than or equal to the previous values (see stop_condition). The original Stekhoven and Buehlmann (2012) measure is provided by the measure_stekhoven_2012 function.

Value

named list; results of the iterative procedure given as;

converged

logical; indicator of convergence;

oob_error

data.frame; variable-wise out-of-bag error at each iteration described by columns;

iteration

numeric.

variable

factor; name of column in data set.

measure

factor; one of mse (mean square error) for non-integer numeric data or pfc (proportion falsely classified).

value

numeric; out of bag error.

stop_measures

list; containing the value returned by stop.measure at each iteration.

imputed

list; each item is a named list of imputed values at each iteration, in order of appearance in X_init.

References

Stekhoven, D.J. and Buehlmann, P., 2012. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597

Van Buuren, S. and Groothuis-Oudshoorn, K., 2011. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), pp. 1-67. doi:10.18637/jss.v045.i03

Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01

See Also

measure_correlation, measure_stekhoven_2012, missForest, ranger, stop_condition


stephematician/miForang documentation built on July 23, 2019, 5:11 p.m.