View source: R/perform_missforest.R
Perform the missForest (Stekhoven and Buehlmann, 2012) iterative procedure to impute missing data using random forests. Forests are trained with ranger (Wright and Ziegler, 2017), a fast implementation of the random forest algorithm. The user may specify some key alterations to the missForest algorithm.
perform_missforest(X_init, model, indicator, ranger_call, gibbs = F,
  tree.imp = F, boot.train = F, obs.only = T,
  stop.measure = measure_correlation, loop.limit = 10L,
  overrides = list(), clean.step = list())
X_init
data.frame; a data set including any of the numeric, logical, integer, factor and ordered data types, to be used as the initial state of the missForest procedure.
model
matrix; logical matrix indicating whether a predictor (named column) is included in the model of an imputed variable (named row). Variables are imputed in row order. The default is a matrix of ones with a row for each partially, but not completely, missing variable (ordered from least to most missing) and a column for every partially complete variable.
indicator
named list;
an indicator of the missing (
ranger_call
call;
skeleton call to
gibbs
logical;
use Gibbs sampling in training steps (
tree.imp
logical;
use a prediction of missing data from a single tree in the forest
when training (
boot.train
logical;
train each forest on a bootstrap sample of the observed data
when
obs.only
logical;
train on only observed outcomes (default) or use all data
including predicted/sampled values of missing outcomes (
stop.measure
function; evaluates the difference or relationship between the two most recently completed data sets during iteration. It must accept the following arguments;
and should return a numeric (vector), the default
loop.limit
numeric; maximum number of iterations within the missForest procedure.
overrides
named list;
(variable-wise) overrides for arguments passed to
clean.step
named list; each item is a function to clean or post-process the named imputed data immediately after it is imputed, taking two arguments;
and should return (post-processed) data of the same length and type as the second argument.
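The default layout described for the model argument can be illustrated with a small base R sketch. The helper name default_model_sketch is hypothetical and is not part of the package. It also assumes that "partially complete" means "not entirely missing", and it leaves each variable's own column set to TRUE because the text does not say how self-prediction is handled.

```r
# Hypothetical helper (not part of the package): builds an inclusion matrix
# shaped like the default described for the `model` argument.
default_model_sketch <- function(X) {
  frac_missing <- colMeans(is.na(X))
  # Rows: variables with some, but not all, values missing, least missing first.
  partly_missing <- frac_missing[frac_missing > 0 & frac_missing < 1]
  targets <- names(sort(partly_missing))
  # Columns: assumed to be every variable with at least one observed value.
  predictors <- names(frac_missing)[frac_missing < 1]
  matrix(TRUE, nrow = length(targets), ncol = length(predictors),
         dimnames = list(targets, predictors))
}

# Illustrative data: `d` is entirely missing, `c` is fully observed.
X <- data.frame(a = c(1, NA, 3, 4),
                b = c(NA, NA, 2, 1),
                c = c(1, 2, 3, 4),
                d = rep(NA_real_, 4))
m <- default_model_sketch(X)
print(m)
```

Setting an entry of such a matrix to FALSE would drop that column's variable from the model for that row's variable, and reordering the rows would change the order of imputation.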
For a full description of the missForest algorithm, see Stekhoven and
Buehlmann (2012). In brief, at each iteration missing values are imputed for
each variable (in the order of rownames(model)) by the predictions of
a random forest trained on the observed cases of that variable along with the
completed data set of the previous iteration as the value of the predictors.
This is repeated until some measure of the relationship between iterations
indicates convergence, usually by decreasing from the measure at the
previous iteration.
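The loop described above can be sketched in base R as follows. This is not the package's implementation: to stay dependency-free it swaps the random forest for a linear model (lm), handles numeric columns only, starts from column means, and uses a simple squared-change measure in place of stop.measure. The function name impute_iteratively and all of its arguments are hypothetical.

```r
# Sketch of the iterative scheme with a stand-in learner (lm, NOT a random forest).
# Assumes an all-numeric data frame; names here are illustrative only.
impute_iteratively <- function(X, max_iter = 10L) {
  miss <- is.na(X)                        # missingness indicator matrix
  n_miss <- colSums(miss)
  order_vars <- names(sort(n_miss[n_miss > 0]))  # least to most missing
  # Initial state: fill each gap with the mean of the observed values.
  X_prev <- X
  for (v in order_vars) X_prev[miss[, v], v] <- mean(X[[v]], na.rm = TRUE)
  prev_change <- Inf
  for (iter in seq_len(max_iter)) {
    X_new <- X_prev
    for (v in order_vars) {
      obs <- !miss[, v]
      # Train on the observed cases of v; predictor values come from the
      # completed data set of the previous iteration.
      fit <- lm(reformulate(setdiff(names(X), v), response = v),
                data = X_prev[obs, , drop = FALSE])
      X_new[!obs, v] <- predict(fit, newdata = X_prev[!obs, , drop = FALSE])
    }
    # Stand-in stopping measure: total squared change between iterations.
    change <- sum((as.matrix(X_new) - as.matrix(X_prev))^2)
    X_prev <- X_new
    if (change >= prev_change) break      # change has stopped shrinking
    prev_change <- change
  }
  list(data = X_prev, iterations = iter)
}

set.seed(1)
n <- 50
x1 <- rnorm(n)
X <- data.frame(x1 = x1, x2 = 2 * x1 + rnorm(n, sd = 0.1), x3 = rnorm(n))
X$x1[1:5] <- NA
X$x2[6:12] <- NA
res <- impute_iteratively(X)
```

Only the missing cells are ever overwritten, so the observed values in res$data match those in X.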
Numeric data is treated as continuous and predicted by regression forests
while factors are predicted via classification forests. When called from
smirf, only numeric (non-integer), factor and ordered data are
present (integer and logical types having been converted to factors).
The key modifications to the procedure are governed by the following arguments:
gibbs
set to T (default is F) to use the most recent predictions for each variable in training and prediction as they become available, like a Gibbs sampler;
obs.only
set to F (default is T) to train on all rows in the data set instead of the observed rows only; and
tree.imp
set to T (default is F) to predict each missing value using a randomly selected tree rather than the whole-of-forest aggregated prediction.
Collectively, these three changes make the procedure similar to Multivariate Imputation by Chained Equations (van Buuren and Groothuis-Oudshoorn, 2011).
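The difference between a whole-of-forest prediction and the single-tree prediction described for tree.imp can be illustrated with a matrix of per-tree predictions. The values below are simulated purely for illustration, and the variable names are hypothetical. How the package obtains per-tree predictions is not stated here; ranger's predict method does offer a predict.all option that returns them, but whether it is used is an assumption.

```r
set.seed(42)
# Simulated per-tree predictions: 4 missing cases (rows) x 5 trees (columns).
tree_preds <- matrix(rnorm(20, mean = 10), nrow = 4, ncol = 5)

# Whole-of-forest prediction: average over all trees for each case.
forest_pred <- rowMeans(tree_preds)

# Single-tree prediction: choose one tree at random, independently per case,
# and take that tree's prediction for the case.
picked_tree <- sample(ncol(tree_preds), nrow(tree_preds), replace = TRUE)
single_pred <- tree_preds[cbind(seq_len(nrow(tree_preds)), picked_tree)]

print(cbind(forest_pred, picked_tree, single_pred))
```

Single-tree draws retain between-tree variability that averaging removes, which is what moves the imputations closer to draws from a distribution than to point predictions.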
The convergence criterion can be modified by the stop.measure
argument. The default is to measure the mean rank correlation between
iterations of the ordered data and the stationary rate of the categorical
data (see measure_correlation). The procedure halts when both of
these values are less than or equal to their values at the previous
iteration (see stop_condition). The original Stekhoven and Buehlmann (2012)
measure is provided by the measure_stekhoven_2012 function.
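A rough sketch of a correlation-style measure and its stopping rule is given below. It is not the code of measure_correlation or stop_condition: the names sketch_measure and sketch_stop are hypothetical, the argument names are assumptions, and for simplicity the sketch treats only non-factor columns as ordered data (ordered factors are counted as categorical).

```r
# Hypothetical stand-in for a stop.measure-style function; compares the two
# most recently completed data sets.
sketch_measure <- function(X_prev, X_new) {
  is_cat <- vapply(X_new, is.factor, logical(1))
  # Mean Spearman rank correlation between iterations, over non-factor columns.
  rank_cor <- mean(mapply(function(a, b) cor(a, b, method = "spearman"),
                          X_prev[!is_cat], X_new[!is_cat]))
  # Stationary rate: share of categorical values unchanged between iterations.
  stationary <- mean(mapply(function(a, b) mean(as.character(a) == as.character(b)),
                            X_prev[is_cat], X_new[is_cat]))
  c(rank_cor = rank_cor, stationary = stationary)
}

# Halt once both values are less than or equal to those of the previous iteration.
sketch_stop <- function(current, previous) all(current <= previous)

# Three illustrative completed data sets from successive iterations.
it1 <- data.frame(x = c(1, 2, 3, 4), g = factor(c("a", "b", "a", "b")))
it2 <- data.frame(x = c(1, 3, 2, 4), g = factor(c("a", "b", "b", "b")))
it3 <- it2

m12 <- sketch_measure(it1, it2)
m23 <- sketch_measure(it2, it3)
keep_going <- !sketch_stop(m23, m12)   # both measures rose, so iteration continues
```

Under this rule the procedure keeps iterating while either measure is still increasing and stops at the first iteration where neither does.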
named list; results of the iterative procedure, given as:
converged
logical; indicator of convergence.
oob_error
data.frame; variable-wise out-of-bag error at each iteration, described by the columns:
  iteration
  numeric.
  variable
  factor; name of column in data set.
  measure
  factor; one of mse (mean square error) for non-integer numeric data or pfc (proportion falsely classified).
  value
  numeric; out-of-bag error.
stop_measures
list; containing the value returned by stop.measure at each iteration.
imputed
list; each item is a named list of imputed values at each iteration, in order of appearance in X_init.
Stekhoven, D.J. and Buehlmann, P., 2012. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597
Van Buuren, S. and Groothuis-Oudshoorn, K., 2011. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), pp. 1-67. doi:10.18637/jss.v045.i03
Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01
measure_correlation
measure_stekhoven_2012
missForest
ranger
stop_condition