Multiple imputation of missing data using the missForest algorithm of Stekhoven and Buehlmann (2012) (default) or, alternatively, the MICE with random forest procedure of Doove et al (2014). Forests are trained with ranger (Wright and Ziegler, 2017), a fast implementation of the random forest algorithm.
smirf(X, model = NULL, n = 5L, gibbs = F, tree.imp = F,
      boot.train = F, obs.only = T, verbose = F,
      X.init.fn = no_information_impute,
      stop.measure = measure_correlation, loop.limit = 10L,
      overrides = list(), clean.step = list(), ...)
X |
data.frame; an incomplete data set including any of numeric, logical, integer, factor and ordered data types. |
model |
matrix; logical matrix indicating inclusion of a predictor (named column) in the model of an imputed variable (named row). The order of imputation is the row order. The default is a matrix of ones with a row for each partially, but not completely, missing variable (ordered from least to most missing) and a column for every partially complete variable. |
n |
numeric scalar; the number of imputations, i.e. the number of times the missForest algorithm is run. |
gibbs |
logical; use Gibbs sampling in training steps (default is F). |
tree.imp |
logical; use the prediction of missing data from a single tree in the forest when training (default is F). |
boot.train |
logical; train each forest on a bootstrap sample of the observed data (default is F). |
obs.only |
logical; train on only observed outcomes (default) or use all data, including predicted/sampled values of missing outcomes (by setting to F). |
verbose |
logical; print additional output. |
X.init.fn |
function taking two arguments; creates a completed data set to be used as the initial state of the missForest procedure. The default is no_information_impute. |
stop.measure |
function; evaluates the difference or relationship between the two most recently completed data sets during iteration and should return a numeric (vector). The default is measure_correlation. |
loop.limit |
numeric; maximum number of iterations within missForest procedure. |
overrides |
named list; (variable-wise) overrides for arguments passed to ranger. |
clean.step |
named list; each item is a function to clean or post-process the named imputed data immediately after it is imputed. Each function takes two arguments and should return (post-processed) data of the same length and type as the second argument. |
... |
further arguments passed to all calls to ranger. |
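The default for the model argument can be illustrated with a short base R sketch. The helper default_model below is hypothetical (it is not part of smirf) and assumes that "partially complete" means a variable with at least one observed value. It shows how a matrix of that shape might be built and then edited to drop a predictor and change the order of imputation.

```r
# Toy data with differing amounts of missingness (illustrative only).
X <- data.frame(
  a = c(1.5, NA, 3.2, 4.1, 5.0),         # 1 missing
  b = factor(c("u", NA, NA, "v", "u")),  # 2 missing
  c = c(1L, 2L, 3L, 4L, 5L),             # complete
  d = rep(NA_real_, 5)                   # entirely missing
)

# Hypothetical helper: rebuilds the default described for `model`.
default_model <- function(X) {
  n_miss <- colSums(is.na(X))
  # Rows: variables that are partially, but not completely, missing,
  # ordered from least to most missing.
  targets <- names(sort(n_miss[n_miss > 0 & n_miss < nrow(X)]))
  # Columns: every variable with at least one observed value (assumption).
  predictors <- names(n_miss)[n_miss < nrow(X)]
  matrix(TRUE, nrow = length(targets), ncol = length(predictors),
         dimnames = list(targets, predictors))
}

model <- default_model(X)

# Drop `c` as a predictor of `b`, then impute `b` before `a`.
model["b", "c"] <- FALSE
model <- model[c("b", "a"), , drop = FALSE]
print(model)
```

A matrix edited in this way could then be supplied as the model argument.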
For a full description of the missForest algorithm, see Stekhoven and Buehlmann (2012). In brief, at each iteration the missing values of each variable are imputed by the predictions of a random forest trained on the observed cases of that variable, using the values of predictors from the completed data set of the previous iteration. This is repeated until some measure of the relationship between iterations indicates convergence, usually by decreasing from its value at the previous iteration.
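The iteration described above can be sketched for a single imputed data set as follows. This is a minimal illustration and not the smirf implementation: the random-draw initial fill, the fixed cap of five iterations (in place of a stop measure) and the choice of 100 trees are all assumptions made for brevity.

```r
library(ranger)

set.seed(1)
X <- iris
X$Sepal.Length[sample(nrow(X), 15)] <- NA
X$Species[sample(nrow(X), 10)] <- NA

miss <- is.na(X)  # remember which cells were missing
n_miss <- colSums(miss)
targets <- names(sort(n_miss[n_miss > 0]))  # least to most missing

# Crude initial fill by sampling observed values (an assumption).
completed <- X
for (v in targets) {
  observed <- X[[v]][!miss[, v]]
  completed[[v]][miss[, v]] <- sample(observed, sum(miss[, v]), replace = TRUE)
}

for (iter in 1:5) {  # fixed cap used here instead of a stop measure
  previous <- completed
  updated <- previous
  for (v in targets) {
    # Train on rows where v was observed; predictor values come from
    # the completed data set of the previous iteration.
    fit <- ranger(dependent.variable.name = v,
                  data = previous[!miss[, v], , drop = FALSE],
                  num.trees = 100)
    pred <- predict(fit, data = previous[miss[, v], , drop = FALSE])$predictions
    updated[[v]][miss[, v]] <- pred
  }
  completed <- updated
}
```

Because every forest in an iteration reads from previous rather than updated, this corresponds to the default (non-Gibbs) behaviour described in this section.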
By default the columns are imputed in the order of least missing to most missing. This can be over-ridden by the model argument. Columns that are entirely missing are excluded. Non-integer numeric data is treated as continuous and predicted by regression forests while all other data, including integer and logical data, are predicted via classification forests. No special treatment is given to ordered categorical data.
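The rule for choosing the forest type can be written as a small check on each column. The helper forest_type is hypothetical, and the sketch assumes that "integer" refers to the R storage type rather than to whole-valued doubles.

```r
# Hypothetical helper mirroring the rule stated above.
forest_type <- function(x) {
  if (is.numeric(x) && !is.integer(x)) "regression" else "classification"
}

X <- data.frame(
  height = c(1.62, 1.80, NA),    # double: regression
  visits = c(2L, NA, 5L),        # integer: classification
  smoker = c(TRUE, FALSE, NA),   # logical: classification
  grade  = factor(c("low", NA, "high"),
                  levels = c("low", "high"), ordered = TRUE)
)

types <- vapply(X, forest_type, character(1))
print(types)
```

Under that assumption, an integer count that should be modelled as continuous would need to be converted with as.numeric() before imputation.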
The call to ranger may be modified by the ... arguments, and any variable-specific argument to pass may be specified in the overrides argument.
The key modifications to the missForest procedure are governed by the arguments:
gibbs
use the most recent predictions for each variable in training and prediction as they become available, like a Gibbs sampler, by setting this to T (default is F);
tree.imp
predict using a randomly selected tree for each missing value rather than the whole-of-forest aggregated prediction by setting this to T (default is F);
boot.train
train on a bootstrapped resample of the data (default is F), and;
obs.only
train on all rows in the data set by setting to F, or train on observed data only (default is T).
Switching the first two to T invokes a procedure similar to Multiple Imputation via Chained Equations of Doove et al (2014). The third option can be used to improve CI coverage (Bartlett 2014). The final option (along with changes to the first two) will mimic van Buuren and Groothuis-Oudshoorn (2012), except for the process for drawing values from leaf nodes.
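The variants above might be requested as in the following sketch. It assumes the smirf package is installed and exports smirf with the signature shown in the usage; the toy data and seed are illustrative, and the calls are guarded so that nothing runs when the package is absent.

```r
if (requireNamespace("smirf", quietly = TRUE)) {
  set.seed(42)
  X <- iris
  X$Sepal.Length[sample(nrow(X), 15)] <- NA
  X$Species[sample(nrow(X), 10)] <- NA

  # Default: missForest-style imputation, repeated n times.
  imp_mf <- smirf::smirf(X, n = 5L)

  # MICE-like variant: Gibbs-style updates with single-tree predictions.
  imp_mice <- smirf::smirf(X, n = 5L, gibbs = TRUE, tree.imp = TRUE)

  # As above, with each forest trained on a bootstrap resample.
  imp_boot <- smirf::smirf(X, n = 5L, gibbs = TRUE, tree.imp = TRUE,
                           boot.train = TRUE)
}
```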
The convergence criterion can be modified by the stop.measure argument. The default is to measure the mean rank correlation between iterations of the ordered data and the stationary rate of the categorical data (see measure_correlation). The procedure halts when both of these values are less than or equal to the previous values (see stop_condition). The original Stekhoven and Buehlmann (2012) measure is provided by the measure_stekhoven_2012 function.
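The default criterion can be illustrated with base R stand-ins. The functions sketch_measure and sketch_stop are hypothetical and are not the package's measure_correlation or stop_condition; the sketch assumes that whole completed data sets are compared, that "ordered data" covers numeric and ordered factor columns, and that at least one column of each kind is present.

```r
# Hypothetical stand-in for a stop measure: compares two completed data sets.
sketch_measure <- function(previous, current) {
  ordered_cols <- vapply(current, function(x) is.numeric(x) || is.ordered(x),
                         logical(1))
  # Mean Spearman rank correlation over ordered columns.
  rank_cor <- mapply(function(p, q) cor(as.numeric(p), as.numeric(q),
                                        method = "spearman"),
                     previous[ordered_cols], current[ordered_cols])
  # Mean proportion of unchanged values over categorical columns.
  stationary <- mapply(function(p, q) mean(as.character(p) == as.character(q)),
                       previous[!ordered_cols], current[!ordered_cols])
  c(rank_correlation = mean(rank_cor), stationary_rate = mean(stationary))
}

# Hypothetical stand-in for the stopping rule: halt once every measure is
# less than or equal to its value at the previous iteration.
sketch_stop <- function(measures) {
  k <- length(measures)
  k >= 2 && all(measures[[k]] <= measures[[k - 1]])
}

it1 <- data.frame(x = c(1, 2, 3, 4), g = factor(c("a", "b", "a", "b")))
it2 <- data.frame(x = c(1, 3, 2, 4), g = factor(c("a", "b", "b", "b")))
it3 <- it2
it4 <- it3

m1 <- sketch_measure(it1, it2)  # data still changing
m2 <- sketch_measure(it2, it3)  # no change: both measures rise to 1
m3 <- sketch_measure(it3, it4)  # no change again: measures plateau

sketch_stop(list(m1, m2))       # measures still increasing
sketch_stop(list(m1, m2, m3))   # measures no longer increasing
```

In this toy sequence the rule does not fire while the measures are still rising and fires once they plateau.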
list; containing the following items;
the call used to create the multiply imputed data sets;
list where each item (numbered) is itself a named list of the output for an imputed data set;
converged
boolean convergence status;
imputed
list of imputed data by iteration and variable;
iterations
numeric count of iterations before stopping criteria met;
oob_error
list of oob error by iteration and variable;
stop_measures
output of the call to stop.measure at each iteration;
named list of which rows the imputed named data belong to.
Bartlett, J., 2014. Methodology for multiple imputation for missing data in electronic health record data. Presented to the 27th International Biometric Conference, Florence, July 6-11.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025
Stekhoven, D.J. and Buehlmann, P., 2012. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597
Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01
measure_correlation
measure_stekhoven_2012
stop_condition
no_information_impute
sample_impute
missForest
ranger