smirf: Single or multiple imputation of missing data using random...

View source: R/smirf.R

Description

(Multiple) imputation of missing data using the missForest algorithm of Stekhoven and Buehlmann (2012) (default) or, alternatively, the MICE with random forest procedure of Doove et al. (2014). Forests are trained with ranger (Wright and Ziegler, 2017), a fast implementation of the random forest algorithm.

Usage

smirf(X, model = NULL, n = 5L, gibbs = F, tree.imp = F,
  boot.train = F, obs.only = T, verbose = F,
  X.init.fn = no_information_impute,
  stop.measure = measure_correlation, loop.limit = 10L,
  overrides = list(), clean.step = list(), ...)
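
As an illustration of the signature above, the following sketch introduces missing values into the built-in iris data and, if the package is available, requests two imputations. The package name miForang is taken from the page footer; that smirf is exported under that name, and the choice of num.trees = 50 (passed through ... to ranger), are assumptions rather than documented facts.

```r
# Toy incomplete data set: iris with values removed at random.
set.seed(42)
X <- iris
X$Sepal.Length[sample(nrow(X), 15)] <- NA   # numeric column, 15 missing
X$Species[sample(nrow(X), 10)] <- NA        # factor column, 10 missing

# Only attempt the imputation when the (assumed) package is installed.
if (requireNamespace("miForang", quietly = TRUE)) {
  imp <- miForang::smirf(X, n = 2L, num.trees = 50)
  print(names(imp))  # expected to include the items described under Value
}
```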

Arguments

X

data.frame; an incomplete data set containing any of the numeric, logical, integer, factor and ordered data types.

model

matrix; logical matrix indicating whether a predictor (named column) is included in the model for an imputed variable (named row). Variables are imputed in row order. The default is a matrix of ones with a row for each partially but not completely missing variable (ordered from least to most missing) and a column for every partially complete variable.
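
The default described above can be reproduced by hand, which is a convenient starting point for a custom model. This is a minimal sketch in base R; it assumes that "partially complete" means "at least one observed value", and the column names and the exclusion of `c` as a predictor of `a` are purely illustrative. How the package treats the diagonal (a variable as its own predictor) is not stated in the source, so it is left untouched here.

```r
X <- data.frame(
  a = c(1.5, NA, 3.2, 4.1),
  b = factor(c("x", NA, NA, "y")),
  c = c(TRUE, FALSE, TRUE, TRUE),   # complete: can only act as a predictor
  d = c(NA_real_, NA, NA, NA)       # entirely missing: excluded
)

n_missing <- colSums(is.na(X))

# Rows: partially (not completely) missing variables, least missing first.
to_impute <- names(sort(n_missing[n_missing > 0 & n_missing < nrow(X)]))

# Columns: every variable with at least one observed value (assumption).
predictors <- names(n_missing)[n_missing < nrow(X)]

model <- matrix(TRUE, nrow = length(to_impute), ncol = length(predictors),
                dimnames = list(to_impute, predictors))

# Hypothetical customisation: do not use `c` when imputing `a`.
model["a", "c"] <- FALSE
```

Reordering the rows of `model` would change the order of imputation, since the row order determines it.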

n

numeric scalar; the number of imputations, i.e. the number of times the missForest algorithm is run.

gibbs

logical; use Gibbs sampling in training steps (T) rather than the predictions from the previous iteration (default).

tree.imp

logical; use a prediction of missing data from a single tree in the forest when training (T) rather than the bagged predicted value (default).

boot.train

logical; train each forest on a bootstrap sample of the observed data when T, rather than the observed data (default).

obs.only

logical; train on observed outcomes only (T, default) or on all data, including predicted/sampled values of missing outcomes (F).

verbose

logical; print additional output.

X.init.fn

function; creates a completed data set to be used as the initial state of the missForest procedure given two arguments;

  • a data.frame

  • a list with an item indicating the missing (T) or not-missing (F) status of each column of the first argument,

the default no_information_impute serves as an example.
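
A custom initialiser only has to honour the two-argument contract above and return a completed data set. The function below is a hypothetical sketch, not one of the package's own initialisers: it fills each missing entry with a random draw from the observed values of the same column. It assumes the second argument is a named list of logical vectors, one per column.

```r
# Hypothetical initialiser: draw missing values from observed values.
# `indicator` is assumed to be a named list of logical vectors (TRUE = missing).
draw_observed_init <- function(X, indicator) {
  for (nm in names(indicator)) {
    miss <- indicator[[nm]]
    if (any(miss) && !all(miss)) {
      observed <- X[[nm]][!miss]
      # sample.int avoids the sample() pitfall with length-one numeric input.
      picks <- sample.int(length(observed), sum(miss), replace = TRUE)
      X[[nm]][miss] <- observed[picks]
    }
  }
  X
}

set.seed(1)
X <- data.frame(a = c(1.5, NA, 3.2), b = factor(c("x", NA, "y")))
indicator <- lapply(X, is.na)
X_init <- draw_observed_init(X, indicator)
```

Under these assumptions it could be supplied as `X.init.fn = draw_observed_init`.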

stop.measure

function; evaluates the difference or relationship between the two most recently completed data sets during iteration, must accept the following arguments;

X

named list with imputed values (in order of appearance by row) for each column in the data set;

Y

named list with imputed values (in order of appearance by row) for each column in the data set;

X_init

the original (mixed-type) data set with missing values replaced as at the starting point of missForest;

indicator

a list with the missing (=T) or not missing (=F) status of the original data set;

and should return a numeric (vector), the default measure_correlation serves as an example, or see the original measure proposed by Stekhoven and Buehlmann (2012) in measure_stekhoven_2012.
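
To make the expected interface concrete, here is a hypothetical measure that accepts the four arguments listed above and returns a numeric vector. It reports the mean absolute change of continuous imputations and the proportion of changed categorical imputations. Which of X and Y holds the more recent imputations is not stated in the source, so the measure is written to be symmetric in the two; how the stopping rule interprets a custom measure is likewise an open assumption.

```r
# Hypothetical stop measure; X_init and indicator are accepted but unused.
measure_change <- function(X, Y, X_init, indicator) {
  is_num <- vapply(X, is.double, logical(1))
  num_change <- mapply(function(x, y) mean(abs(x - y)),
                       X[is_num], Y[is_num])
  cat_change <- mapply(function(x, y) mean(as.character(x) != as.character(y)),
                       X[!is_num], Y[!is_num])
  c(numeric     = if (length(num_change)) mean(num_change) else 0,
    categorical = if (length(cat_change)) mean(cat_change) else 0)
}

# Illustrative imputed values from two consecutive iterations.
iter_a <- list(a = c(2.5, 3.0), b = factor(c("x", "x"), levels = c("x", "y")))
iter_b <- list(a = c(2.0, 3.0), b = factor(c("x", "y"), levels = c("x", "y")))
m <- measure_change(iter_a, iter_b, NULL, NULL)
```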

loop.limit

numeric; maximum number of iterations within the missForest procedure.

overrides

named list; (variable-wise) over-rides for arguments passed to ranger when training on the response variable given by the name of the item.

clean.step

named list; each item is a function to clean or post-process the named imputed data immediately after it is imputed, taking two arguments;

  • the subset of the data used in the current training step which had missing values of the named data,

  • the most recently imputed values of the named data,

and should return (post-processed) data of the same length and type as the second argument.
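
For example, a clean step might keep imputed values inside a plausible range. The sketch below is hypothetical: the column name `age` and the bounds are invented for illustration, and the first argument is ignored.

```r
# Hypothetical clean step: clamp imputed values to the interval [0, 120].
clamp_age <- function(data, imputed) {
  pmin(pmax(imputed, 0), 120)
}

# One function per named column to be post-processed.
clean_steps <- list(age = clamp_age)

cleaned <- clean_steps$age(NULL, c(-3, 40, 130))
```

The result has the same length and type as the imputed values passed in, as the contract above requires.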

...

further arguments passed to all calls to ranger, e.g. num.trees for the number of trees in each forest.

Details

For a full description of the missForest algorithm, see Stekhoven and Buehlmann (2012). In brief, at each iteration the missing values of each variable are imputed by the predictions of a random forest trained on the observed cases of that variable, using the values of the predictors from the completed data set of the previous iteration. This is repeated until some measure of the relationship between iterations indicates convergence, usually by decreasing relative to its value at the previous iteration.
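
The iteration scheme can be sketched in a few lines of base R. This is not the package's implementation: a linear model (lm) stands in for the random forest so that the example runs without ranger, only numeric columns are handled, and the stopping rule is a simple "imputations stopped changing" check rather than the measures described on this page. The function name and data are hypothetical.

```r
# Sketch of the iteration scheme, with lm substituted for the random forest.
iterate_impute <- function(X, max_iter = 10L) {
  miss <- lapply(X, is.na)
  # Initial state: fill each column with its observed mean.
  for (nm in names(X)) X[[nm]][miss[[nm]]] <- mean(X[[nm]], na.rm = TRUE)
  # Impute in order of least to most missing, skipping complete columns.
  order_imp <- names(sort(vapply(miss, sum, integer(1))))
  order_imp <- order_imp[vapply(miss[order_imp], any, logical(1))]
  for (iter in seq_len(max_iter)) {
    X_prev <- X
    for (nm in order_imp) {
      obs <- !miss[[nm]]
      # Train on observed cases; predictors come from the previous iteration.
      fit <- lm(reformulate(setdiff(names(X), nm), response = nm),
                data = X_prev[obs, , drop = FALSE])
      X[[nm]][!obs] <- predict(fit, newdata = X_prev[!obs, , drop = FALSE])
    }
    # Simplified criterion: stop once the imputations no longer change.
    if (max(abs(as.matrix(X) - as.matrix(X_prev))) < 1e-6) break
  }
  X
}

set.seed(1)
x1 <- rnorm(50)
X <- data.frame(x1 = x1, x2 = 2 * x1 + rnorm(50, sd = 0.1),
                x3 = rnorm(50))
X$x2[1:5] <- NA
X$x3[6:8] <- NA
X_imp <- iterate_impute(X)
```

Because every model in an iteration reads its predictors from `X_prev`, this corresponds to the default behaviour; reading from `X` instead would resemble what the gibbs argument is described as doing.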

By default the columns are imputed in order from least missing to most missing. This can be overridden by the model argument. Columns that are entirely missing are excluded. Non-integer numeric data are treated as continuous and predicted by regression forests, while all other data, including integer and logical data, are predicted by classification forests. No special treatment is given to ordered categorical data.
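
The rule for choosing the forest type can be restated as a small helper. This is an illustrative reading of the paragraph above, not code from the package: it assumes "non-integer numeric" refers to the storage type (double), and the function and column names are hypothetical.

```r
# Assumed restatement: doubles get regression, everything else classification.
forest_type <- function(x) {
  if (is.double(x)) "regression" else "classification"
}

X <- data.frame(
  height = c(1.62, 1.75),                       # double
  visits = c(2L, 5L),                           # integer
  smoker = c(TRUE, FALSE),                      # logical
  group  = factor(c("a", "b")),                 # factor
  grade  = factor(c("low", "high"),
                  levels = c("low", "high"), ordered = TRUE)  # ordered
)
types <- vapply(X, forest_type, character(1))
```

If this reading holds, an integer count column such as `visits` would be treated as categorical unless it is converted with as.numeric before imputation.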

The call to ranger may be modified by the ... arguments, and any variable-specific argument to pass may be specified in the overrides argument.
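
A sketch of how the two mechanisms might be combined: a global setting passed through ... and per-variable settings in overrides. The column names `age` and `smoker` are hypothetical; num.trees, mtry and min.node.size are ranger arguments.

```r
# Per-variable ranger arguments, keyed by the response variable's name.
ranger_overrides <- list(
  age    = list(num.trees = 200, min.node.size = 10),
  smoker = list(mtry = 2)
)

# Example of use, assuming `X` has columns `age` and `smoker`:
# imp <- smirf(X, num.trees = 100, overrides = ranger_overrides)
```

How a per-variable value interacts with a global value of the same argument is not stated in the source, although the name overrides suggests that the per-variable value takes precedence.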

The key modifications to the missForest procedure are governed by the following arguments:

gibbs

use the most recent predictions for each variable in training and prediction as they become available, like a Gibbs sampler by setting this to T (default is F);

tree.imp

predict using a randomly selected tree for each missing value rather than use the whole-of-forest aggregated prediction by setting this to T (default is F);

boot.train

train on a bootstrapped resample of the data by setting this to T (default is F), and;

obs.only

train on all rows in the data set by setting to F or train on observed data only (default is T).

Switching the first two to T invokes a procedure similar to the Multiple Imputation by Chained Equations of Doove et al. (2014). The third option can be used to improve CI coverage (Bartlett 2014). The final option (along with changes to the first two) will mimic van Buuren and Groothuis-Oudshoorn (2012), except for the process for drawing values from leaf nodes.
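
The combinations described above can be collected as named argument sets. The labels are hypothetical, and for the bootstrap variant the source does not say which other switches it should be combined with, so the defaults are assumed there.

```r
variants <- list(
  # Defaults: the missForest procedure.
  missforest      = list(gibbs = FALSE, tree.imp = FALSE,
                         boot.train = FALSE, obs.only = TRUE),
  # First two switched on: similar to Doove et al. (2014).
  mice_rf_like    = list(gibbs = TRUE, tree.imp = TRUE,
                         boot.train = FALSE, obs.only = TRUE),
  # Third switched on (other settings assumed to stay at their defaults).
  bootstrap_train = list(gibbs = FALSE, tree.imp = FALSE,
                         boot.train = TRUE, obs.only = TRUE),
  # Final option changed along with the first two.
  all_rows        = list(gibbs = TRUE, tree.imp = TRUE,
                         boot.train = FALSE, obs.only = FALSE)
)

# Example of use, assuming `X` is an incomplete data.frame:
# imp <- do.call(smirf, c(list(X), variants$mice_rf_like))
```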

The convergence criterion can be modified by the stop.measure argument. The default is to measure the mean rank correlation between iterations of the ordered data and the stationary rate of the categorical data (see measure_correlation). The procedure halts when both of these values are less than or equal to the previous values (see stop_condition). The original Stekhoven and Buehlmann (2012) measure is provided by the measure_stekhoven_2012 function.
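
The halting rule in the paragraph above amounts to an element-wise comparison of the current measures with the previous ones. The helper below is a hypothetical illustration of that rule, not the package's stop_condition, and the measure values are invented.

```r
# Halt when every measure is less than or equal to its previous value.
halts <- function(current, previous) all(current <= previous)

# Invented (correlation, stationary rate) pairs for three iterations.
measures <- list(c(0.80, 0.70), c(0.91, 0.85), c(0.90, 0.85))

stop_at <- NA_integer_
for (i in seq_along(measures)[-1]) {
  if (halts(measures[[i]], measures[[i - 1]])) {
    stop_at <- i
    break
  }
}
```

With these values the loop stops at the third iteration: the first measure has dropped and the second is unchanged.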

Value

list; containing the following items;

call

the call used to create the multiply imputed data sets;

results

list where each item (numbered) is itself a named list of the output for an imputed data set;

converged

boolean convergence status;

imputed

list of imputed data by iteration and variable;

iterations

numeric count of iterations before the stopping criterion was met;

oob_error

list of out-of-bag (OOB) error by iteration and variable;

stop_measures

output of the call to stop.measure at each iteration;

which_imputed

named list of which rows the imputed named data belong to.

References

Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in electronic health record data', presented to 27th International Biometric Conference, Florence, July 6-11.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025

Stekhoven, D.J. and Buehlmann, P., 2012. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597

Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01

See Also

measure_correlation, measure_stekhoven_2012, stop_condition, no_information_impute, sample_impute, missForest, ranger


stephematician/miForang documentation built on July 23, 2019, 5:11 p.m.