smirf: Single or multiple imputation of missing data using random...
In stephematician/miForang: Single or multiple imputation of missing data using random forests

Description Usage Arguments Details Value References See Also

Single or multiple imputation of missing data using the missForest algorithm of Stekhoven and Buehlmann (2012) (the default) or, alternatively, the MICE with random forest procedure of Doove et al. (2014). Forests are trained with ranger (Wright and Ziegler, 2017), a fast implementation of the random forest algorithm.

smirf(X, model = NULL, n = 5L, gibbs = F, tree.imp = F,
  boot.train = F, obs.only = T, verbose = F,
  X.init.fn = no_information_impute,
  stop.measure = measure_correlation, loop.limit = 10L,
  overrides = list(), clean.step = list(), ...)

`X`	data.frame; an incomplete data set including any of numeric, logical, integer, factor and ordered data types.
`model`	matrix; logical matrix which indicates inclusion of a predictor (named column) in the model of an imputed value (named row). The order of imputation is the row order. The default is a matrix of ones with a row for each partially (but not completely) missing variable, in order of least to most missing, and a column for every partially complete variable.
`n`	numeric scalar; the number of imputations - i.e. number of times the missForest algorithm is used.
`gibbs`	logical; use Gibbs sampling in training steps (`T`) rather than the predictions from the previous iteration (default).
`tree.imp`	logical; use a prediction of missing data from a single tree in the forest when training (`T`) rather than the bagged predicted value (default).
`boot.train`	logical; train each forest on a bootstrap sample of the observed data when `T`, rather than the observed data (default).
`obs.only`	logical; train on only observed outcomes (`T`, the default) or use all data including predicted/sampled values of missing outcomes (`F`).
`verbose`	logical; print additional output.
`X.init.fn`	function; creates a completed data set to be used as the initial state of the missForest procedure, given two arguments: a data.frame, and a list with an item indicating the missing (`T`) or not-missing (`F`) status of each column of the first argument. The default `no_information_impute` serves as an example.
`stop.measure`	function; evaluates the difference or relationship between the two most recently completed data sets during iteration. It must accept the following arguments: `X`, a named list with imputed values (in order of appearance by row) for each column in the data set; `Y`, a named list with imputed values (in order of appearance by row) for each column in the data set; `X_init`, the original (mixed-type) data set with missing values replaced as at the starting point of missForest; and `indicator`, a list with the missing (`=T`) or not missing (`=F`) status of the original data set. It should return a numeric (vector). The default `measure_correlation` serves as an example, or see the original measure proposed by Stekhoven and Buehlmann (2012) in `measure_stekhoven_2012`.
`loop.limit`	numeric; maximum number of iterations within missForest procedure.
`overrides`	named list; (variable-wise) overrides for arguments passed to `ranger` when training on the response variable given by the name of the item.
`clean.step`	named list; each item is a function to clean or post-process the named imputed data immediately after it is imputed. Each function takes two arguments: the subset of the data used in the current training step which had missing values of the named data, and the most recently imputed values of the named data. It should return (post-processed) data of the same length and type as the second argument.
`...`	further arguments passed to all calls to `ranger`, e.g. `num.trees` for the number of trees in each forest.
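
The `X.init.fn` and `clean.step` interfaces described above can be sketched in base R. This is a minimal illustration only: `median_mode_impute` and `clamp_non_negative` are hypothetical names, and the sketch assumes that the second argument of `X.init.fn` is a list named by column, with one logical vector per column.

```r
# Hypothetical initialiser with the two-argument X.init.fn interface:
# median for numeric columns, most frequent observed value otherwise.
# Assumption: `indicator` is a list named by column, TRUE where missing.
median_mode_impute <- function(X, indicator) {
  for (v in names(X)) {
    miss <- indicator[[v]]
    if (!any(miss)) next
    observed <- X[[v]][!miss]
    if (length(observed) == 0) next      # entirely missing: leave untouched
    if (is.numeric(observed)) {
      fill <- median(observed)
      if (is.integer(observed)) fill <- as.integer(round(fill))
    } else {
      counts <- table(observed)
      fill <- names(counts)[which.max(counts)]
      if (is.logical(observed)) fill <- as.logical(fill)
    }
    X[[v]][miss] <- fill
  }
  X
}

# Hypothetical clean.step function: first argument is the data subset used
# in the training step, second is the freshly imputed values. It returns
# data of the same length and type as the second argument.
clamp_non_negative <- function(X_missing, imputed) pmax(imputed, 0)

X_demo <- data.frame(a = c(1, NA, 3),
                     b = factor(c("u", NA, "u")),
                     d = c(TRUE, NA, TRUE))
X_start <- median_mode_impute(X_demo, lapply(X_demo, is.na))
cleaned <- clamp_non_negative(X_demo[2, , drop = FALSE], c(-0.4, 1.2))
```

Under these assumptions the functions would be supplied as `X.init.fn = median_mode_impute` and `clean.step = list(a = clamp_non_negative)`.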

For a full description of the missForest algorithm, see Stekhoven and Buehlmann (2012). In brief, at each iteration missing values are imputed for each variable by the predictions of a random forest trained on the observed cases of that variable using the values of predictors from the completed data set from the previous iteration. This is repeated until some measure of the relationship between iterations indicates convergence - usually by decreasing from the measure at the previous iteration.
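
The iteration described above can be sketched in a few lines of base R. To keep the example free of dependencies, `lm()` stands in for the random forest, so this is not the missForest algorithm itself, only the same loop structure. The simulated data, the mean initialisation and the tolerance-based stopping rule are all assumptions made for illustration.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)
x3 <- x1 - x2 + rnorm(n, sd = 0.3)
X <- data.frame(x1 = x1, x2 = x2, x3 = x3)
X$x2[sample(n, 30)] <- NA
X$x3[sample(n, 50)] <- NA

is_missing <- lapply(X, is.na)
n_missing <- vapply(is_missing, sum, integer(1))
# Impute partially missing variables, least missing first.
impute_order <- names(sort(n_missing[n_missing > 0 & n_missing < n]))

# Initial state: fill each incomplete column with its observed mean.
completed <- X
for (v in impute_order) {
  completed[[v]][is_missing[[v]]] <- mean(X[[v]], na.rm = TRUE)
}

max_iter <- 10L
for (iter in seq_len(max_iter)) {
  previous <- completed
  for (v in impute_order) {
    observed <- !is_missing[[v]]
    # Train on observed cases of v; predictors come from the previous iteration.
    # lm() is a stand-in here for a random forest.
    fit <- lm(reformulate(setdiff(names(X), v), response = v),
              data = previous[observed, ])
    completed[[v]][!observed] <- predict(fit, newdata = previous[!observed, ])
  }
  # Simplified stopping rule (an assumption, not the package's criterion).
  change <- max(abs(unlist(completed[impute_order]) -
                    unlist(previous[impute_order])))
  if (change < 1e-4) break
}
```

Observed values are never overwritten; only the entries flagged in `is_missing` change between iterations.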

By default the columns are imputed in the order of least missing to most missing. This can be overridden by the model argument. Columns that are entirely missing are excluded. Non-integer numeric data are treated as continuous and predicted by regression forests, while all other data, including integer and logical data, are predicted via classification forests. No special treatment is given to ordered categorical data.
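
As a rough illustration of the rules above, the following base R sketch builds a matrix shaped like the described default for `model`, reorders it, and classifies each column by forest type. The data are invented, and two points are assumptions: that "partially complete" includes fully observed columns, and that the continuous/categorical split follows R's storage type.

```r
X <- data.frame(
  age    = c(23, NA, 41, 35, NA, 52),
  visits = c(1L, 2L, NA, 4L, 5L, 6L),
  smoker = c(TRUE, FALSE, NA, NA, NA, TRUE),
  score  = c(0.5, 1.2, 0.7, 0.9, 1.1, 0.3),
  notes  = rep(NA_character_, 6),      # entirely missing, so excluded
  stringsAsFactors = FALSE
)

n_missing <- vapply(X, function(x) sum(is.na(x)), integer(1))
partially_missing  <- n_missing > 0 & n_missing < nrow(X)
partially_complete <- n_missing < nrow(X)

# Rows: variables to impute, least missing first. Columns: candidate predictors.
row_vars <- names(sort(n_missing[partially_missing]))
col_vars <- names(X)[partially_complete]
model <- matrix(TRUE, nrow = length(row_vars), ncol = length(col_vars),
                dimnames = list(row_vars, col_vars))

# Custom version: impute smoker first and drop score as a predictor of age.
custom_model <- model[c("smoker", "visits", "age"), ]
custom_model["age", "score"] <- FALSE

# Non-integer numeric -> regression forest; everything else -> classification.
forest_kind <- vapply(X[col_vars], function(x) {
  if (is.numeric(x) && !is.integer(x)) "regression" else "classification"
}, character(1))
```

A matrix such as `custom_model` is the kind of object the `model` argument expects; how the package treats the entry where a variable is its own predictor is not stated here, so the sketch leaves those entries as in the described default.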

The call to ranger may be modified by the ... arguments, which apply to every forest, and variable-specific arguments may be given in the overrides argument.
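
A small sketch of how global and variable-specific `ranger` arguments might combine. The helper `effective_args` is hypothetical and only illustrates the idea of a per-variable override; how the package merges the two internally is an assumption. It is also assumed that each item of `overrides` is itself a list of `ranger` arguments, and that the package is installed under the name `miForang`.

```r
# Numeric copy of a built-in data set with missing values in Ozone and Solar.R.
X <- as.data.frame(lapply(airquality, as.numeric))

# Arguments intended for every forest (would be passed via `...`).
global_args <- list(num.trees = 100L)

# Variable-wise overrides, named by response variable (assumed structure).
overrides <- list(Ozone = list(num.trees = 500L, min.node.size = 10L))

# Hypothetical helper: arguments in effect when training on one response.
effective_args <- function(response, global_args, overrides) {
  specific <- overrides[[response]]
  if (is.null(specific)) specific <- list()
  modifyList(global_args, specific)
}

ozone_args <- effective_args("Ozone", global_args, overrides)
solar_args <- effective_args("Solar.R", global_args, overrides)

# The corresponding call, run only if the package is available.
if (requireNamespace("miForang", quietly = TRUE)) {
  imputed <- miForang::smirf(X, n = 5L, overrides = overrides, num.trees = 100L)
}
```

Here `num.trees` and `min.node.size` are genuine `ranger` arguments; everything else in the sketch is illustrative.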

The key modifications to the missForest procedure are governed by the following arguments:

gibbs: use the most recent predictions for each variable in training and prediction as they become available, like a Gibbs sampler, by setting this to T (default is F);
tree.imp: predict using a randomly selected tree for each missing value rather than use the whole-of-forest aggregated prediction by setting this to T (default is F);
boot.train: train on a bootstrapped resample of the data by setting this to T (default is F); and
obs.only: train on all rows in the data set by setting this to F, or train on observed data only (default is T).

Switching the first two to T invokes a similar procedure to Multiple Imputation via Chained Equations of Doove et al (2014). The third option can be used to improve CI coverage (Bartlett 2014). The final option (along with changes to the first two) will mimic van Buuren and Groothuis-Oudshoorn (2012), except for the process for drawing values from leaf nodes.
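
The combinations discussed above can be written down as argument presets, and the effect of `tree.imp` can be illustrated with a toy matrix of per-tree predictions. The preset names are hypothetical, the `van_buuren_like` preset reflects one reading of "the final option (along with changes to the first two)", and the per-tree predictions are simulated rather than taken from a fitted forest.

```r
# Hypothetical presets for the four switches (names are illustrative only).
presets <- list(
  missforest      = list(gibbs = FALSE, tree.imp = FALSE,
                         boot.train = FALSE, obs.only = TRUE),
  doove_like      = list(gibbs = TRUE, tree.imp = TRUE,
                         boot.train = FALSE, obs.only = TRUE),
  doove_bootstrap = list(gibbs = TRUE, tree.imp = TRUE,
                         boot.train = TRUE, obs.only = TRUE),
  van_buuren_like = list(gibbs = TRUE, tree.imp = TRUE,
                         boot.train = FALSE, obs.only = FALSE)
)

# Toy illustration of tree.imp: 4 missing values, 25 trees (simulated).
set.seed(42)
tree_pred <- matrix(rnorm(4 * 25, mean = 10), nrow = 4)

# Whole-of-forest aggregate for a continuous variable (tree.imp = F).
bagged <- rowMeans(tree_pred)

# One randomly selected tree per missing value (tree.imp = T).
picked <- sample.int(ncol(tree_pred), nrow(tree_pred), replace = TRUE)
single_tree <- tree_pred[cbind(seq_len(nrow(tree_pred)), picked)]
```

A preset could then be spliced into a call with, for example, `do.call(miForang::smirf, c(list(X = X), presets$doove_like))`, assuming the package is installed under that name.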

The convergence criterion can be modified by the stop.measure argument. The default is to measure the mean rank correlation between iterations of the ordered data and the stationary rate of the categorical data (see measure_correlation). The procedure halts when both of these values are less than or equal to the previous values (see stop_condition). The original Stekhoven and Buehlmann (2012) measure is provided by the measure_stekhoven_2012 function.
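
A custom measure only needs to follow the `stop.measure` interface. The sketch below is a hypothetical example, not the package's `measure_correlation` or `measure_stekhoven_2012`: it returns a relative squared change for continuous columns and the proportion of changed values for all other columns. The helper `should_stop` mirrors the halting rule described above and is likewise illustrative.

```r
# Hypothetical stop.measure: X and Y are named lists of imputed values from
# the two most recent iterations; X_init and indicator are accepted to match
# the interface but are not used by this particular measure.
measure_change <- function(X, Y, X_init, indicator) {
  vars <- intersect(names(X), names(Y))
  is_continuous <- vapply(vars, function(v) is.double(X[[v]]), logical(1))
  continuous  <- vars[is_continuous]
  categorical <- vars[!is_continuous]

  num <- sum(unlist(lapply(continuous, function(v) (X[[v]] - Y[[v]])^2)))
  den <- sum(unlist(lapply(continuous, function(v) Y[[v]]^2)))
  changed <- unlist(lapply(categorical, function(v) {
    as.character(X[[v]]) != as.character(Y[[v]])
  }))

  c(continuous  = if (length(continuous) > 0 && den > 0) num / den else 0,
    categorical = if (length(changed) > 0) mean(changed) else 0)
}

# Halt when every component is no larger than at the previous iteration.
should_stop <- function(current, previous) all(current <= previous)

iter_1 <- list(a = c(1, 3), b = factor(c("u", "v"), levels = c("u", "v")))
iter_2 <- list(a = c(1, 2), b = factor(c("u", "u"), levels = c("u", "v")))
m_change <- measure_change(iter_1, iter_2, X_init = NULL, indicator = NULL)
m_same   <- measure_change(iter_2, iter_2, X_init = NULL, indicator = NULL)
```

Such a function would be supplied as `stop.measure = measure_change`; whether the package's own stopping logic accepts a named vector of this shape is an assumption.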

list; containing the following items;

call

the call used to create the multiply imputed data sets;

results

list where each item (numbered) is itself a named list of the output for an imputed data set;

converged: boolean convergence status;
imputed: list of imputed data by iteration and variable;
iterations: numeric count of iterations before stopping criteria met;
oob_error: list of oob error by iteration and variable;
stop_measures: output of the call to stop.measure at each iteration;

which_imputed

named list of which rows the imputed named data belong to.
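
To make the structure of the returned list concrete, the sketch below builds a small mock object with the items listed above and reassembles a completed data set from it. The mock values are invented, and two details are assumptions: that `imputed` is nested as iteration then variable with the final iteration last, and that `which_imputed` holds integer row indices.

```r
X <- data.frame(a = c(1, NA, 3, NA),
                b = factor(c("u", "v", NA, "u"), levels = c("u", "v")))

# Mock of the described return value (oob_error and stop_measures omitted).
mock_fit <- list(
  call = quote(smirf(X = X, n = 1L)),
  results = list(
    list(
      converged  = TRUE,
      imputed    = list(
        list(a = c(2.0, 2.0), b = factor("u", levels = c("u", "v"))),
        list(a = c(1.9, 2.4), b = factor("v", levels = c("u", "v")))
      ),
      iterations = 2
    )
  ),
  which_imputed = list(a = c(2L, 4L), b = 3L)
)

# Hypothetical helper: fill the i-th imputation's final values back into X.
complete_data <- function(X, fit, i = 1L) {
  res <- fit$results[[i]]
  final <- res$imputed[[length(res$imputed)]]
  for (v in names(fit$which_imputed)) {
    X[[v]][fit$which_imputed[[v]]] <- final[[v]]
  }
  X
}

completed <- complete_data(X, mock_fit, i = 1L)
```

With `n` greater than one, the same helper could be applied to each element of `results` to obtain one completed data set per imputation.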

Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in electronic health record data', presented to _27th International Biometric Conference_, Florence, July 6-11.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025

Stekhoven, D.J. and Buehlmann, P., 2012. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597

Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(i01), pp. 1-17. doi:10.18637/jss.v077.i01

measure_correlation measure_stekhoven_2012 stop_condition no_information_impute sample_impute missForest ranger

stephematician/miForang documentation built on July 23, 2019, 5:11 p.m.
