| missForest | R Documentation |
missForest imputes missing values for mixed-type data (numeric and
categorical). It models complex interactions and nonlinear relations and
returns an out-of-bag (OOB) imputation error estimate. It supports
parallel execution and offers two backends: ranger (default) and
randomForest (legacy/compatibility).
missForest(xmis, maxiter = 10, ntree = 100, variablewise = FALSE,
decreasing = FALSE, verbose = FALSE,
mtry = floor(sqrt(ncol(xmis))), replace = TRUE,
classwt = NULL, cutoff = NULL, strata = NULL,
sampsize = NULL, nodesize = NULL, maxnodes = NULL,
xtrue = NA, parallelize = c("no", "variables", "forests"),
num.threads = NULL, backend = c("ranger", "randomForest"))
xmis |
A data frame or matrix with missing values. Columns are variables,
rows are observations. All columns must be |
maxiter |
Maximum number of iterations unless the stopping criterion is met earlier. |
ntree |
Number of trees to grow in each per-variable forest. |
variablewise |
Logical. If |
decreasing |
Logical. If |
verbose |
Logical. If |
mtry |
Number of candidate variables at each split. Passed to the backend
(randomForest or ranger). Default is |
replace |
Logical. If |
classwt |
List of class priors for the categorical variables. Same list semantics as
in randomForest: one element per variable (set |
cutoff |
List of per-class cutoff vectors for each categorical variable. As in
randomForest, one element per factor variable. With backend
|
strata |
List of (factor) variables used for stratified sampling (legacy randomForest semantics). Ignored by ranger. |
sampsize |
List of sample sizes per variable (legacy randomForest semantics).
With backend |
nodesize |
Minimum node size. A numeric vector of length 2:
first entry for numeric variables, second for
factor variables. Default: |
maxnodes |
Maximum number of terminal nodes per tree. Used with backend
|
xtrue |
Optional complete data matrix for benchmarking. If provided, the
iteration log includes the true imputation error, and the return value
includes it as |
parallelize |
Should
Which choice is faster depends on data shape and backend. |
num.threads |
Integer (or |
backend |
Character. |
Algorithm. The method iteratively imputes each variable with missing
values by fitting a random forest on the observed part of that variable and
the current imputations of all other variables. After each iteration, the
difference between the current and previous imputed matrices is computed
separately for numeric and factor columns. The stopping rule is met once both
differences have increased at least once (or, if only one variable type is
present, once that type's difference has increased). In that case, the
previous imputation (before the increase) is returned. Otherwise, the process
stops after maxiter iterations.
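The per-type difference described above can be sketched in base R. This is only an illustration, not the package's internal code: the helper name imputation_diff is hypothetical, and the two difference measures (relative squared change for numeric columns, share of changed entries for factor columns) are assumed from the missForest paper rather than stated on this page.

```r
# Hypothetical helper (not part of missForest): difference between two
# successive imputed data frames, computed separately by column type.
imputation_diff <- function(ximp_new, ximp_old) {
  is_num <- vapply(ximp_new, is.numeric, logical(1))
  # Numeric columns: squared change relative to the squared new values
  # (assumed measure).
  d_num <- if (any(is_num)) {
    num_new <- as.matrix(ximp_new[is_num])
    num_old <- as.matrix(ximp_old[is_num])
    sum((num_new - num_old)^2) / sum(num_new^2)
  } else NA_real_
  # Factor columns: share of entries whose level changed (assumed measure).
  d_fac <- if (any(!is_num)) {
    changed <- Map(function(a, b) as.character(a) != as.character(b),
                   ximp_new[!is_num], ximp_old[!is_num])
    mean(unlist(changed))
  } else NA_real_
  c(numeric = d_num, factor = d_fac)
}

# Toy data: one numeric entry and one factor entry change between iterations.
old <- data.frame(x = c(1, 2, 3), g = factor(c("a", "b", "a")))
new <- data.frame(x = c(1, 2, 4), g = factor(c("a", "b", "b")))
d <- imputation_diff(new, old)
print(d)
```

Comparing such a vector between consecutive iterations is what lets a stopping rule detect the first increase for each type.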
Backends. With backend = "ranger", arguments are mapped as:
  ntree -> num.trees
  nodesize (numeric/factor) -> min.bucket for regression/classification,
    respectively (defaults used here are c(5, 1))
  sampsize (counts) -> sample.fraction (overall or per-class fractions)
  classwt -> class.weights
  cutoff: emulated via probability forests and post-thresholding
  maxnodes: no direct equivalent in ranger (ignored)
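To make the mapping concrete, here is a minimal base R sketch. The function map_to_ranger_args and its arguments n_obs and is_factor are hypothetical names introduced for illustration; the actual translation inside missForest may differ, and only the overall (not per-class) sampsize case is shown.

```r
# Hypothetical helper (not part of missForest): translate randomForest-style
# arguments into a list of ranger-style arguments for one variable.
map_to_ranger_args <- function(ntree, nodesize = c(5, 1), sampsize = NULL,
                               n_obs, is_factor) {
  args <- list(
    num.trees  = ntree,
    # First nodesize entry for numeric targets, second for factor targets.
    min.bucket = if (is_factor) nodesize[2] else nodesize[1]
  )
  if (!is.null(sampsize)) {
    # A count becomes a fraction of the observations used for fitting.
    args$sample.fraction <- sampsize / n_obs
  }
  args
}

a <- map_to_ranger_args(ntree = 100, sampsize = 75, n_obs = 150,
                        is_factor = FALSE)
str(a)
```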
The reported OOB error uses ranger's $prediction.error
(MSE for numeric, error rate for factors), except when cutoff is used:
in that case, the misclassification rate is computed by applying the cutoffs
to OOB class probabilities.
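The cutoff-based error can be illustrated in base R. This sketch assumes the randomForest convention that the winning class is the one with the largest ratio of class probability to cutoff; the matrix probs stands in for OOB class probabilities and, like truth and cutoff, is made-up illustration data rather than package output.

```r
# Made-up OOB class probabilities for three observations and two classes.
probs <- matrix(c(0.60, 0.40,
                  0.45, 0.55,
                  0.20, 0.80),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("no", "yes")))
cutoff <- c(no = 0.7, yes = 0.3)   # illustrative per-class cutoffs
truth  <- c("no", "yes", "yes")    # illustrative observed classes

# Divide each column by its cutoff, then pick the class with the largest ratio.
ratio <- sweep(probs, 2, cutoff[colnames(probs)], "/")
pred  <- colnames(probs)[max.col(ratio, ties.method = "first")]

# Misclassification rate after thresholding.
err <- mean(pred != truth)
print(pred)
print(err)
```

With these cutoffs the first observation is assigned to "yes" even though "no" has the higher raw probability, which is why the error has to be recomputed instead of reusing the backend's default error rate.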
Parallelization. Two modes are available via parallelize:
  "variables": different variables are imputed in parallel using foreach;
    per-variable ranger calls use num.threads = 1.
  "forests": a single variable’s forest is built using ranger multithreading
    (controlled by num.threads) or, for "randomForest", by combining
    sub-forests via foreach.
Make sure you have registered a parallel backend if you choose a parallel mode.
See the vignette for further examples and discussion.
ximp |
Imputed data matrix (same classes as |
OOBerror |
Estimated OOB imputation error. For numeric variables, the normalized
root mean squared error (NRMSE); for factors, the proportion falsely
classified (PFC). If |
error |
True imputation error (NRMSE/PFC), present only if |
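For readers unfamiliar with the two error measures, the following base R sketch shows how NRMSE and PFC can be computed over the originally missing entries. The function names nrmse_sketch and pfc_sketch are hypothetical, and normalizing by the variance of the true values at the missing positions is an assumption about the definition; see mixError for the package's own implementation.

```r
# Hypothetical helpers (not part of missForest).
# NRMSE over the entries that were missing, normalized by the variance of
# the true values at those entries (assumed definition).
nrmse_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

# PFC: proportion of originally missing entries imputed with the wrong level.
pfc_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp[mis]) != as.character(xtrue[mis]))
}

# Toy numeric variable.
xtrue <- c(1, 2, 3, 4, 5)
xmis  <- c(1, NA, 3, NA, NA)
ximp  <- c(1, 2.5, 3, 4, 4)
num_err <- nrmse_sketch(ximp, xmis, xtrue)

# Toy factor variable.
gtrue <- factor(c("a", "b", "a", "b"))
gmis  <- factor(c("a", NA, NA, "b"))
gimp  <- factor(c("a", "b", "b", "b"))
fac_err <- pfc_sketch(gimp, gmis, gtrue)

print(c(NRMSE = num_err, PFC = fac_err))
```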
Daniel J. Stekhoven [aut, cre]
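Stekhoven, D.J. and Bühlmann, P. (2012), 'MissForest - nonparametric missing
value imputation for mixed-type data', Bioinformatics, 28(1), 112-118.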
mixError, prodNA,
randomForest,
ranger
## Mixed-type imputation on iris:
data(iris)
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
## Default: ranger backend
imp_rg <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_rg$OOBerror
imp_rg$error # requires xtrue
## Legacy behavior: randomForest backend
imp_rf <- missForest(iris.mis, backend = "randomForest", verbose = TRUE)
## Parallel examples (register a backend first, e.g., doParallel):
## Not run:
# library(doParallel)
# registerDoParallel(2)
# imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)
# imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
# num.threads = 2) # used by ranger
## End(Not run)