missForest: Nonparametric Missing Value Imputation using Random Forest

View source: R/missForest.R

Description

'missForest' is used to impute missing values, particularly in the case of mixed-type data. It can be used to impute continuous and/or categorical data including complex interactions and nonlinear relations. It yields an out-of-bag (OOB) imputation error estimate. Moreover, it can be run in parallel to save computation time.

Usage

missForest(xmis, maxiter = 10, ntree = 100, variablewise = FALSE,
                       decreasing = FALSE, verbose = FALSE,
                       mtry = floor(sqrt(ncol(xmis))), replace = TRUE,
                       classwt = NULL, cutoff = NULL, strata = NULL,
                       sampsize = NULL, nodesize = NULL, maxnodes = NULL,
                       xtrue = NA, parallelize = c('no', 'variables', 'forests'))

Arguments

xmis

a data matrix with missing values. The columns correspond to the variables and the rows to the observations.

maxiter

maximum number of iterations to be performed, provided the stopping criterion is not met beforehand.

ntree

number of trees to grow in each forest.

variablewise

logical. If 'TRUE' the OOB error is returned for each variable separately. This can be useful as a reliability check for the imputed variables w.r.t. a subsequent data analysis.

decreasing

logical. If 'FALSE' then the variables are sorted w.r.t. increasing amount of missing entries during computation.

verbose

logical. If 'TRUE' the user is supplied with additional output between iterations, i.e., the estimated imputation error, the runtime and, if a complete data matrix is supplied, the true imputation error. See 'xtrue'.

mtry

number of variables randomly sampled at each split. This argument is directly supplied to the 'randomForest' function. Note that the default value is sqrt(p) for both categorical and continuous variables where p is the number of variables in 'xmis'.

replace

logical. If 'TRUE' bootstrap sampling (with replacement) is performed, else subsampling (without replacement).

classwt

list of priors of the classes in the categorical variables. This is equivalent to the randomForest argument; however, the user has to set the priors for all categorical variables in the data set (for continuous variables set it to 'NULL').

cutoff

list of class cutoffs for each categorical variable. Same as with 'classwt' (for continuous variables set it to '1').

strata

list of (factor) variables used for stratified sampling. Same as with 'classwt' (for continuous variables set it to 'NULL').

sampsize

list of size(s) of sample to draw. This is equivalent to the randomForest argument; however, the user has to set the sizes for all variables.

nodesize

minimum size of terminal nodes. Has to be a vector of length 2, with the first entry being the number for continuous variables and the second entry the number for categorical variables. Default is 1 for continuous and 5 for categorical variables.

maxnodes

maximum number of terminal nodes for trees in the forest.

xtrue

optional. Complete data matrix. This can be supplied to test the performance. Upon providing the complete data matrix 'verbose' will show the true imputation error after each iteration and the output will also contain the final true imputation error.

parallelize

whether 'missForest' should be run in parallel. Default is 'no'. If 'variables' the data is split into pieces of the size equal to the number of cores registered in the parallel backend. If 'forests' the total number of trees in each random forest is split in the same way. Whether 'variables' or 'forests' is more suitable depends on the data. See Details.

Details

After each iteration the difference between the previous and the new imputed data matrix is assessed for the continuous and categorical parts. The stopping criterion is defined such that the imputation process is stopped as soon as both differences have become larger once. In case of only one type of variable the computation stops as soon as the corresponding difference goes up for the first time. However, the imputation last performed, where both differences went up, is generally less accurate than the previous one. Therefore, whenever the computation stops due to the stopping criterion (and not due to 'maxiter') the second-to-last imputation matrix is returned.
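The stopping rule can be illustrated with a small base R sketch. This is not the package's internal code: the helper name 'keep_iterating' is hypothetical, and the rule is read here as "continue while at least one of the two differences still decreases", which is an assumption about what "become larger" means when a difference stays at zero. The differences are copied from the example output further down this page.

```r
## Hypothetical helper (not part of missForest): keep iterating while at
## least one difference still decreases and 'maxiter' is not reached.
keep_iterating <- function(diff_new, diff_old, iter, maxiter) {
  any(diff_new < diff_old) && iter < maxiter
}

## Differences (continuous, categorical) per iteration, copied from the
## example output on this page.
diffs <- rbind(c(1.449533e-02, 0.1533333),
               c(9.387853e-05, 0),
               c(6.271654e-05, 0),
               c(3.022750e-05, 0),
               c(4.508345e-05, 0))

diff_old <- c(Inf, Inf)   # no previous imputation before iteration 1
maxiter  <- 10

for (iter in seq_len(nrow(diffs))) {
  diff_new <- diffs[iter, ]
  if (!keep_iterating(diff_new, diff_old, iter, maxiter)) break
  diff_old <- diff_new
}

stopped_at <- iter
## Stopped by the criterion: return the previous imputation.
## Stopped by 'maxiter': return the last one.
returned_iteration <- if (iter < maxiter) iter - 1 else iter

stopped_at          # 5
returned_iteration  # 4
```

Under this reading the run stops at iteration 5 and the result of iteration 4 is returned, which agrees with the errors reported at the end of the example output.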

The normalized root mean squared error (NRMSE) is defined as:

NRMSE = \sqrt{\frac{mean((X_{true} - X_{imp})^2)}{var(X_{true})}}

where X_{true} is the complete data matrix, X_{imp} is the imputed data matrix, and 'mean'/'var' are used as short notation for the empirical mean and variance computed over the continuous missing values only.

The proportion of falsely classified entries (PFC) is likewise computed over the missing values only, in this case those of the categorical variables.
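As a worked illustration of the two error measures, the following base R sketch computes them on toy vectors. The helper names 'nrmse' and 'pfc' and the toy data are hypothetical and only mirror the definitions given here; note that R's 'var' uses the n - 1 denominator.

```r
## Hypothetical helpers following the definitions above. 'mis' is a
## logical vector marking the entries that were missing before imputation.
nrmse <- function(x_imp, x_true, mis) {
  sqrt(mean((x_true[mis] - x_imp[mis])^2) / var(x_true[mis]))
}

pfc <- function(x_imp, x_true, mis) {
  mean(as.character(x_imp[mis]) != as.character(x_true[mis]))
}

## Continuous toy variable: entries 2, 4 and 6 were missing.
x_true <- c(1, 2,   3, 4,   5, 6)
x_imp  <- c(1, 2.5, 3, 3.5, 5, 6.5)
x_mis  <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
nrmse(x_imp, x_true, x_mis)   # 0.25

## Categorical toy variable: entries 2, 3 and 4 were missing,
## one of them is imputed wrongly.
f_true <- factor(c("a", "b", "a", "c"))
f_imp  <- factor(c("a", "b", "b", "c"))
f_mis  <- c(FALSE, TRUE, TRUE, TRUE)
pfc(f_imp, f_true, f_mis)     # 1/3
```

For the continuous variable the mean squared error over the missing entries is 0.25 and the variance of the true values (2, 4, 6) is 4, giving sqrt(0.25 / 4) = 0.25.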

For feasibility reasons 'ntree', 'mtry', 'nodesize' and 'maxnodes' can be chosen smaller. The number of trees can be chosen fairly small because many forests are grown (e.g. p forests in each iteration), so all observations get predicted a few times. The runtime scales linearly with 'ntree'. In case of high-dimensional data we recommend using a small 'mtry' (e.g. 100 should work) to obtain an appropriate imputation result within a feasible amount of time.

Using an appropriate backend 'missForest' can be run in parallel. There are two possible ways to do this. One way is to create the random forest object in parallel (parallelize = "forests"). This is most useful if a single forest object takes long to compute and there are not many variables in the data. The second way is to compute multiple random forest classifiers in parallel on different variables (parallelize = "variables"). This is most useful if the data contains many variables and computing the random forests is not taking too long. For details on how to register a parallel backend see for instance the documentation of 'doParallel'.
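A minimal sketch of a parallel run, assuming the 'doParallel' package is installed and two cores are available (both are assumptions, as is the choice of 'forests' for this small data set). The guard with 'requireNamespace' only makes the snippet safe to run where the packages are missing.

```r
if (requireNamespace("missForest", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  library(missForest)
  library(doParallel)

  ## Register a parallel backend; two cores is an assumption.
  registerDoParallel(cores = 2)

  ## Artificially produce missing values as in the Examples section.
  set.seed(81)
  iris.mis <- prodNA(iris, noNA = 0.2)

  ## Split the trees of each forest across the registered workers.
  iris.imp <- missForest(iris.mis, ntree = 100, parallelize = "forests")
  print(iris.imp$OOBerror)

  ## Release the workers started implicitly by registerDoParallel().
  stopImplicitCluster()
}
```

With many variables and cheap forests, parallelize = "variables" may be the better choice, as described above.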

See the vignette for further examples on how to use missForest.

I thank Steve Weston for his input regarding parallel computation of 'missForest'.

Value

ximp

imputed data matrix of same type as 'xmis'.

OOBerror

estimated OOB imputation error. For the set of continuous variables in 'xmis' the NRMSE and for the set of categorical variables the proportion of falsely classified entries is returned. See Details for the exact definition of these error measures. If 'variablewise' is set to 'TRUE' then this will be a vector of length 'p' where 'p' is the number of variables and the entries will be the OOB error for each variable separately.

error

true imputation error. This is only available if 'xtrue' was supplied. The error measures are the same as for 'OOBerror'.

Author(s)

Daniel J. Stekhoven, <[email protected]>

References

Stekhoven, D.J. and Buehlmann, P. (2012), 'MissForest - nonparametric missing value imputation for mixed-type data', Bioinformatics, 28(1), 112-118, doi: 10.1093/bioinformatics/btr597

See Also

mixError, prodNA, randomForest

Examples

library(missForest)

## Nonparametric missing value imputation on mixed-type data:
data(iris)
summary(iris)

## The data contains four continuous and one categorical variable.

## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)

## Impute missing values providing the complete matrix for
## illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)

## In the run shown under 'Example output' the imputation is finished
## after five iterations having a final true NRMSE of 0.159 and a PFC
## of 0.036. The estimated final NRMSE is 0.142 and the PFC is 0.049
## (see Details for the reason taking iteration 4 instead of iteration
## 5 as final value). Exact values may differ between R and package
## versions.

## The final results can be accessed directly. The estimated error:
iris.imp$OOBerror

## The true imputation error (if available):
iris.imp$error

## And of course the imputed data matrix (do not run this):
## iris.imp$ximp

Example output

Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: foreach
Loading required package: itertools
Loading required package: iterators
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.100   Min.   :0.100  
 1st Qu.:5.200   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.450   Median :1.300  
 Mean   :5.878   Mean   :3.062   Mean   :3.905   Mean   :1.222  
 3rd Qu.:6.475   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.900  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
 NA's   :28      NA's   :29      NA's   :32      NA's   :33     
       Species  
 setosa    :40  
 versicolor:38  
 virginica :44  
 NA's      :28  
                
                
                
  missForest iteration 1 in progress...done!
    error(s): 0.1512033 0.03571429 
    estimated error(s): 0.1541084 0.04098361 
    difference(s): 0.01449533 0.1533333 
    time: 0.599 seconds

  missForest iteration 2 in progress...done!
    error(s): 0.1482248 0.03571429 
    estimated error(s): 0.1402145 0.03278689 
    difference(s): 9.387853e-05 0 
    time: 0.115 seconds

  missForest iteration 3 in progress...done!
    error(s): 0.1567693 0.03571429 
    estimated error(s): 0.1384038 0.04098361 
    difference(s): 6.271654e-05 0 
    time: 0.093 seconds

  missForest iteration 4 in progress...done!
    error(s): 0.1586195 0.03571429 
    estimated error(s): 0.1419132 0.04918033 
    difference(s): 3.02275e-05 0 
    time: 0.093 seconds

  missForest iteration 5 in progress...done!
    error(s): 0.1574789 0.03571429 
    estimated error(s): 0.1397179 0.04098361 
    difference(s): 4.508345e-05 0 
    time: 0.092 seconds

     NRMSE        PFC 
0.14191324 0.04918033 
     NRMSE        PFC 
0.15861953 0.03571429 

missForest documentation built on May 1, 2019, 8 p.m.