
imputeForest

Missing value imputation model based on randomForest and missForest

This package implements a model-based approach to missing value imputation, using the random forest algorithm as the base learner. Missing value imputation is an important component of many data science analyses. The objective of imputation is not to gain or generate data but to avoid data loss.

With this package the analyst can create an imputeForest object (a model) that handles missing values as a pre-processing step and can also be used to predict, or impute, missing values in new data. This is accomplished by growing a supervised or unsupervised forest and using it to predict the proximity of a target instance to previously analyzed instances. The resulting proximity vector or matrix is then used to impute missing values as described by Leo Breiman.
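The proximity-weighted step can be sketched as follows. This is a minimal illustration in Python, not the package's R code or API: the function name `impute_from_proximity` is hypothetical, missing cells are assumed to be encoded as `None`, categorical values are assumed to be strings, and the proximity matrix is assumed to be given (in a random forest it is typically the fraction of trees in which two instances fall in the same terminal node). Numeric cells receive the proximity-weighted mean of the observed values, and categorical cells receive the category with the largest proximity-weighted frequency.

```python
from collections import defaultdict


def impute_from_proximity(rows, proximity):
    """Fill None cells from the other rows, weighting each donor by its proximity.

    rows:      list of rows; numeric cells are numbers, categorical cells are str,
               missing cells are None (an encoding assumed for this sketch).
    proximity: square matrix, proximity[i][k] is the similarity of rows i and k.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have feature j observed.
            donors = [(proximity[i][k], rows[k][j])
                      for k in range(len(rows))
                      if k != i and rows[k][j] is not None]
            total = sum(weight for weight, _ in donors)
            if total == 0:
                continue  # no informative donor: leave the cell missing
            if isinstance(donors[0][1], str):
                # Categorical: category with the largest summed proximity.
                votes = defaultdict(float)
                for weight, category in donors:
                    votes[category] += weight
                filled[i][j] = max(votes, key=votes.get)
            else:
                # Numeric: proximity-weighted mean of the observed values.
                filled[i][j] = sum(w * v for w, v in donors) / total
    return filled
```

In practice this step is usually iterated: the forest is regrown on the filled data, the proximities are recomputed, and the missing cells are re-estimated.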

The initial step of a proximity-based missing value reconstruction is to replace missing values with the mode (categorical features) or median (numeric features) of the observed values of each feature. In imputeForest the initial estimate of missing values can be obtained this way, at random (by sampling from each feature's marginal distribution), or by first calling the missForest algorithm. The last option can be seen as an extension of the missForest method that allows missing values to be predicted through a hybrid proximity approach: missForest first imputes the missing values, and a proximity-based approach then predicts them, treating the missForest estimates as ground truth.
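The first two initialization strategies can be sketched for a single feature as shown here. This is a hedged Python illustration under stated assumptions, not the package's interface: the function name `initial_fill` and the strategy labels `"central"` and `"random"` are hypothetical, missing values are assumed to be `None`, and categorical values are assumed to be strings. The missForest-based initialization is not shown.

```python
import random
import statistics


def initial_fill(column, strategy="central", rng=None):
    """Return a copy of `column` with None cells replaced by a first guess.

    strategy="central": median for numeric features, mode for categorical ones.
    strategy="random":  draw each missing cell from the observed values, which
                        samples the feature's empirical marginal distribution.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("feature has no observed values to learn from")
    if strategy == "central":
        if isinstance(observed[0], str):
            guess = statistics.mode(observed)
        else:
            guess = statistics.median(observed)
        return [guess if v is None else v for v in column]
    if strategy == "random":
        rng = rng or random.Random()
        return [rng.choice(observed) if v is None else v for v in column]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The median/mode fill is deterministic but shrinks the spread of the feature, while the random fill preserves the marginal distribution at the cost of run-to-run variation. Either is only a starting point that the proximity-based step then refines.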

Statistical validation is performed with cross-validation under the missing completely at random (MCAR) assumption.
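One common way to validate an imputer under MCAR is to hide observed cells uniformly at random, impute them, and compare the estimates with the hidden truth. The sketch below illustrates that idea in Python for numeric data only. It is an assumption about the general scheme, not a description of the package's validation routine: the names `mask_mcar` and `mcar_rmse`, the `None` encoding, and the `impute` callable are all hypothetical.

```python
import math
import random


def mask_mcar(rows, fraction, rng):
    """Hide a fraction of the observed cells, chosen uniformly at random (MCAR)."""
    observed = [(i, j) for i, row in enumerate(rows)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, round(fraction * len(observed))))
    masked = [list(row) for row in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def mcar_rmse(rows, impute, fraction=0.2, repeats=5, seed=0):
    """Average RMSE of `impute` on artificially hidden numeric cells.

    impute: any callable taking a table with None cells and returning a
            completed table of the same shape (hypothetical interface).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        masked, hidden = mask_mcar(rows, fraction, rng)
        completed = impute(masked)
        squared = [(completed[i][j] - rows[i][j]) ** 2 for i, j in hidden]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(scores) / len(scores)
```

Because the cells are hidden independently of their values, the resulting error estimate is only meaningful if the real missingness mechanism is also MCAR.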

The package is being actively developed for handling missing values in the context of forensic age estimation from human skeletal remains and other anthropological tasks.



dsnavega/imputeForest documentation built on May 8, 2019, 2:43 p.m.