miceRanger: Multiple Imputation by Chained Equations with Random Forests

All scripts to create the following charts can be found in scripts/.

Using artificial data, the time and performance of miceRanger, mice (method = "rf") and missForest were recorded. parlmice was used to run mice in parallel, and a parallel back end was set up for missForest. All runs used 5 cores. miceRangerPar refers to miceRanger being run with parallel = TRUE.

Scripts used:

mediumData.R
smallData.R

Data used was 10 columns, 7 numeric and 3 factors. Time recorded is in seconds.

Script used:

rowXtrees.R

Existing models can be recycled to impute new datasets with the impute function. This ends up being much faster, and (nearly) just as accurate as re-running MICE. Three charts are shown below - the first shows the accuracy of imputing data with existing models vs. re-running MICE. The second shows the performance achieved on the iris dataset for different data missingness levels, and the third shows the speed advantages of imputing new data with existing models vs re-running MICE.

In the following chart, a dataset with 15 variables (a-j numeric, k-p categorical) and 51200 rows was imputed using miceRanger. A different dataset with the same dimensions, but different data, was then imputed using the models created with miceRanger. Performance is nearly the same between the two. However, as noted below, the impute function takes much less time.

Script used:

imputeNew_Time.R

This chart was created to show the behavior of imputation performance when new data is imputed with different levels of missingness. The chart above consisted of the following procedure:

1) A miceDefs object is created with 50 imputed datasets on a dataset with 25% missing values. 2) All 50 datasets are used to impute 9 new datasets, each with different levels of missingness 3) The performance (R-squared for numerics, accuracy for categoricals) is recorded for all 450 datasets

Script used:

imputeNew.R

Time to Completion of `miceRanger` vs `impute`

Using the impute function is always faster than running MICE on the same dataset. However, in most use cases, the dataset used to create a miceDefs object will be larger than the running that will be imputed with the impute function. There is overhead associated with calling the predict S3 method in ranger related to this issue.

Imputation accuracy benchmarking with MAR, MCAR, MNAR, skewed, multimodal data.

FarrellDay/miceRanger documentation built on May 28, 2022, 10:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

FarrellDay/miceRanger
Multiple Imputation by Chained Equations with Random Forests

benchmarks/README.md
In FarrellDay/miceRanger: Multiple Imputation by Chained Equations with Random Forests

Benchmarking miceRanger

Timing - Small and Medium Data

Timing - Rows x Trees

Performance - Imputing New Data

Accuracy of Imputations from Existing Models vs New Models

Accuracy for different Missingness of data:

Time to Completion of `miceRanger` vs `impute`

To Do

R Package Documentation

Browse R Packages

We want your feedback!

FarrellDay/miceRanger Multiple Imputation by Chained Equations with Random Forests

benchmarks/README.md In FarrellDay/miceRanger: Multiple Imputation by Chained Equations with Random Forests

Benchmarking miceRanger

Timing - Small and Medium Data

Timing - Rows x Trees

Performance - Imputing New Data

Accuracy of Imputations from Existing Models vs New Models

Accuracy for different Missingness of data:

Time to Completion of miceRanger vs impute

To Do

R Package Documentation

Browse R Packages

We want your feedback!

FarrellDay/miceRanger
Multiple Imputation by Chained Equations with Random Forests

benchmarks/README.md
In FarrellDay/miceRanger: Multiple Imputation by Chained Equations with Random Forests

Time to Completion of `miceRanger` vs `impute`