hdImpute: A Batch Process for High Dimensional Imputation

This brief vignette walks users through the hdImpute package at a high level, covering:

- The `hdImpute` process in individual stages (correlation matrix, flatten and rank, and impute and join)
- The `hdImpute` process in a single stage via the `hdImpute()` function, which takes minimal arguments and performs all three steps in a single call

The first approach gives users more flexibility in preprocessing and in the `hdImpute` process itself (e.g., storing objects as they are created, or timing and benchmarking each stage). The second approach is slightly less flexible, but more intuitive: users pass the raw data object (`data`) and the batch size (`batch`) to the `hdImpute()` function, which runs all of the stages from the first approach in a single call.

## Approach 1: Individual stages

For the stage-based approach, there are three core functions users must use:

- `feature_cor()`: creates the correlation matrix. Note: depending on the size and dimensionality of the data, as well as the speed of the machine, this preprocessing step can take some time. For example, in an earlier testing run, a simulated data frame of size 20000 $\times$ 3000 had a runtime of roughly 4.25 hours, while a smaller data frame of size 2519 $\times$ 1558 had a runtime of roughly 1 minute on an AWS EC2 instance with 32 cores.
- `flatten_mat()`: flattens the correlation matrix from the previous stage and ranks the features by absolute correlation. The input to `flatten_mat()` should therefore be the stored output from `feature_cor()`.
- `impute_batches()`: creates batches based on the feature rankings from `flatten_mat()`, imputes missing values batch by batch until all batches are completed, and then joins the batches to return a completed, imputed data set.
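To make the three stages more concrete, here is a minimal base R sketch of the general idea on a small numeric data frame. It is an illustration under stated assumptions, not the package's internals: the object names (`toy`, `cor_mat`, `flat`, `ranked`, `batches`) are hypothetical, and it uses `cor()` on numeric columns only, whereas `feature_cor()` may treat mixed data types differently.

```r
# Hypothetical toy data: numeric columns only, with a few missing values
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                  b = c(2, 4, 6, 8, 10, NA),
                  c = c(6, 3, 9, 4, 4, 6),
                  d = c(1, NA, 2, 2, 3, 3))

# Stage 1: pairwise correlation matrix, using the rows available for each pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Stage 2: flatten the upper triangle into long format and rank by |r|
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
flat <- data.frame(x = rownames(cor_mat)[idx[, "row"]],
                   y = colnames(cor_mat)[idx[, "col"]],
                   r = cor_mat[idx])
flat <- flat[order(abs(flat$r), decreasing = TRUE), ]

# Ordered feature list: first appearance of each feature in the ranked pairs
ranked <- unique(as.vector(t(as.matrix(flat[, c("x", "y")]))))

# Stage 3 (batching only, an assumed scheme): consecutive batches of size 2
batch <- 2
batches <- split(ranked, ceiling(seq_along(ranked) / batch))
```

In this sketch, strongly correlated features end up in the same batch, which is the intuition behind ranking before batching.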

Consider a basic example.

First, load the library along with the tidyverse library for some additional helpers.

```{r setup, warning = FALSE, message = FALSE}
library(hdImpute)
library(tidyverse)
```


Next, set up the data and introduce missingness completely at random (MCAR) via the `prodNA()` function from the `missForest` package. 

```{r data}
d <- data.frame(X1 = c(1:6), 
                X2 = c(rep("A", 3), 
                       rep("B", 3)), 
                X3 = c(3:8),
                X4 = c(5:10),
                X5 = c(rep("A", 3), 
                       rep("B", 3)), 
                X6 = c(6,3,9,4,4,6))

set.seed(1234)

data <- missForest::prodNA(d, noNA = 0.30) %>% 
  as_tibble()
```

Note: This is a tiny sample set, but hopefully the usage is clear enough.
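For readers who want to see what "missing completely at random" means mechanically, the following base R sketch blanks out a fixed share of cells chosen uniformly at random. It is a simplified stand-in written for illustration, not the implementation of `prodNA()`; the helper name `make_mcar()` and the object names are hypothetical.

```r
# Hypothetical helper: set a proportion `prop` of all cells to NA,
# choosing cells uniformly at random (independent of the data values)
make_mcar <- function(df, prop) {
  n_cells <- nrow(df) * ncol(df)
  n_miss <- floor(prop * n_cells)
  cells <- sample(n_cells, n_miss)        # linear cell indices, no repeats
  rows <- (cells - 1) %% nrow(df) + 1     # convert to row positions
  cols <- (cells - 1) %/% nrow(df) + 1    # convert to column positions
  for (i in seq_along(cells)) {
    df[rows[i], cols[i]] <- NA
  }
  df
}

toy <- data.frame(X1 = 1:6,
                  X2 = rep(c("A", "B"), each = 3),
                  X6 = c(6, 3, 9, 4, 4, 6))

set.seed(1234)
toy_na <- make_mcar(toy, prop = 0.30)

# Share of missing cells overall, and count of missing cells per column
mean(is.na(toy_na))
colSums(is.na(toy_na))
```

Because the cells are drawn without reference to the values they hold, the missingness pattern carries no information about the data, which is the defining property of MCAR.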

Next, follow each stage mentioned above, starting with building the correlation matrix. Of note, `feature_cor()` has an additional logical argument, `return_cor`. The default is `FALSE`; if `TRUE`, the output is stored as usual and the correlation matrix is also printed to the console. For illustrative purposes, I set `return_cor = TRUE`.

```{r corr}
all_cor <- feature_cor(data, return_cor = TRUE)
```


Next, flatten the matrix and order features. Similarly, `flatten_mat()` has an optional argument `return_mat`, which by default is set to `FALSE`. If `TRUE`, it prints the ranked features based on the correlation matrix. Here again, I set it to `TRUE`. 

```{r flatten}
flat_mat <- flatten_mat(all_cor,
                        return_mat = TRUE)
```

Finally, impute batch by batch, then join and return the completed data set. There are several additional arguments because the imputation model is chained random forests (built on top of `missRanger`, which is built on top of `missForest`). Of note, the `pmm_k` and `n_trees` arguments let the user specify the number of neighbors to search and the number of trees to use in building the random forests, respectively. The default values in `impute_batches()` are 5 and 15, respectively; see the `missRanger` documentation for more on these. Another argument, `save`, writes an `.RDS` file containing the list of imputed batches to the working directory when set to `TRUE`. The default is `FALSE`.

Ultimately, users need only pass the original/raw data object (`data`), the ranked features (`features`) from `flatten_mat()`, and the batch size (`batch`) to the `impute_batches()` function. The output is the completed, imputed data set.
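The batch-by-batch logic described above can be sketched in base R as follows. This is a rough illustration, not the package's code: the imputer here is a deliberately simple stand-in (median for numeric columns, most frequent value otherwise) in place of chained random forests, the batching scheme is an assumption, and all names (`impute_simple()`, `toy`, `ranked`, `completed`) are hypothetical.

```r
# Hypothetical stand-in imputer: median for numeric columns, most frequent
# value otherwise. The package uses chained random forests instead.
impute_simple <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (!any(miss)) next
    if (is.numeric(df[[col]])) {
      df[[col]][miss] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])
      df[[col]][miss] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy <- data.frame(X1 = c(1, NA, 3, 4, 5, 6),
                  X2 = c("A", "A", NA, "B", "B", "B"),
                  X3 = c(3, 4, 5, NA, 7, 8),
                  X4 = c(5, 6, 7, 8, 9, NA))

# Assume the features have already been ranked (this order is made up)
ranked <- c("X1", "X3", "X4", "X2")
batch <- 2

# Split the ranked features into consecutive batches of size `batch`
batches <- split(ranked, ceiling(seq_along(ranked) / batch))

# Impute each batch separately, then join the pieces column-wise
imputed_batches <- lapply(batches, function(cols) {
  impute_simple(toy[, cols, drop = FALSE])
})
completed <- do.call(cbind, unname(imputed_batches))
completed <- completed[, names(toy)]  # restore the original column order
```

Each batch is imputed using only its own columns, so the cost of any single model fit depends on the batch size rather than on the full width of the data.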

```{r impute}
imputed1 <- impute_batches(data = data,
                           features = flat_mat,
                           batch = 2)
```


```{r out}
imputed1
```

Compare to our synthetic data object:

```{r comp}
data
```


## Approach 2: Single call 

The alternative to the individual stages approach is a single call to a single function, `hdImpute()`, which is slightly less flexible but simpler. Users need only pass the raw data object (`data`, which must have at least one missing value) and the batch size (`batch`). The returned output is the same as that of `impute_batches()` in the individual stages approach: a complete, imputed data set. Users can override the default argument values (e.g., `pmm_k`, `n_trees`) if desired, but the function works properly without doing so.
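Since the input must contain at least one missing value, it can help to check the data before the call and to check the result afterwards. The sketch below is one way to do that in base R; the helper `check_imputation()` is hypothetical (not part of `hdImpute`), and the `imputed` object is a made-up result used only to exercise the helper.

```r
# Hypothetical helper: summarise what an imputation run changed
check_imputation <- function(original, imputed) {
  # Compare column by column, looking only at originally observed cells
  same <- mapply(function(o, i) all(o[!is.na(o)] == i[!is.na(o)]),
                 original, imputed)
  list(
    had_missing = anyNA(original),        # was there anything to impute?
    still_missing = sum(is.na(imputed)),  # cells left unfilled
    observed_unchanged = all(same)        # observed values left untouched?
  )
}

original <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
imputed <- data.frame(x = c(1, 2, 3), y = c("a", "b", "b"))  # made-up result

res <- check_imputation(original, imputed)
```

The same helper could be applied to the original data and the imputed output from either approach.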

```{r full}
imputed2 <- hdImpute(data = data,
                     batch = 2)
```

```{r full_out}
imputed2
```

This software is being actively developed, with many more features to come. Wide engagement and collaboration are welcome! Here's a sampling of how to contribute: