hdImpute: A Batch Process for High Dimensional Imputation

A common part of any missing data / imputation project is checking for missingness. This is often done prior to modeling, during fitting in some cases, and after imputation to ensure missingness is sufficiently addressed. In many cases, missingness can look like it's dealt with form one view, but still be present from another. Regardless of why, there exists the need for constantly checking for missingness, and where it exists in the data throughout the process.

The latest release of hdImpute includes two helper functions to check for missingness ("NA") at the column and row levels. The functions are:

check_feature_na(): find features with (specified amount of) missingness
check_row_na(): find number of and which rows contain any missingness

These helpers aren't complex in architecture. But rather, they are intended to provide the practical information needed to assess missingness. That is, rather than add up all missing cases, or find the mean missingness in a column (both of which are still helpful), these helpers are intended to help the user find and assess by returning the amount of missingness and also where it exists in the data.

The first column-wise function, check_feature_na(), takes two inputs: data and threshold. The data is simply the dataframe under consideration (the "original" data). The threshold is the level of missingness in a given column/feature as a proportion bounded between 0 and 1. Default set to sensitive level at 1e-04. That is, how much missingness do you want to locate in a given column? Return the column name and the rate of missingness for the column, for all columns that are at the supplied threshold or greater.

The second row-wise function, check_row_na(), is similar in scope but focuses at the case / observation level. This function also takes two inputs: data and which. As before, data is the raw, original data with missingness. which is optional, and logical. Its default is set to FALSE. But if TRUE, check_row_na() then returns a list with the row numbers (indices) corresponding to each row with missingness. So, if which = FALSE (the default), then the return is simply the number of rows in data with any missingness. But if which = TRUE, then the return is a tibble containing the number of rows in data with any missingness, and a list of which rows/row numbers contain missingness.

First, load the library along with the tidyverse library for some additional helpers in setting up the sample data space.

```{r setup} library(hdImpute) library(tidyverse)


Next, set up the data and introduce missingness completely at random (MCAR) via the `prodNA()` function from the `missForest` package. Take a look at the synthetic data with missingness introduced.

```{r data}
d <- data.frame(X1 = c(1:6), 
                X2 = c(rep("A", 3), 
                       rep("B", 3)), 
                X3 = c(3:8),
                X4 = c(5:10),
                X5 = c(rep("A", 3), 
                       rep("B", 3)), 
                X6 = c(6,3,9,4,4,6))

set.seed(1234)

data <- missForest::prodNA(d, noNA = 0.30) %>% 
  as_tibble()

data

Note: This is a tiny sample set, but hopefully the usage is clear enough.

`check_feature_na()`

First, let's take a look at a few variations of the column-wise checks. The default behavior is with threshold = 1e-04, functionally asking to return a column with even a tiny amount of missingness (0.0001):

```{r col1} check_feature_na(data = data)


Now, consider an adjustment to the missingness threshold, setting `threshold = 0.5` (or 50%), which is a much less sensitive check:

```{r col2}
check_feature_na(data = data,
         threshold = 0.5)

And finally, let's verify our check is working properly by looking at the original data with no missingness. This should return 0 columns, as we set up the original data (d) to be complete:

```{r col3} check_feature_na(d)


All looks as it should.

### `check_row_na()`

Now, let's take a look at row-wise missingness checks. 

First, how many rows have missingness of any amount? (the default behavior)

```{r row1}
check_row_na(data)

Next, how many rows have missingness of any amount and which rows are they?

```{r row2} check_row_na(data, which = TRUE)


There's the list, but what are the indices? We need to `unnest()` the list to see: 

```{r row3}
check_row_na(data, which = TRUE)[2] %>%
  unnest(cols = c(which_rows))

And finally, as before, what about the original data with no missingness? (should be 0)

{r row4} check_row_na(d)

Hopefully the value of these simple helpers is clear, allowing for iterative NA checking throughout the imputation process.

This software is being actively developed, with many more features to come. Wide engagement with it and collaboration is welcomed! Here's a sampling of how to contribute: