knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Overview

This vignette shows you how to create a brew object, a container for multiple imputed datasets and the primary focus of the ipa package.

Why use ipa?

The ipa package helps create multiple imputed datasets for prediction models. ipa offers classical imputation functions, such as nearest neighbor imputation, in addition to more scalable functions, e.g. the softImpute algorithm. ipa also lets you assess imputed datasets in several ways, e.g. with the sip and chug functions described below.

True to its name, ipa has a workflow based around the process of brewing. The main steps involved are

  1. brew: create a container to hold your imputations

  2. spice: set primary imputation parameters

  3. mash: set secondary imputation parameters

  4. stir: fit imputation model(s) to training data

  5. ferment: apply imputation models to testing data

  6. bottle: convert imputed values to imputed datasets (tibble or matrix)

  7. sip: evaluate the accuracy of imputed values for each variable separately (requires observed data)

  8. chug: evaluate the accuracy of downstream prediction models

  9. distill: reduce multiple imputed sets into one, based on accuracy from sip or chug

  10. flight: create a set of multiply imputed data optimized for prediction accuracy (in development)

As you read this list, you may think 'wow - that sounds like a dumb idea for data analysis', and I get it - whimsical programming interfaces aren't for everyone. That is why ipa also provides a family of impute functions that can do at least as much as the brew functions.

Diabetes data

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data originally described 1046 participants who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors for African Americans in central Virginia. A total of 403 participants were screened for diabetes, which was defined as glycosylated hemoglobin > 7.0. Among those 403 participants, 366 provided complete data for the variables in these data.

Problem description

We'll work with the complete (diab_complete) and incomplete (diab_missing) versions of the diabetes data described above. diab_missing has about 40% missing values in each column except for diabetes status, which is complete. Missing values occur completely at random. We'll develop a prediction model to classify whether someone has prevalent type II diabetes based on clinical predictors.

library(ipa)
library(dplyr)
library(purrr)
library(yardstick)

# data for this vignette
data("diabetes", package = 'ipa')
# will only use a subset of columns for simplicity
keep_cols <- c('diabetes', 'chol', 'hdl', 'age', 'gender', 'frame', 'waist')
# only a subset of rows have >= 2 observed values for these columns
keep_rows <- apply(
    X = diabetes$missing[, keep_cols],
    MARGIN = 1,
    FUN = function(x) sum(!is.na(x)) > 1
)
# diabetes is a list of complete / incomplete data,
# so we apply the changes to both with map
diabetes <- map(diabetes, ~.x[keep_rows, keep_cols])
diab_complete <- diabetes$complete
diab_missing <- diabetes$missing

A common task for prediction model development is handling missing values. ipa is designed to help analysts consider more options as they engage with this task. This vignette shows some of ipa's features in the context of imputing missing values in the diabetes data. Here is a glimpse of the complete version of the data:

glimpse(diab_complete)

And here is the amputed data, missing roughly 40% of the original values.

glimpse(diab_missing)
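
If you want to verify the missingness rate for yourself, a quick base R check of the column-wise proportion of missing values works well (this snippet is just a sanity check and is not part of the ipa workflow):

# proportion of missing values per column; roughly 0.4 for every
# predictor, and 0 for diabetes, which is complete
colMeans(is.na(diab_missing))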

We'll work with a training set and testing set in this example:

# random seed for reproducing these results
set.seed(730)
train_index <- sample(nrow(diab_missing), 200)

training <- diab_missing[train_index, ]
testing  <- diab_missing[-train_index, ]
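
As another quick sanity check (not required for the workflow), the split sizes can be confirmed directly:

# number of rows used for training and testing, respectively
nrow(training)
nrow(testing)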

Brew

ipa_brew objects are created using brew_<flavor> functions. These objects are lists containing

  1. data: training data with missing values
  2. pars: tuning parameters for imputation
  3. lims: upper and lower bounds for pars
  4. wort: a container for imputed data (initially empty)

Here we use brew_nbrs to make a brew that will use k-nearest neighbors to impute missing values.

brew_init <- brew_nbrs(data = training, outcome = diabetes)

Printing your brew (at this stage) will show the original data with missing values, along with some additional information about the brew's flavor.

brew_init
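
Because an ipa_brew is an ordinary list, you can also peek at its components directly; the names below correspond to the four components listed above:

# component names of the brew (data, pars, lims, and wort, as described above)
names(brew_init)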

Verbose brewing

It is often helpful (especially when starting to use ipa) to receive ongoing messages that tell you what is going on with your brew. At any point in your brewing process, you can set the verbosity of your brew using verbose_on() and verbose_off(). For example,

brew_init <- verbose_on(brew_init, level = 1)
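
The complementary call works the same way if you'd rather keep things quiet later on; the line below only demonstrates it and is not assigned, so brew_init stays verbose:

# switch messages back off (assign the result if you want to keep the quiet brew)
verbose_off(brew_init)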

Spice

spice lets you control the primary parameters for imputing data with your brew. By primary parameters, we mean the parameters that govern how many imputed datasets are created. For nearest neighbor imputation, the primary tuning parameter is the number of nearest neighbors.

The spice function has an argument called with that depends on your brew's flavor. The general idea is to use with = spicer_<flavor> for a <flavor> brew, so we use spicer_nbrs for our nbrs brew. The advantage of using spicer functions is that, if you use RStudio, you can hit the tab key to automatically see a list of the relevant parameters for your brew.

For brewing veterans, spice also has a ... argument that lets you specify parameters directly instead of using with = spicer_nbrs(). This lets you write more concise code if you already know which parameters go with your brew's flavor and don't need any help remembering them. To clarify, we use the spice function both ways below.

# how to spice if you want help remembering what parameters to use
spcd_brew <- spice(brew_init, with = spicer_nbrs(k_neighbors = c(1:50)))

# a more concise spice for those who don't need auto-completion
spcd_brew <- spice(brew_init, k_neighbors = 1:50)

# take a look at the brew's parameters (what we added using spice)
spcd_brew$pars

Mash

mash allows you to set secondary parameters for imputing data with your brew. By secondary, we mean parameters that do not impact the number of imputed datasets created.

For a nearest neighbors brew, mash allows us to set functions that will be used to aggregate continuous, integer, or categorical values from $k$ nearest neighbors. Just like spice, we can use masher_<flavor> for a <flavor> brew, or we can write input arguments directly using ....

# how to mash if you want help remembering what parameters to use
mshd_brew <- mash(spcd_brew, with = masher_nbrs(fun_aggr_ctns = median))

# a more concise mashing call if you don't need help remembering parameters.
mshd_brew <- mash(spcd_brew, fun_aggr_ctns = median)

Stir

Once the parameters are set, it's time to fit imputation models with stir. This step fits the models and adds their imputed values to the wort component of the brew, and it will probably require more computation time than the other steps in the brew workflow. The timer argument can be set to TRUE in case you'd like to monitor the time taken to fit the models used for imputation.

stirred_brew <- stir(mshd_brew, timer = TRUE)

stirred_brew

The printed output of the brew has changed. Originally, printing our brew showed the data we wanted to impute, but now printing the brew object shows a data.table containing an id column (impute), imputation parameters (k_neighbors), and imputed values for the training data (iv_training).

Notably, the brew's structure has not changed (it is still a list with four components), but print.ipa_brew() now shows brew$wort instead of brew$data. The print method for ipa_brew objects is organized this way because focus normally shifts to the imputed data after running imputation.
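
Since the wort is just one component of that list, you can also extract it directly when you want to work with the imputed values yourself:

# the wort holds one row per imputation, with parameters and imputed values
stirred_brew$wort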

Ferment

ferment imputes missing values in testing data and only has three arguments, making it very easy to use:

ferm_brew <- ferment(stirred_brew, data_new = testing, timer = TRUE)

# the brew's wort now has an additional column, iv_testing,
# which contains imputed values for testing data.
ferm_brew

Under the hood, ferment does a lot of opinionated work based on the following principles:

  1. Missing data strategies should use the training data to impute missing values in the testing data and should never use the testing data to impute the training data. ipa applies two methods related to this principle:

     - Impute with fit: models developed from the training data are applied to create imputations for missing values in the testing data. This is the strategy applied by nearest neighbor imputation.

     - Stack and fit: some imputation strategies (e.g. softImpute) cannot apply a fitted model to new data by design, so ipa uses this method for them instead. Training data are imputed using only the training data. To impute testing data, the unimputed training and testing data are stacked, the same imputation models that were used to impute training data are re-fitted to the stacked dataset, and testing data are imputed based on these re-fitted models. This strategy may require a lot more computing time than 'impute with fit', and is therefore only applied when needed (a conceptual sketch of the stacking step follows this list).

  2. Outcome columns should not be used to impute missing values in training data. In statistical inference, it is widely known that outcome columns should be used to impute missing data in predictors, but the circumstances are different for prediction models. The difference is that outcome columns will naturally be missing in testing data from the real world. By 'real world', I mean settings where the outcome hasn't happened yet, e.g., predicting a patient's 10-year risk of having a stroke. A missing data strategy that depends on accessing outcome data will not be viable here.

  3. The same modeling strategy that was used to impute training data should also be used to impute testing data. In other words, we don't recommend imputing the training data with softImpute and then imputing the testing data using nearest neighbors.
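
To make the 'stack and fit' idea concrete, here is a rough conceptual sketch using dplyr. This illustrates the stacking step only, not the code ipa runs internally, and the .set column is just a marker added for the example:

# conceptual sketch of 'stack and fit' (illustration only, not ipa internals)
stacked <- dplyr::bind_rows(
  dplyr::mutate(training, .set = 'train'),
  dplyr::mutate(testing,  .set = 'test')
)
# an imputation model would be re-fitted to `stacked`, and the imputed rows
# where .set == 'test' would become the imputed testing data
head(stacked)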

What if you don't agree with these principles? It is hard to break from them using functions in the brew family. However, the more flexible ipa functions (e.g., impute_nbrs, impute_soft) can break these principles if you force them.

Bottle

After imputation, your brew is ready to be bottled. The bottle function converts the imputed values in your brew into complete imputed datasets, returned as tibbles or matrices depending on its type argument.

imputes <- bottle(ferm_brew, type = 'tibble')

imputes

Pipes and brews

The brew functions in ipa are designed to be used with the %>% operator. A typical workflow for creating a brew looks like this:

imputes <- training %>%                  # start with training data
  brew_nbrs(outcome = diabetes) %>%      # initialize brew; separate outcomes
  spice(k_neighbors = 1:50) %>%          # set primary tuning parameters
  mash(fun_aggr_ctns = median) %>%       # set secondary tuning parameters
  stir(timer = TRUE) %>%                 # impute training data
  ferment(data_new = testing) %>%        # impute testing data
  bottle(type = 'tibble')                # turn imputed values into datasets

We now have a total of 100 imputed data sets: 50 training sets and 50 testing sets using 1, 2, 3, ..., 50 neighbors to impute missing values.

In the next vignette, we will talk about sip and chug, which will help you assess your imputation strategies.


