In bcjaeger/midy: Imputation for Predictive Analytics

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

Imputation for predictive analytics (`ipa`)

The goal of ipa is to make imputation in predictive modeling workflows more straightforward and efficient. The main functions in ipa are

brew Create a container to hold your imputations
spice (optional) set parameters that govern the number of imputations for the given brew
mash fit models that will provide imputations
ferment impute missing values in training and (optionally) testing data
bottle output the imputed datasets in a tibble.

Installation

You can install the development version of ipa from GitHub with:

# install.packages("devtools")
devtools::install_github("bcjaeger/ipa")

Example: diabetes data

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data originally described 1046 participants who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. A total of 403 participants were screened for diabetes, which was defined as glycosolated hemoglobin > 7.0. Among those 403 participants, 366 provided complete data for the variables in these data.

First, we'll load some packages and set a seed for reproducibility

library(tidymodels)
library(tidyverse)
library(ipa)

data("diabetes", package = 'ipa')

diab_complete <- diabetes$complete
diab_missing <- diabetes$missing

glimpse(diab_missing)

K-nearest-neighbor imputation

K-nearest-neighbors (KNN) is a flexible and useful method for imputation of missing data. Briefly, each missing value is imputed by aggregating or randomly sampling a value from the K observations with greatest similarity to the current obervation. Conventional methods for KNN provide imputations using a single value of K, which makes it hard to identify an optimal value of K. Ideally, one would generate imputed datasets using a number of different values of K, and then use whichever K provided the most accurate imputed values or the most accurate prediction model (usually, these K are the same or similar). This is one of the things ipa does.

Generally, the ipa workflow follows the same patterned steps. Here is an example of this workflow using k-nearest-neighbors. A more detailed explanation of the steps below is written in the Getting started vignette.

nbrs_brew <- brew_nbrs(diab_missing, outcome = diabetes) %>%
  verbose_on(level = 1) %>% # makes the brew noisy
  spice(with = spicer_nbrs(k_neighbors = 1:35)) %>% # tuning parameters
  mash() %>% # creates imputed values 
  ferment() %>% # turns values into a data frame
  bottle(type = 'tibble') # turns data into tibbles, adds outcome column

nbrs_brew