knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 7, 
  fig.height = 5,
  comment = "#>"
)
library(ipa)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(yardstick)

data("ames", package = 'ipa')

df_cplt <- ames$complete
df_miss <- ames$missing

In the previous vignette, we covered the first six steps of the brewing process. This vignette adds the next two steps: sipping and chugging. We also show how these steps can help you choose an imputation strategy that supports an accurate prediction model.

First, we split our data into training and testing sets and then apply the first six brew steps.

# random seed for reproducing these results
set.seed(3290)
train_index <- sample(nrow(df_miss), 2000)

# data to impute
training <- df_miss[train_index, ]
testing <- df_miss[-train_index, ]

# complete data (used to evaluate accuracy of imputations)
training_cplt <- df_cplt[train_index, ]
testing_cplt  <- df_cplt[-train_index, ]

# a soft brew using the softImpute algorithm

sft_brew <- training %>%                              # data with missing values
  brew_soft(outcome = Sale_Price) %>%                 # initiate a soft brew
  verbose_on(level = 1) %>%                           # print progress messages
  spice(with = spicer_soft(rank_stp_size = 10)) %>%   # set imputation parameters
  mash(with = masher_soft(bs_col.scale = FALSE)) %>%  # set fitting parameters
  stir(timer = TRUE) %>%                              # fit the imputation models
  ferment(data_new = testing, timer = TRUE) %>%       # impute the testing data
  bottle(type = 'tibble')                             # return imputed data as tibbles

In this vignette, we cover these final two steps, each of which is geared toward helping you develop an accurate supervised prediction model with the brew's imputed data.

So far, we have imputed one training set and one testing set using a sequence of softImpute specifications, i.e., combinations of rank and regularization (lambda). A reasonable question to ask is "Which specification imputed the missing values most accurately?"

Sip

sip helps you check the accuracy of imputed values. Here is an obvious limitation: imputation accuracy can only be checked when you know what the imputed value should have been. Put another way, you can only check imputation accuracy if the data you imputed weren't really missing in the first place.

Nevertheless, you may still be able to draw insights about a problem by subsetting your incomplete data to contain only complete observations (using e.g. tidyr::drop_na()) and then artificially adding missing values (using e.g. mice::ampute()), as we have done with the ames data. Once you've found a strategy that seems to work best for your simulated missing data, you can apply it to the real missing data. (A key assumption in this approach is that the simulated missing patterns match the real ones. If this does not seem plausible, chug is a viable alternative to sip.)
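
To make that idea concrete, here is a hedged sketch (not part of this vignette's pipeline) of how missingness might be simulated. The 20% proportion and the numeric-only subset are illustrative choices; mice::ampute() expects fully observed numeric data.

# illustration only: simulate missing values in otherwise complete rows so
# that imputation accuracy can later be checked with sip()
library(mice)

df_sim <- df_miss %>% 
  select(where(is.numeric)) %>% # ampute() expects numeric data
  drop_na()                     # keep only fully observed rows

amputed <- ampute(df_sim, prop = 0.20) # ~20% of rows receive simulated missing values
head(amputed$amp)                      # the artificially incomplete data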

sip lets you decide how accuracy is scored. These instructions are passed using the following arguments:

  1. fun_ctns_error: a function to score continuous variables (doubles and integers)
  2. fun_bnry_error: a function to score binary variables (2 categories)
  3. fun_catg_error: a function to score categorical variables (>2 categories)

You can supply your own function for each variable type or rely on ipa's defaults, which evaluate continuous variables using an R-squared statistic and binary and categorical variables using a Kappa statistic. A benefit of the default approach is that both statistics are on the same scale and have similar interpretations.
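
For example, to score continuous variables with root-mean-squared error instead of R-squared, you could pass your own function to fun_ctns_error. The sketch below assumes that sip calls each error function with a vector of true values and a vector of imputed estimates; that signature is an assumption on our part, not something stated above.

# a hedged sketch: a custom error function for continuous variables.
# the (truth, estimate) signature is assumed, not documented in this vignette.
rmse_score <- function(truth, estimate){
  yardstick::rmse_vec(truth = truth, estimate = estimate)
}

# illustrative call; fun_ctns_error is the argument named in the list above
sft_brew %>% 
  sip(from = training, data_complete = training_cplt, fun_ctns_error = rmse_score)

In what follows, we simply rely on the defaults.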

# sip from the bottled brew
# (two sips total, to score both the training and testing imputations)
sft_brew <- sft_brew %>% 
  sip(from = training, data_complete = training_cplt) %>% 
  sip(from = testing, data_complete = testing_cplt)

Save the scores as a tibble outside of the brew so that they are easier to analyze.


impute_accuracy <- sft_brew %>% 
  pluck('wort') %>%
  as_tibble() %>% 
  unnest_wider(col = pars) %>% 
  select(lambda, rank_fit, training_score) %>% 
  unnest(cols = training_score)

impute_accuracy

Is there an ideal combination of lambda and rank for imputing the missing values in these data? A quick but limited way to investigate this is to compute an overall accuracy score for each imputed set by averaging the scores of the individual variables.

impute_accuracy_ovrl <- impute_accuracy %>% 
  group_by(lambda, rank_fit) %>% 
  summarize(score_overall = mean(score, na.rm = TRUE)) %>% 
  arrange(desc(score_overall))

ggplot(impute_accuracy_ovrl) + 
  aes(x = lambda, y = score_overall, size = rank_fit) + 
  geom_point(shape = 21, col = 'black', fill = 'red') + 
  theme_classic() +
  theme(text = element_text(size = 12)) +
  labs(y = 'Overall accuracy of imputed training values',
    x = 'Regularization of softImpute (lambda)')

It looks like using lambda = `r pull(impute_accuracy_ovrl, lambda)[which.max(impute_accuracy_ovrl$score_overall)]` and rank = `r pull(impute_accuracy_ovrl, rank_fit)[which.max(impute_accuracy_ovrl$score_overall)]` gives the best overall score. We can also look at the accuracy of imputations for individual variables:

impute_accuracy %>% 
  filter(variable %in% c('Longitude', 'Neighborhood')) %>% 
  ggplot(aes(x=lambda, y=score)) + 
  geom_point() + 
  geom_smooth() + 
  facet_wrap(~variable)
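
If you want to carry the best-scoring specification forward as an object of its own (for example, to re-use its lambda and rank_fit values later), it can be pulled from the overall summary table:

# the specification with the highest overall imputation accuracy
best_spec <- impute_accuracy_ovrl %>% 
  ungroup() %>% 
  slice_max(score_overall, n = 1)

best_spec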

Chug

In applied settings where sipping isn't feasible, chug provides a more direct way to assess the quality of imputed data. Specifically, chug fits a learning algorithm to each imputed training set, predicts outcomes in the corresponding imputed testing set, and then summarizes the accuracy of those predictions in the score_ex column, which is added to the brew's wort:

sft_brew <- chug(sft_brew)

mdl_scores <- sft_brew$wort %>%
  unnest_wider(pars) %>% 
  select(lambda, rank_fit, model_score) %>% 
  unnest_wider(model_score)

arrange(mdl_scores, desc(score_ex)) 

Under the hood, chug uses cv.glmnet to internally validate a regression model for each pair of imputed training and testing sets, returning a list that contains the model's prediction accuracy in the testing data (score_ex) and, optionally, the fitted model (model_cv).
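
To make the idea concrete, here is a simplified, standalone sketch of this kind of internal validation using cv.glmnet directly. It uses the complete data for brevity, whereas chug works with each imputed training and testing set, and ipa's actual implementation may differ.

# illustration only: a cross-validated glmnet model fit on the training rows
# and scored on the testing rows (chug does something analogous per imputed set)
library(glmnet)

x_all <- model.matrix(Sale_Price ~ ., data = df_cplt)[, -1]
y_all <- df_cplt$Sale_Price

cv_fit <- cv.glmnet(x = x_all[train_index, ], y = y_all[train_index])

preds <- as.numeric(predict(cv_fit, newx = x_all[-train_index, ], s = 'lambda.min'))

# external R-squared, analogous to the score_ex column in the wort
rsq_trad_vec(truth = y_all[-train_index], estimate = preds)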

The type of model and model summary depend on the type of outcome in the brew. Notably, model_cv will only be returned if keep_mdl = TRUE, which must be passed into .fun_args using the net_args() helper function:

sft_brew <- chug(sft_brew, .fun_args = net_args(keep_mdl = TRUE))

net_args() will also let you dictate glmnet arguments. For example, we can adjust the value of alpha and the complexity of the final model using cmplx.

sft_brew <- chug(sft_brew, 
  .fun_args = net_args(
    alpha = 0.1, 
    cmplx = 'lambda.min'
  )
)

# recompute the model scores from the updated wort before comparing
mdl_scores <- sft_brew$wort %>%
  unnest_wider(pars) %>% 
  select(lambda, rank_fit, model_score) %>% 
  unnest_wider(model_score)

arrange(mdl_scores, desc(score_ex))

chug also allows you to provide your own modeling function, with arguments that include .trn, .tst, and outcome. Below, we write and use a function to fit and validate a simple linear regression model:

# a function to fit a linear model, predict on new data, and score the predictions
# (the outcome argument is supplied by chug but is not needed here)
fit_lm <- function(.trn, .tst, outcome){
  lm(formula = Sale_Price ~ Longitude + Latitude, data = .trn) %>% 
    predict(.tst, type = 'response') %>% 
    rsq_trad_vec(truth = .tst$Sale_Price, estimate = .)
}

sft_brew <- chug(sft_brew, .fun = fit_lm)

model_scores <- sft_brew$wort %>% 
  unnest_wider(pars) %>% 
  select(lambda, rank_fit, model_score) %>% 
  mutate(model_score = as.numeric(model_score)) %>% 
  arrange(model_score)

The model scores are graphed below, showing that the variability explained in the testing data is highest when lambda = `r model_scores$lambda[which.max(model_scores$model_score)]`.

ggplot(model_scores) + 
  aes(x = lambda, y = model_score) + 
  geom_smooth(method = 'lm', col = 'grey90', formula = y~poly(x,3)) +
  geom_point(shape = 21, size = 2.4, col = 'black', fill = 'red') + 
  theme_classic() +
  theme(text = element_text(size = 12)) +
  labs(y = 'Variability explained in external data',
    x = 'Regularization of softImpute (lambda)')

