knitr::opts_chunk$set(
  collapse = TRUE
  , eval = TRUE
  , comment = "#>"
  , out.width = '100%'
  , message = FALSE
  , warning = FALSE
  , echo = TRUE
)
suppressPackageStartupMessages( library(oetteR) )
suppressPackageStartupMessages( library(tidyverse) )

Introduction Leakage

When you build a model, the first step is to prepare your dataset and decide which variables to include. A common modelling problem is predicting future customer behaviour based on how customers behaved in the past. In theory you want to include as many variables as you can to cover all possibilities. In selecting those variables, however, we have to make sure that none of them carries any information about the future period. For example, a score attached to the customer in our database could have been calculated on the basis of recent events. In this case we have to look for an archived, historic version of that score. The term for information from the test data (the future) that is carried into the training data (the past) is leakage.

We can also call it leakage if variables carry coincidental information that makes them good predictors now but that they are unlikely to carry in the future. If, for example, we recruited a couple of very good customers at an event, they share adjacent customer IDs and a similar recruitment date. The model is then likely to use customer ID or recruitment date as a predictor. The event might not be repeated next year, however, rendering those predictors useless.
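The rule that predictors may only use information from before the prediction period can be sketched with a simple date cutoff. This is a minimal base R illustration, not part of oetteR; the transaction log `tx`, its column names and the cutoff date are all made up for the example.

```r
# Hypothetical transaction log; all names and values are invented for illustration
tx <- data.frame(
  customer = c("a", "a", "b", "b", "c"),
  date     = as.Date(c("2018-03-01", "2019-02-01", "2018-06-15",
                       "2018-11-30", "2019-01-20")),
  amount   = c(10, 40, 25, 5, 60)
)
cutoff <- as.Date("2019-01-01")

# Predictors may only see the past, the response only the future
past   <- tx[tx$date <  cutoff, ]
future <- tx[tx$date >= cutoff, ]

features <- aggregate(amount ~ customer, data = past, FUN = sum)
names(features)[2] <- "past_spend"
target <- aggregate(amount ~ customer, data = future, FUN = sum)
names(target)[2] <- "future_spend"

# Keep every customer known before the cutoff; no future purchase means 0 spend
model_data <- merge(features, target, by = "customer", all.x = TRUE)
model_data$future_spend[is.na(model_data$future_spend)] <- 0
model_data
```

Customer `c` only appears after the cutoff and is therefore dropped, since nothing was known about that customer at prediction time.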

More information on leakage can be found in this Kaggle post.

Detecting Leakage

Leaking variables are usually very strong predictors. One method for detecting them is to train a few models on the preselected variables and then look at the importance of each variable in each model. In most cases a leaking variable stands out because it is a much stronger predictor than the other variables and because it is almost equally important in all models tested.
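One model-agnostic way to obtain such importances is permutation importance: shuffle one predictor at a time and measure how much the prediction error grows. The sketch below uses only base R, the built-in `mtcars` data and a linear model. It is an assumption-laden illustration of the idea and not necessarily how the oetteR functions used later compute importance.

```r
set.seed(1)

# mtcars ships with R; sort by the response and add a leaking id
d <- mtcars[order(mtcars$mpg), c("mpg", "wt", "hp", "qsec")]
d$id <- seq_len(nrow(d))

fit <- lm(mpg ~ ., data = d)

# Mean squared error of a model on a data frame that contains mpg
mse <- function(model, newdata) {
  mean((newdata$mpg - predict(model, newdata))^2)
}
base_mse <- mse(fit, d)

predictors <- setdiff(names(d), "mpg")

# Average increase in MSE over 20 shuffles of each predictor
perm_importance <- sapply(predictors, function(v) {
  increases <- replicate(20, {
    shuffled <- d
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - base_mse
  })
  mean(increases)
})

sort(perm_importance, decreasing = TRUE)
```

Repeating this for several model types and comparing the rankings mirrors the approach described above.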

Generating a leaking variable

We sort the data set by the response variable and create an ID based on row_number().

data = ISLR::Auto %>%
  arrange( displacement ) %>%
  mutate( id = row_number() )
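Why this produces a leak can be checked directly: after sorting, the ID is a monotone function of the response, so the rank correlation between the two is close to 1. A minimal sketch, using the built-in `mtcars` data and its `disp` column as a stand-in for `ISLR::Auto` so that it runs without extra packages:

```r
# Stand-in data: mtcars instead of ISLR::Auto, disp instead of displacement
d <- mtcars[order(mtcars$disp), ]
d$id <- seq_len(nrow(d))

# Spearman rank correlation between the artificial id and the response
rho <- cor(d$id, d$disp, method = "spearman")
rho
```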

Creating Regression models

Here we create a randomForest, an rpart and an svm model, because these models do not require feature selection and their performance usually does not decrease when you use a lot of variables, as long as there are more observations than predictors. Conveniently, they all use the same modelling syntax, so we can train them easily in a modelling tibble.

We do not use pipelearner here. Since we do not need cross validation or parameter tuning at this stage, we can easily build our own modelling tibble.

data_ls = f_clean_data( data, min_number_of_levels_nums = 10 )

# some of the functions extract variables from the formula and thus only work if it is constructed without the '.'
form =  paste( data_ls$all_variables[ data_ls$all_variables != 'displacement' ], collapse = ' + ' ) %>%
  paste( 'displacement ~', . ) %>%
  as.formula()


tib = tibble( f = c( randomForest::randomForest
                      , rpart::rpart 
                      , e1071::svm )
              , name = c('randomForest'
                       , 'rpart'
                       , 'e1071' )
              , data = list( modelr::resample( data_ls$data, idx = c(1:nrow(data_ls$data) ) ) ) 
              )


tib_train = tib %>%
  mutate( m = map2( f, data, function(f, data) f( form, as.data.frame(data) ) ) )
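The same pattern, a list of fitting functions that share one formula interface, can be sketched without any tidyverse dependency. The example below swaps in `lm` and `glm` from base R and the built-in `mtcars` data purely for illustration; the models and variables are assumptions, not those used in this vignette.

```r
# One formula, several fitting functions with the same calling convention
form_demo <- mpg ~ wt + hp
fitters <- list(lm = stats::lm, glm = stats::glm)

# Apply every fitting function to the same formula and data
fits <- lapply(fitters, function(f) f(form_demo, data = mtcars))

# One fitted model per entry, named after the function that produced it
sapply(fits, function(m) class(m)[1])
```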

Adding variable importance

In oetteR I have three functions that extract variable importance in a unified format.

There is also a wrapper for those three functions, f_model_importance(), which detects the model class and calls the correct function, so we can use it in a modelling tibble.

tib_imp = tib_train %>%
  mutate( imp = map(m, f_model_importance, data_ls$data ) )

tib_imp

Investigating variable importance

Bar Plots

f_model_importance_plot returns a plotly graph

tib_plot = tib_imp %>%
  mutate( plot = map2( imp, name, f_model_importance_plot) )


htmltools::tagList( tib_plot$plot )

As expected, the leaking variable is the best predictor in all three models.

Tabplots

If we find other doubtful variables in our plots, we might want to investigate them in a tableplot using f_model_importance_plot_tableplot.

We have to use the tabplot:::plot.tabplot function to properly add a title to the plot.

tib_plot = tib_plot %>%
  mutate(  tabplot = pmap( list( data, imp)
                          , f_model_importance_plot_tableplot
                          , response_var = 'displacement'
                          , print = F
                          )  )


pwalk( list( x = tib_plot$tabplot, title = tib_plot$name), tabplot:::plot.tabplot)

Variable Dependency

We might also want to check whether the important variables affect the response in the expected direction. The idea, again, is to test the quality of our dataset. For example, does an increase in purchase frequency increase a customer's overall net spend? If it does not, or if the effect is reversed, the variable might carry the wrong algebraic sign (+/-). To check this we simulate an artificial dataset in which all variables but one are held at their median value (or most common level). The remaining variable takes an evenly spaced sequence of values over its range. We generate one such grid for each of the most important variables, feed it into the model and plot the predicted response.

f_model_data_grid(col_var = 'mpg', formula = form, data_ls = data_ls, n = 20 )
tib_plot = tib_plot %>%
  mutate( var_dep = map2( m, imp
                          , f_model_plot_variable_dependency_regression
                          , formula = form
                          , data_ls = data_ls) )


tib_plot$var_dep %>%
  walk( print )
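The grid construction described above can be sketched in a few lines of base R. This is only an illustration of the idea under simplifying assumptions (numeric predictors only, a linear model, the built-in `mtcars` data); `make_grid()` is a hypothetical helper and not the implementation of f_model_data_grid().

```r
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]
fit <- lm(mpg ~ wt + hp + qsec, data = d)

# Hypothetical helper: hold all columns at their median except `vary`,
# which runs over an evenly spaced sequence across its observed range
make_grid <- function(data, vary, n = 20) {
  others <- setdiff(names(data), vary)
  grid <- as.data.frame(lapply(data[others], function(x) rep(median(x), n)))
  grid[[vary]] <- seq(min(data[[vary]]), max(data[[vary]]), length.out = n)
  grid
}

grid <- make_grid(d[, c("wt", "hp", "qsec")], vary = "wt")

# Feed the grid into the model; the predictions trace the effect of wt alone
grid$predicted_mpg <- predict(fit, newdata = grid)
head(grid)
```

Plotting `predicted_mpg` against `wt` then shows the direction of the effect with the other variables held constant. Factor columns would need their most common level instead of a median, which this sketch does not handle.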

Wrap everything up

I have two functions that integrate the functions from above in a pipelearner-based workflow. They can, however, be adapted to other modelling dataframes/tibbles.

-f_model_importance_pl_add_plots_regression() adds all three plot types to the dataframe. Here I use dplyr-style symbols instead of strings to denote column names as arguments. I don't really see the advantage of symbols over strings. Symbols are convenient when writing code in the console, but are quite irritating to program with and to write functions for, so I never fully adopted the habit of using symbols in my functions.

-f_model_importance_pl_plots_as_html() creates three separate html files. Since I would like to limit this vignette to a single document, I will use the webshot package to insert screenshots instead.

Add plots

tib_plot = tib_imp %>%
  mutate( response_var = 'displacement'
          , title = name ) %>%
  f_model_importance_pl_add_plots_regression( data, m, imp, response_var, title
                                              , formula = form
                                              , data_ls = data_ls
                                              )


tib_plot %>%
  select( imp_plot, imp_tabplot, imp_plot_dep, everything() )

Create Htmls

Note that f_model_importance_pl_plots_as_html is not very flexible when it comes to column names. It expects columns named 'title', 'imp_plot', 'imp_plot_dep' and 'imp_tabplot'.

f_model_importance_pl_plots_as_html( tib_plot, prefix = 'vignette_' , quiet = TRUE )
webshot::webshot( 'vignette_importance_variable_dependencies.html'  )

file.remove( 'vignette_importance_variable_dependencies.html' )
webshot::webshot( 'vignette_importance_plots.html'  )

file.remove( 'vignette_importance_plots.html' )
webshot::webshot( 'vignette_importance_tabplots.html'  )

file.remove( 'vignette_importance_tabplots.html' )

file.remove('webshot.png')


erblast/oetteR documentation built on May 27, 2019, 12:11 p.m.