knitr::opts_chunk$set( collapse = TRUE, eval = TRUE, comment = "#>", out.width = '100%', message = FALSE, warning = FALSE, echo = TRUE )
suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(tidyverse) )
When you build a model, the first step is to prepare your dataset and decide which variables to include. A common modelling problem is predicting future customer behaviour from how customers behaved in the past. In theory you want to include as many variables in your analysis as you can in order to cover all possibilities. In selecting those variables, however, we have to make sure that none of them carries any information about the future period. For example, a score in our database connected to the customer could be calculated on the basis of recent events. In this case we have to look for an archived historic version of that score. The term for information from the test data (the future) that is carried into the training data (the past) is leakage.

We also call it leakage if variables carry coincidental information that makes them good predictors now but that they are unlikely to carry in the future. If, for example, we have recruited a couple of very good customers at an event, they share adjacent customer IDs and a similar recruitment date. The model is then likely to use customer ID or recruitment date as a predictor; however, the event might not be repeated next year, rendering those predictors useless.
One can find more information on leakage in this Kaggle post.
Leaking variables are usually very strong predictors. One method for detecting them is to train a few models on the preselected variables and then look at the importance of each variable in each model. In most cases a leaking variable stands out because it is a much stronger predictor than the other variables and because it is almost equally important in all models tested.
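The check described above can be sketched without any of the packages used later in this vignette. The following is a minimal sketch under several assumptions: base R's mtcars data stands in for a customer dataset, score is a hypothetical leaking column (the response plus a little noise, mimicking a score calculated from recent events), and permutation importance is used as a model-agnostic stand-in for model-specific importance measures. perm_importance() is an illustrative helper, not an oetteR function.

```r
set.seed(1)

# Small numeric dataset; 'disp' plays the role of the response.
d <- mtcars[, c("disp", "mpg", "hp", "wt", "qsec")]

# Hypothetical leaking variable: derived from the response itself.
d$score <- d$disp + rnorm(nrow(d), sd = 10)

# Permutation importance: average increase in mean squared error after
# shuffling one predictor while leaving everything else untouched.
perm_importance <- function(fit, data, response, n_rep = 10) {
  predictors <- setdiff(names(data), response)
  mse <- function(df) mean((df[[response]] - predict(fit, newdata = df))^2)
  base_mse <- mse(data)
  sapply(predictors, function(v) {
    shuffled_mse <- replicate(n_rep, {
      df <- data
      df[[v]] <- sample(df[[v]])
      mse(df)
    })
    mean(shuffled_mse) - base_mse
  })
}

# Two different model types trained on the same preselected variables.
models <- list(
  lm  = lm(disp ~ ., data = d),
  ppr = ppr(disp ~ ., data = d, nterms = 1)
)

# Importance scaled so that the strongest predictor of each model equals 1.
imp <- sapply(models, function(m) {
  raw <- perm_importance(m, d, response = "disp")
  raw / max(raw)
})
round(imp, 2)
```

A column that scores close to 1 in every model while all other columns stay far below is a candidate for leakage and worth tracing back to its source.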
To simulate a leaking variable, we sort the data set by the response variable (displacement) and create an ID based on row_number().
data = ISLR::Auto %>% arrange( displacement ) %>% mutate( id = row_number() )
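To see why such an id leaks, it can be compared with the same ids assigned in random row order. This is a minimal sketch that assumes base R's mtcars data (with disp as the response) instead of ISLR::Auto, so that it runs without extra packages:

```r
set.seed(1)

# Sort by the response, then number the rows: the id now encodes the rank
# of the response.
sorted <- mtcars[order(mtcars$disp), ]
sorted$id <- seq_len(nrow(sorted))

# Control: the same ids assigned in random order carry no such information.
sorted$id_random <- sample(sorted$id)

cor_leak    <- cor(sorted$disp, sorted$id, method = "spearman")
cor_control <- cor(sorted$disp, sorted$id_random, method = "spearman")

c(leaking_id = cor_leak, random_id = cor_control)
```

The rank correlation of the sorted id with the response is close to 1 by construction, which is exactly the kind of coincidental information a model will happily exploit.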
Here we create a randomForest, an rpart and an svm model because these models do not require feature selection and their performance usually does not decrease if you use a lot of variables, as long as we have more observations than predictors. It comes in handy that they all use the same modelling syntax, so we can train the models easily in a modelling tibble.

Here we do not use pipelearner. Since we do not have to do cross validation or parameter tuning at this stage, we can easily build our own modelling tibble.
data_ls = f_clean_data( data, min_number_of_levels_nums = 10 )

# some of the functions extract variables from the formula and thus only
# work if it is constructed without the '.'
form = paste( data_ls$all_variables[ data_ls$all_variables != 'displacement' ], collapse = ' + ' ) %>%
  paste( 'displacement ~', . ) %>%
  as.formula()

tib = tibble( f = c( randomForest::randomForest, rpart::rpart, e1071::svm )
              , name = c('randomForest', 'rpart', 'e1071' )
              , data = list( modelr::resample( data_ls$data, idx = c(1:nrow(data_ls$data) ) ) ) )

tib_train = tib %>%
  mutate( m = map2( f, data, function(f, data) f( form, as.data.frame(data) ) ) )
In oetteR I have three functions that extract variable importance in a unified format:
f_model_importance_randomForest()
f_model_importance_rpart()
f_model_importance_svm()
We also have a wrapper for those three functions, f_model_importance(), that detects the model class and calls the correct function, so we can use it in a modelling tibble.
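The idea of such a wrapper can be sketched in base R. This is not the oetteR implementation: the function names importance_lm(), importance_glm() and model_importance() are hypothetical, the models are lm and glm rather than randomForest, rpart and svm, and the absolute test statistic of each coefficient is only an assumed stand-in for a real importance measure. What the sketch shows is the dispatch on model class and the unified return format.

```r
# Hypothetical extractors; each returns data.frame(variable, value).
importance_lm <- function(m) {
  stat <- abs(coef(summary(m))[-1, "t value"])      # drop the intercept
  data.frame(variable = names(stat), value = unname(stat))
}

importance_glm <- function(m) {
  cf <- coef(summary(m))
  # The statistic column is called "t value" or "z value" depending on family.
  stat_col <- grep("value$", colnames(cf), value = TRUE)[1]
  stat <- abs(cf[-1, stat_col])
  data.frame(variable = names(stat), value = unname(stat))
}

# Wrapper: detect the model class and call the matching extractor.
model_importance <- function(m) {
  if (inherits(m, "glm")) {          # check glm first, a glm also inherits from "lm"
    importance_glm(m)
  } else if (inherits(m, "lm")) {
    importance_lm(m)
  } else {
    stop("no importance extractor for class: ", class(m)[1])
  }
}

models <- list(
  lm  = lm(disp ~ mpg + wt + hp, data = mtcars),
  glm = glm(am ~ hp + wt, family = binomial, data = mtcars)
)

# Because the wrapper has one signature, it can be mapped over a list of models.
imp <- lapply(models, model_importance)
imp
```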
tib_imp = tib_train %>%
  mutate( imp = map( m, f_model_importance, data_ls$data ) )

tib_imp
f_model_importance_plot() returns a plotly graph.
tib_plot = tib_imp %>%
  mutate( plot = map2( imp, name, f_model_importance_plot ) )

htmltools::tagList( tib_plot$plot )
As expected, we find that the leaking variable id is the best predictor in all three models.
If we find other doubtful variables in our plots, we might want to investigate them in a tableplot using f_model_importance_plot_tableplot().

We have to use the tabplot:::plot.tabplot function to properly add a title to the plot.
tib_plot = tib_plot %>%
  mutate( tabplot = pmap( list( data, imp )
                          , f_model_importance_plot_tableplot
                          , response_var = 'displacement'
                          , print = FALSE ) )

pwalk( list( x = tib_plot$tabplot, title = tib_plot$name ), tabplot:::plot.tabplot )
We might also want to know whether the important variables affect the response in the expected direction. The idea is again to test the quality of our dataset. For example, does an increase in purchase frequency increase the overall net spend of a customer? If this is not the case, or if the effect is reversed, the variable might have the wrong algebraic sign (+/-). To check this we simulate an artificial dataset in which all variables but one are kept at their median value (or most common level). The one not kept constant takes an evenly spaced sequence over its range. We generate one of those grids for each of the most important variables and feed it into the model. Then we plot the response.
f_model_seq_range() will generate an evenly spaced sequence of one variable over its range, for factors and numericals.
f_model_data_grid() will apply f_model_seq_range() to build a grid.
f_model_plot_variable_dependency_regression() will use both previous functions to generate a plot.
f_model_data_grid(col_var = 'mpg', formula = form, data_ls = data_ls, n = 20 )
tib_plot = tib_plot %>%
  mutate( var_dep = map2( m, imp
                          , f_model_plot_variable_dependency_regression
                          , formula = form
                          , data_ls = data_ls ) )

tib_plot$var_dep %>%
  walk( print )
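The grid idea behind these functions can also be sketched in base R. This is a simplified illustration and not how oetteR implements it: make_grid() is a hypothetical helper, it handles numeric predictors only (no most-common-level logic for factors), and it uses mtcars with a plain linear model.

```r
# Build a grid in which 'col_var' runs over its observed range and all other
# predictors are held at their median (numeric columns only in this sketch).
make_grid <- function(data, predictors, col_var, n = 20) {
  grid <- as.data.frame(lapply(data[predictors], function(x) rep(median(x), n)))
  grid[[col_var]] <- seq(min(data[[col_var]]), max(data[[col_var]]), length.out = n)
  grid
}

predictors <- c("mpg", "wt", "hp")
fit <- lm(disp ~ mpg + wt + hp, data = mtcars)

# Vary 'wt' only and feed the artificial observations into the model.
grid <- make_grid(mtcars, predictors, col_var = "wt", n = 20)
grid$pred <- predict(fit, newdata = grid)

# The direction of the response along the grid shows the sign of the effect.
plot(grid$wt, grid$pred, type = "l", xlab = "wt", ylab = "predicted disp")
```

If the curve moves in the opposite direction to what domain knowledge suggests, the variable deserves a closer look before the model is trusted.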
I have two functions that integrate the functions from above in a pipelearner based workflow. They can, however, be adapted to other modelling dataframes/tibbles.
- f_model_importance_pl_add_plots_regression() adds all three plot types to the dataframe. Here I use dplyr style symbols instead of strings to denote column names as arguments. I do not really see the advantage of symbols over strings: symbols are convenient when writing code in the console, but are quite irritating to program with and to write functions for, so I never fully adopted the habit of using symbols in my functions.
- f_model_importance_pl_plots_as_html() creates three separate html files. Since I would like to limit this vignette to a single document, I will use the webshot package to insert screenshots instead.
tib_plot = tib_imp %>%
  mutate( response_var = 'displacement'
          , title = name ) %>%
  f_model_importance_pl_add_plots_regression( data, m, imp, response_var, title
                                              , formula = form
                                              , data_ls = data_ls )

tib_plot %>%
  select( imp_plot, imp_tabplot, imp_plot_dep, everything() )
Note that f_model_importance_pl_plots_as_html() is not so flexible when it comes to column names. It expects columns named 'title', 'imp_plot', 'imp_plot_dep' and 'imp_tabplot'.
f_model_importance_pl_plots_as_html( tib_plot, prefix = 'vignette_' , quiet = TRUE )
webshot::webshot( 'vignette_importance_variable_dependencies.html' )
file.remove( 'vignette_importance_variable_dependencies.html' )

webshot::webshot( 'vignette_importance_plots.html' )
file.remove( 'vignette_importance_plots.html' )

webshot::webshot( 'vignette_importance_tabplots.html' )
file.remove( 'vignette_importance_tabplots.html' )
file.remove( 'webshot.png' )