knitr::opts_chunk$set( collapse = TRUE , eval = T , comment = "#>" , out.width = '30%' , message = F , warning = F , echo = T )
suppressPackageStartupMessages( require(oetteR) ) suppressPackageStartupMessages( require(tidyverse) )
Classically regression models are judged by their performance based on a calculated performance metric like rtSME
and usually we will pick the least complex model with the best performance. However the best performance is often given by complex nonlinear models that are difficult to interpret in terms of what aspects of the data they are capturing, which makes it more difficult to trust them. Here certain visualisations can help us to judge the model and build up more confidence.
In linear regression we can rely on a simple easily interpretable function to judge the impact of one predictor variable on the response variable. We can set up a empirical method to come up with a similar visualisation. For most of the statistical models we have methods to determine variable importance. So we generally have an idea which variables to investigate. The trick is to generate artifical data where we change the values of only one predictor while keeping all other predictors constant at their median.
We are using the pipelearner
package. We want the models to be trained on the full data set. Unfortunately we can only do this in pipelearner if we set the test section to a really low value.
data = ISLR::Auto %>% mutate(idx = row_number() ) data_ls = f_clean_data( data, id_cols = 'idx' ) form = displacement~horsepower+cylinders+weight+acceleration+mpg+year variable_color_code = f_plot_color_code_variables(data_ls) limit = 10 pl = pipelearner::pipelearner( data_ls$data ) %>% pipelearner::learn_models( rpart::rpart, form ) %>% pipelearner::learn_models( randomForest::randomForest, form ) %>% pipelearner::learn_models( e1071::svm, form ) %>% pipelearner::learn_cvpairs( pipelearner::crossv_mc, test = 0.000001, n = 1 ) %>% pipelearner::learn()
pl = pl %>% mutate( imp = map2(fit, train, f_model_importance) ) pl$imp
pl = pl %>% mutate( plot = pmap( list( m = fit, ranked_variables = imp, title = model, data = train) , .f = f_model_plot_variable_dependency_regression , formula = form , data_ls = data_ls , variable_color_code = variable_color_code , limit = limit ) ) pl$plot %>% walk( print )
These plots are already quite informative. However we know the models are capaple of capturing nonlinear relationships meaning that the changes in response variable based on changes in one predictor can be quite different depending on the other predictors. For example the change in displacement in response to acceleration could be different even opposite in small cars compared to large cars
Here we could start to split the data into groups and repeat the above exercise for each group. We can either split the data on some notion we already have or simply split on the most important variable.
pl = pl %>% mutate( range_var = map_chr(imp, function(x) head(x,1)$row_names ) , grid = pmap( list( m = fit , title = model , variables = imp , range_variable = range_var ) , f_model_plot_var_dep_over_spec_var_range , formula = form , data_ls = data_ls , data = data_ls$data , variable_color_code = variable_color_code , log_y = F , limit = 12 ) ) pl$grid %>% walk( gridExtra::grid.arrange )
We can also put the dataset into the modelling space and trace observations that are particularly bad good or differently predicted by a group of models.
pl_pred = pl %>% f_predict_pl_regression( cols_id = 'idx', data_test = 'train') %>% unnest(preds) %>% mutate( title = model ) %>% group_by( title ) %>% arrange( idx ) %>% ungroup() f_plot_hist( 'resid', f_clean_data_no_changes(pl_pred) ) f_plot_hist( 'resid', f_clean_data_no_changes(pl_pred), group = 'title', graph_type = 'violin' )
The range of all residuals is binned into 5 equidistant groups. LL Low-Low, ML - Medium, M Medium, MH - Medium-High, HH High-High. And the predictions of an individual observation are tracked between models and grouped into flows.
p = f_plot_alluvial_1v1( data = pl_pred , col_x = 'title' , col_y = 'resid' , col_id = 'idx' , fill_by = 'all_flows' , order_levels_x = c('randomForest', 'svm') ) p
We can see that the residuals of single observations fall mostly into the same category for each model. Transisition from a High (MH, HH) to a Low range (LM,LL) or the other way around are very rare if wee look at the to better performing models randomFores and svm. We can thus conclude that the two models catch up on similar traits of the data. We also find a consistency in the observations that are poorly predicted. In a next step we will investigate what sets them apart from the other observations that are predicted better.
f_tag = function(x,y,z){ if( x == 'M' | y == 'M' | z == 'M'){ return( 'Med' ) } else if( stringr::str_detect(x,'L') & stringr::str_detect(y,'L') & stringr::str_detect(z,'L')){ return( 'Low' ) }else if( stringr::str_detect(x,'H') & stringr::str_detect(y,'H') & stringr::str_detect(z,'H')){ return( 'High' ) } else { 'Cross' } } df = p$data_key %>% mutate_if( is.factor, as.character ) %>% ## factors will be passed as integers py pmap mutate( tag = pmap_chr( list( x = randomForest, y = rpart, z = svm), f_tag ) , tag = as.factor(tag) , idx = as.integer(idx) ) %>% select( idx, tag ) pl_tag = pl_pred %>% left_join( df ) p = f_plot_alluvial_1v1( data = pl_tag , col_x = 'title' , col_y = 'resid' , col_fill = 'tag' , fill_right = F , col_id = 'idx' , order_levels_x = c('randomForest', 'svm') , order_levels_fill = c('Low', 'Cross', 'Med', 'High') ) p
variables are ordered by importance from left to right
order_variables = pl %>% unnest(imp) %>% group_by( row_names ) %>% summarise( rank_sum = sum(rank) ) %>% arrange( rank_sum ) %>% .$row_names # will exclude the one observation which we put in the test set data_ls_new = data %>% left_join(df) %>% select( tag, displacement, order_variables ) %>% f_clean_data() f_plot_alluvial(data_ls_new$data, order_levels = c('Low', 'Cross', 'Med', 'High') ) + theme( axis.text.x = element_text(angle = 90))
taglist = f_stat_group_ana(data_ls_new , 'tag' , alluvial = F , tabplot = F , static_plots = F , return_taglist = T ) taglist
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.