```{css, echo = FALSE}
pre code, pre, code {
  white-space: pre !important;
  overflow-x: scroll !important;
  word-break: keep-all !important;
  word-wrap: initial !important;
}
```

<style>
body {
  text-align: justify;
}
</style>

```r
knitr::opts_chunk$set(collapse = TRUE, comment = '#>', cache = TRUE, cache.lazy = FALSE)
library(magrittr)
library(ggplot2)
library(foreach)
library(maidrr)
```

## MTPL datasets included in maidrr

The maidrr package contains two motor third party liability (MTPL) insurance portfolios, one from Belgium (BE) and one from France (FR). Both will be used to illustrate the use of maidrr, so let's have a quick look at them:

```r
data('mtpl_be'); str(mtpl_be, give.attr = FALSE)
data('mtpl_fr'); str(mtpl_fr, give.attr = FALSE)
```

Details on all the features included in the BE and FR portfolios are available in the package documentation, retrievable via `?maidrr::mtpl_be` and `?maidrr::mtpl_fr` respectively.

## Black box algorithm to start from

A Poisson GBM is trained on each MTPL portfolio in order to model and predict the claim frequency, i.e., the number of claims filed by a policyholder. Note that the tuning parameter values are chosen rather arbitrarily as the tuning of an optimal black box algorithm is not the purpose of this vignette. The goal of maidrr is to obtain an interpretable surrogate model that approximates your black box as closely as possible. Let's get started!

```r
library(gbm)
set.seed(54321)
gbm_be <- gbm(nclaims ~ offset(log(expo)) + ageph + power + bm + agec + coverage + fuel + sex + fleet + use,
              data = mtpl_be, distribution = 'poisson', shrinkage = 0.01, n.trees = 500, interaction.depth = 3)
gbm_fr <- gbm(nclaims ~ offset(log(expo)) + ageph + power + bm + agec + fuel + brand + region,
              data = mtpl_fr, distribution = 'poisson', shrinkage = 0.01, n.trees = 500, interaction.depth = 3)
```

## Workflow of the maidrr procedure

This section describes the four main functions in maidrr that facilitate the workflow from black box to surrogate:

  1. `insights`: obtain insights from a black box model in the form of partial dependence (PD) feature effects.
  2. `segmentation`: use the PDs to group feature values/levels and segment observations in a data-driven way.
  3. `surrogate`: fit a surrogate generalized linear model (GLM) with factor features to the segmented data.
  4. `explain`: explain the prediction of a surrogate GLM via the contributions of all features (and interactions).

These operations are designed to allow piping via `%>%` from magrittr, creating a neat workflow: `bb_fit %>% insights(...) %>% segmentation(...) %>% surrogate(...) %>% explain(...)`

## Insights from the black box

Call `insights(mfit, vars, data, interactions = 'user', hcut = 0.75, pred_fun = NULL, fx_in = NULL, ncores = -1)`; the arguments are documented in `?maidrr::insights`.

To illustrate the use of the argument `pred_fun`, we define the following function for our GBMs:

```r
gbm_fun <- function(object, newdata) mean(predict(object, newdata, n.trees = object$n.trees, type = 'response'))
```
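Note that `gbm_fun` returns a single averaged prediction for the supplied `newdata`. As a quick sanity check of this interface, using the BE GBM fitted above:

```r
# Should return one number: the average predicted claim frequency for these rows.
gbm_fun(gbm_be, mtpl_be[1:5, ])
```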

For the FR portfolio, we ask for the effects on all features in the GBM and one user-specified interaction:

```r
fx_vars_fr <- gbm_fr %>% insights(vars = c(gbm_fr$var.names, 'power_fuel'),
                                  data = mtpl_fr,
                                  interactions = 'user',
                                  pred_fun = gbm_fun)
```

For the BE portfolio, we ask for the effects on all features in the GBM and the auto-selected interactions:

```r
fx_vars_be <- gbm_be %>% insights(vars = gbm_be$var.names,
                                  data = mtpl_be,
                                  interactions = 'auto',
                                  hcut = 0.75,
                                  pred_fun = gbm_fun)
```

The output is a named list of tibble objects containing the effects, one for each feature and interaction:

```r
class(fx_vars_be)
names(fx_vars_be)
```

The tibble for a main effect has `r ncol(fx_vars_be[['ageph']])` columns with standardized names `r names(fx_vars_be[['ageph']])`, containing the feature value, the effect and the observation counts in data respectively:

```r
fx_vars_be[['ageph']]
```

The tibble for an interaction effect has `r ncol(fx_vars_be[['bm_agec']])` columns with standardized names `r names(fx_vars_be[['bm_agec']])`, containing the feature values, the effect and the observation counts in data respectively:

```r
fx_vars_be[['bm_agec']]
```

Notice the difference in scale of the feature effect, contained in column y, between a main and an interaction effect. This comes from the fact that the interaction effects are calculated as pure interactions, by subtracting both 1D effects from the 2D effect. The main effects are therefore centered around the observed claim frequency, which equals `r round(sum(mtpl_be$nclaims)/sum(mtpl_be$expo), digits = 4)` for the BE portfolio, while the interaction effects are centered around zero:

```r
unlist(lapply(fx_vars_be, function(fx) weighted.mean(fx$y, fx$w)))
```
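For intuition on this centering, a pure interaction can be obtained roughly as follows. This is a minimal sketch assuming main-effect tibbles with columns x and y and a 2D effect tibble with columns x1, x2 and y; it is not the internal maidrr implementation:

```r
# Sketch: turn a 2D partial dependence into a pure interaction by removing
# both 1D effects (the column names are assumptions, not maidrr internals).
pure_interaction <- function(pd_2d, pd_x1, pd_x2) {
  pd_2d$y <- pd_2d$y -
    pd_x1$y[match(pd_2d$x1, pd_x1$x)] -  # subtract the first main effect
    pd_x2$y[match(pd_2d$x2, pd_x2$x)]    # subtract the second main effect
  pd_2d  # the result is (approximately) centered around zero
}
```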

The function maidrr::plot_pd can be used to visualize the effects. This is illustrated for a continuous feature in the BE portfolio and a categorical feature in the FR portfolio, where darker colours or larger dots indicate higher observation counts for those feature values:

```r
gridExtra::grid.arrange(fx_vars_be[['ageph']] %>% plot_pd + ggtitle('MTPL BE'), 
                        fx_vars_fr[['brand']] %>% plot_pd + ggtitle('MTPL FR'),
                        ncol = 2)
```

The function maidrr::plot_pd can also be used to visualize the interactions. This is illustrated for two continuous features in the BE portfolio and two categorical features in the FR portfolio. It is important to understand that these represent corrections on the main effects of those features. So young policyholders driving a high-powered car receive an extra penalty on top of their main effects in the BE portfolio. In the FR portfolio one can conclude that for low-powered cars diesel is more risky than gasoline, but this behaviour reverses from power level 7 onward and the difference grows even larger from power level 9.

```r
gridExtra::grid.arrange(fx_vars_be[['ageph_power']] %>% plot_pd + ggtitle('MTPL BE'), 
                        fx_vars_fr[['power_fuel']] %>% plot_pd + ggtitle('MTPL FR'),
                        ncol = 1)
```

The goal of the maidrr procedure is to group values/levels within a feature that show similar behaviour, so that regions where the effect is quite stable end up in the same group.

Performing such a grouping in an optimal, automatic and data-driven way is the goal of maidrr, so stay tuned!

Note on variable selection: in the examples above we simply used all the features from the GBM (via $var.names on the gbm object). When your black box model uses many features, you might want to exclude unimportant ones from the analysis to save computation time. The functions maidrr::get_vi and maidrr::plot_vi provide insight into which features matter in the black box: the former calculates variable importance scores for all features in the model (returned as a tibble), while the latter plots the results (as a ggplot). These functions can be used to determine which features to supply to the vars argument of insights.

```r
gridExtra::grid.arrange(gbm_be %>% get_vi %>% plot_vi + ggtitle('MTPL BE'),
                        gbm_fr %>% get_vi %>% plot_vi + ggtitle('MTPL FR'),
                        ncol = 2)
```

### Helper functions

The function insights streamlines the exploration process by building on several maidrr helper functions.

These functions can be used on a stand-alone basis to perform your own tailored analysis. Details on the use of these functions and input requirements are available in the documentation of maidrr.

## Segmentation of the data

Call `segmentation(fx_vars, data, type, values, max_ngrps = 15)`; the arguments are documented in `?maidrr::segmentation`.

For any feature, let n be the number of unique levels/values observed in the data and y the calculated effect for each level/value. When the feature is split into k groups, let g represent the average value of y within each group. A penalized loss function is defined as follows:

$$ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - g_i \right)^2 + \lambda \log(k) \,. $$

The first part measures how well the effect y is approximated by the grouped variant g, in the form of a mean squared error over all levels/values of the feature. The second part measures the complexity of the grouping by means of the common logarithm of the number of groups k. The penalty parameter $\lambda$ governs the bias-variance trade-off: a low value allows many groups, resulting in an accurate approximation of the effect, while a high value enforces fewer groups, resulting in a coarse approximation. For a specified value of $\lambda$, the optimal number of groups k follows by minimizing the penalized loss function. The actual grouping in maidrr is done via the Ckmeans.1d.dp package.
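To make the loss concrete, here is a minimal sketch for a single feature, assuming a PD tibble with columns x, y and w as above and using Ckmeans.1d.dp for the optimal one-dimensional grouping (maidrr's internals may differ, e.g. in how the observation counts w are used):

```r
library(Ckmeans.1d.dp)

# Penalized loss for grouping the effect y of one feature into k groups.
penalized_loss <- function(pd, k, lambda) {
  grp <- Ckmeans.1d.dp(pd$y, k = k)       # optimal 1D clustering of the effect
  g <- grp$centers[grp$cluster]           # grouped approximation g of y
  mean((pd$y - g)^2) + lambda * log10(k)  # MSE plus complexity penalty
}

# For a given lambda, the optimal number of groups minimizes this loss:
# which.min(sapply(1:15, penalized_loss, pd = fx_vars_be[['ageph']], lambda = 1e-6))
```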

We segment the BE portfolio by specifying a feature-specific number of groups. The grouped features are added to the data with a trailing underscore in their column name:

```r
gr_data_be <- fx_vars_be %>% segmentation(data = mtpl_be,
                                          type = 'ngroups',
                                          values = setNames(c(7, 4, 9, 3, 2, 2, 2, 1, 1, 2, 3, 4, 2, 3),
                                                            unlist(lapply(fx_vars_be, comment))))
head(gr_data_be)
gr_data_be %>% 
  dplyr::select(dplyr::ends_with('_')) %>% 
  dplyr::summarise_all(dplyr::n_distinct)
```

We segment the FR portfolio by specifying a single $\lambda$ value for all features and the interaction:

```r
gr_data_fr <- fx_vars_fr %>% segmentation(data = mtpl_fr,
                                          type = 'lambdas',
                                          values = 0.000001)
gr_data_fr %>% 
  dplyr::select(dplyr::ends_with('_')) %>% 
  dplyr::summarise_all(dplyr::n_distinct)
```

It is possible to use maidrr::plot_pd with maidrr::group_pd to get an idea of the actual grouping of features:

```r
gridExtra::grid.arrange(fx_vars_be[['ageph']] %>% group_pd(ngroups = 7) %>% plot_pd + ggtitle('MTPL BE'), 
                        fx_vars_fr[['brand']] %>% group_pd(ngroups = 5) %>% plot_pd + ggtitle('MTPL FR'),
                        ncol = 2)
```

The same can be done for the two interaction effects that we saw earlier:

```r
gridExtra::grid.arrange(fx_vars_be[['ageph_power']] %>% group_pd(ngroups = 3) %>% plot_pd + ggtitle('MTPL BE'), 
                        fx_vars_fr[['power_fuel']] %>% group_pd(ngroups = 4) %>% plot_pd + ggtitle('MTPL FR'),
                        ncol = 2)
```

The choice of the value(s) for $\lambda$ is the most important aspect in the maidrr procedure, as it will determine the level of segmentation in your data. You can tune this parameter yourself or rely on the maidrr::autotune function, which is introduced later. First, we still need to cover the topic of fitting the actual surrogate model.

### Helper functions

The function segmentation streamlines the grouping process by building on several maidrr helper functions.

These functions can be used on a stand-alone basis to perform your own tailored analysis. Details on the use of these functions and input requirements are available in the documentation of maidrr.

## Fitting a surrogate GLM

Call `surrogate(data, par_list)`; the arguments are documented in `?maidrr::surrogate`.

It is important to only include features with at least two groups in the formula; the rest is captured by the intercept.

Let's fit a surrogate Poisson GLM to the segmented BE portfolio with exposure as offset (same specs as GBM):

```r
features_be <- gr_data_be %>% 
  dplyr::select(dplyr::ends_with('_')) %>% 
  dplyr::summarise_all(~ dplyr::n_distinct(.) > 1) %>% 
  unlist %>% which %>% names
features_be

formula_be <- as.formula(paste('nclaims ~', paste(features_be, collapse = '+')))
formula_be
```

```r
glm_be <- gr_data_be %>% surrogate(par_list = alist(formula = formula_be,
                                                    family = poisson(link = 'log'),
                                                    offset = log(expo)))
glm_be
```

Some conclusions can be drawn quickly and easily from the fitted GLM coefficients.
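For instance, the estimates and their significance can be inspected as for any glm object (a base R call, not a maidrr function):

```r
# Coefficient table with standard errors, z-values and p-values:
summary(glm_be)$coefficients
```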

Let's also fit a surrogate Poisson GLM to the segmented FR portfolio with exposure as offset:

```r
features_fr <- gr_data_fr %>% 
  dplyr::select(dplyr::ends_with('_')) %>% 
  dplyr::summarise_all(~ dplyr::n_distinct(.) > 1) %>% 
  unlist %>% which %>% names
features_fr

formula_fr <- as.formula(paste('nclaims ~', paste(features_fr, collapse = '+')))
formula_fr
```

```r
glm_fr <- gr_data_fr %>% surrogate(par_list = alist(formula = formula_fr,
                                                    family = poisson(link = 'log'),
                                                    offset = log(expo)))
glm_fr
```

Global model conclusions can again be drawn from the fitted GLM coefficients. However, let's turn the focus on explaining predictions for individual observations in the next section.

## Explaining the predictions

Call `explain(surro, instance, plt = TRUE)`; the arguments are documented in `?maidrr::explain`.

Let's explain the prediction made by the BE GLM for two policyholders (ids 34 and 4938). Notice that the prediction for the former is below the intercept of the GLM, while it is more than doubled for the latter. Why is this the case? The low risk for id 34 is mainly thanks to a low bonus-malus level and high age, while driving a high-powered car increases the risk. The high risk for id 4938 is mainly due to a high bonus-malus level and young age. Note that these contributions and confidence intervals are shown on the response scale, after applying the inverse link function to the coefficients. For a Poisson GLM with log-link this implies applying the exp function to the coefficients, with contributions on the multiplicative response scale and a value of 1 indicating no contribution.

```r
exp(coef(glm_be)['(Intercept)'])
glm_be %>% predict(newdata = gr_data_be[34, ], type = 'response') / gr_data_be[34, 'expo']
glm_be %>% predict(newdata = gr_data_be[4938, ], type = 'response') / gr_data_be[4938, 'expo']
```

```r
gridExtra::grid.arrange(glm_be %>% explain(instance = gr_data_be[34, ]) + ggtitle('ID 34'), 
                        glm_be %>% explain(instance = gr_data_be[4938, ]) + ggtitle('ID 4938'),
                        ncol = 2)
```
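The same multiplicative reading applies to all coefficients. One quick way to inspect them together with Wald-type confidence intervals on the response scale (base R only; the intervals in the explain plot may be constructed differently):

```r
# Multiplicative effects with 95% Wald confidence intervals:
exp(cbind(estimate = coef(glm_be), confint.default(glm_be)))
```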

We can follow the same approach for the FR GLM for two policyholders (id 34 and 4938). The prediction for the former is below the GLM intercept, while it is more than doubled for the latter. Why is this the case? The low risk for id 34 is mainly thanks to a low bonus-malus level and driving a car of brand B12 (we have seen this before in the PD effect). The high risk for id 4938 is mainly driven by a high bonus-malus level.

```r
exp(coef(glm_fr)['(Intercept)'])
glm_fr %>% predict(newdata = gr_data_fr[34, ], type = 'response') / gr_data_fr[34, 'expo']
glm_fr %>% predict(newdata = gr_data_fr[4938, ], type = 'response') / gr_data_fr[4938, 'expo']
```

```r
gridExtra::grid.arrange(glm_fr %>% explain(instance = gr_data_fr[34, ]) + ggtitle('ID 34'), 
                        glm_fr %>% explain(instance = gr_data_fr[4938, ]) + ggtitle('ID 4938'),
                        ncol = 2)
```

These are of course just toy examples to illustrate the functionality of maidrr, with rather arbitrary segmentation choices (the values for the number of groups or $\lambda$). Tuning the value of $\lambda$ is very important to obtain competitive prediction models, so maidrr includes a function that automates this task for you: maidrr::autotune.

## Automated lambda tuning via cross-validation

The maidrr workflow from black box to surrogate via `insights %>% segmentation %>% surrogate` is highly dependent on the value(s) of $\lambda$ supplied to maidrr::segmentation, as the grouping and selection of features follow entirely from $\lambda$ via the penalized loss function. An ad-hoc choice is likely to result in a surrogate that is not competitive with the original black box model, so it is important to choose the value(s) of $\lambda$ in an optimal way. To automate this tuning task, maidrr contains the autotune function, which iterates over a grid of $\lambda$'s, calculates cross-validation errors for each grid value, and returns an optimal surrogate GLM.

Call `autotune(mfit, data, vars, target, max_ngrps = 15, hcut = 0.75, ignr_intr = NULL, pred_fun = NULL, lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))), nfolds = 5, strat_vars = NULL, glm_par = alist(), err_fun = mse, ncores = -1, out_pds = FALSE)`; the arguments are documented in `?maidrr::autotune`.
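As a quick sanity check on the default grid of $\lambda$ values:

```r
lambdas <- as.vector(outer(seq(1, 10, 0.1), 10^(-7:3)))
length(lambdas)  # 1001 candidate values
range(lambdas)   # from 1e-07 up to 1e+04
```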

Let's autotune the $\lambda$ parameter on the BE portfolio:

```r
set.seed(5678)
tune_be <- gbm_be %>% autotune(data = mtpl_be,
                               vars = gbm_be$var.names,
                               target = 'nclaims',
                               hcut = 0.75,
                               pred_fun = gbm_fun,
                               lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))),
                               nfolds = 5,
                               strat_vars = c('nclaims', 'expo'),
                               glm_par = alist(family = poisson(link = 'log'),
                                               offset = log(expo)),
                               err_fun = poi_dev,
                               ncores = -1)
```
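The call above uses err_fun = poi_dev, the Poisson deviance shipped with maidrr. For intuition, a stand-alone version of the average Poisson deviance could look as follows (a sketch only; the actual signature of maidrr::poi_dev may differ, e.g. by including weights):

```r
# Average Poisson deviance between observed counts y and predictions mu.
# By convention, the y * log(y / mu) term vanishes for y == 0.
poisson_deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
}
```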

The output of maidrr::autotune is a list; the individual elements are described in `?maidrr::autotune`:

```r
tune_be
```

These are the optimal groupings obtained for the effects that were shown earlier in this vignette:

```r
gridExtra::grid.arrange(fx_vars_be[['ageph']] %>% group_pd(ngroups = tune_be$slct_feat['ageph']) %>% plot_pd + ggtitle('MTPL BE'), 
                        fx_vars_be[['ageph_power']] %>% group_pd(ngroups = tune_be$slct_feat['ageph_power']) %>% plot_pd + ggtitle('MTPL BE'),
                        ncol = 2)
```

