knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 3.5, fig.width = 6, fig.path = "../man/figures/insuRglm_vignette-" ) library(insuRglm) library(magrittr)
Generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The underlying model is still linear, but it is related to the response variable via a link function.
This type of model is implemented in many software packages, including R base glm
function. However, in insurance setting, the methodology for building GLM models is sometimes different from the usual use. While there's certainly an opportunity for automation to narrow down the predictor space, the core predictors are usually selected very carefully with almost manual approach. It is considered important to inspect every predictor from multiple angles and make sure it has both statistical and business significance.
This approach is usually carried out through the use of specialized commercial software products, which offer graphical user interface and limit the possibility of any automation or customization. The insuRglm package aims to provide an opensource and transparent alternative, which can be integrated into existing R workflows, thus allowing bigger degree of automation and customization.
To illustrate the use of this package, we will use modified subset of insurance data available at http://www.acst.mq.edu.au/GLMsforInsuranceData. Our response (or target) variable will be claim severity, while the list of potential predictors includes vehicle value, vehicle age, vehicle body, area and driver's age category. Let's inspect the dataset first.
data('sev_train') head(sev_train)
First step of using the package should always be the one-time setup
function.
setup <- setup( data_train = sev_train, target = 'sev', weight = 'numclaims', family = 'gamma', keep_cols = c('pol_nbr', 'exposure', 'premium') )
Note that if you don't specify the simple_factors
argument, all the dataset columns other than target
, weight
and keep_cols
will be considered simple_factors
. These should have less than 255 unique values and will be converted to factor
class. For working with other target distributions, please see the documentation of setup
function.
Basic information about the target variable is already printed to the console by the setup
function. If we wish to explore little bit further, it is possible by using explore_target
function.
explore_target(setup)
Some target distributions may contain a lot of zeros or low values, therefore it might be desirable to exclude them from the analysis, or limit the visualized data to specific quantile range. This is done through arguments exlude_zero
, lower_quantile
or upper_quantile
.
explore_target(setup, lower_quantile = 0.05, upper_quantile = 0.95)
In case we want to see the same information in textual form, we can do it with type = 'tabular'
.
explore_target(setup, type = 'tabular', lower_quantile = 0.05, upper_quantile = 0.95)
We can also explore the potential predictors in either visual or tabular form. The results will show the weighted average of target variable across the levels of the corresponding predictor.
Let's look at the visual form for the single predictor.
explore_data(setup, factors = 'agecat')
Alternatively, let's inspect three predictors in tabular form. Leaving the factors
argument blank will create these results for all available potential predictors.
explore_data(setup, type = 'tabular', factors = c('pol_yr', 'agecat', 'veh_body'))
We can also get a two-way view by using the by
argument.
explore_data(setup, factors = 'agecat', by = 'pol_yr')
explore_data(setup, type = 'tabular', factors = c('agecat', 'veh_body'), by = 'pol_yr')
The setup
function produces an object with class setup. Such object is usually the first argument of functions in this package. Moreover, most of them also return this object, with modified attributes and sub-objects. This makes the functions fully compatible with %>%
operator, as we can see in the following example.
setup %>% explore_target(type = 'tabular', n_cuts = 5)
This becomes even more useful when we start with the modeling workflow. These functions will be explained shortly.
setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit()
Model formula defines the structure of the GLM model. The target is fixed after the creation of setup object, however, we can add predictors using factor_add
.
setup %>% factor_add(pol_yr) %>% factor_add(agecat)
If we ever decide to remove the predictor from the current model formula, we can do so, without having to modify the preceding code.
setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit() %>% factor_remove(agecat)
We can modify the model formula at any stage of the workflow, however, the model has to be fit (or re-fit) for this change to take effect.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit()
We can visualize model predictors from the last fitted model by using model_visualize
.
modeling %>% model_visualize(factors = 'fitted')
We can also inspect the unfitted variables.
modeling %>% model_visualize(factors = 'unfitted')
Since the package currently supports only categorical predictors, every category of each predictor is fitted as a separate dummy variable. We can use factor_modify
to decrease the model complexity by creating custom_factor
or variate
.
Custom factor will still remain categorical, but some of the levels will be merged together, based on the mapping that user provides. This mapping has to be of the same length as number of unique levels of the corresponding predictor. Assigning the same number to two different levels will merge them together. This is usually done with categorical variables where order of levels doesn't matter.
In this example, areas 'A' and 'D' will be merged together. Also areas 'E' and 'F' will be merged together with.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(area) %>% factor_modify(area = custom_factor(area, mapping = c(1, 2, 3, 1, 4, 4))) %>% model_fit()
Variate, on the other hand, will be converted to a numeric variable, simplifying the predictor to only one coefficient (supposing the polynomial degree is 1). This is usually done with originally continuous variables.
Non-proportional variate (type = 'non_prop')
is usually used when the distances between the categorical levels of a predictor are numerically similar. In this case a mapping vector has to be provided. These values will be substituted instead of the original values.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>% model_fit()
On the other hand, proportional variate (type = 'prop'
) is best used when the distances between the categorical levels of a predictor are significantly different. In this case, the mapping will be created automatically.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(veh_value) %>% factor_modify(veh_value = variate(veh_value, type = 'prop')) %>% model_fit()
However, it is required that the names of original levels contain a numeric range, as below.
setup %>% explore_data(type = 'tabular', factors = 'veh_value')
After simplifying a model predictor, it might be useful to visualize changes. We can compare the current (latest) model to a previous reference model by using model_visualize
with ref_models
parameters. We can use model_save
at any point of the workflow, to save a reference model. Note, that each model has to be fit using model_fit
before saving.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit() %>% model_save('model1') %>% factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>% model_fit()
We can compare actual values against the fitted values of the comparison models by using this code.
modeling %>% model_visualize(ref_models = 'model1')
If we wish to compare model predictions at base levels, we can do so by changing the y_axis
to linear
modeling %>% model_visualize(y_axis = "linear", ref_models = 'model1')
If we ever decide to discard current model and revert to an older one (saved by model_save
), we can do so, by using model_revert
. This might also be useful if we want to run functions that work only on the current (latest) model in the workflow, like model_visualize
. Moreover, this also helps to keep the workflow documented, instead of rewriting previous code.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit() %>% model_save('model1') %>% factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>% model_fit() %>% model_revert(to = 'model1') # from now on the two lines above have no effect (but they stay documented)
If we wish to inspect beta coefficients of the current (latest) model, we can do so by using model_betas
function.
modeling %>% model_betas()
If we want to instead look at the differences between coefficients of each predictor, we can do so by using setting triangles = TRUE
.
modeling %>% model_betas(triangles = TRUE)
We can produce a lift chart showing the comparison between actual and predicted values of target variable across groups of ordered observations. This can be done either for the latest model or for all saved models in the workflow. Note that, by default, the predictions will be created once and using the full training dataset.
modeling <- setup %>% factor_add(pol_yr) %>% factor_add(agecat) %>% model_fit() %>% model_save('model1') %>% factor_add(veh_value) %>% model_fit() %>% model_save('model2') %>% factor_add(veh_age) %>% model_fit() modeling %>% model_lift(model = 'current') # can be also 'all' modeling %>% model_lift(model = 'current', buckets = 5)
Crossvalidation lets us assess the model performance more realistically, but it's computationally more intensive. Using the model_crossval
will trigger creation of multiple datasets and re-fitting of each model structure on all of them. Each record will be scored by a model trained on a dataset that didn't include that specific record.
modeling_cv <- modeling %>% model_crossval(cv_folds = 10, stratified = FALSE) # this is also the default
We can now look on the lift charts based on the crossvalidated predictions.
modeling_cv %>% model_lift(data = 'crossval', model = 'current') modeling_cv %>% model_lift(data = 'crossval', model = 'current', buckets = 5)
We can use model_compare
to compare the performance (or significance) of multiple models present within the workflow.
If we want to do a nested model significance test, we can do so by specifying type = 'nested_model_test'
. Note that here we don't really need crossvalidated predictions.
modeling_cv %>% model_compare(type = 'nested_model_test')
We can also look at the RMSE metric across all the models. If we have crossvalidated predictions available, the crossvalidation performance will also appear. By default, the predictions are ordered and grouped into 10 buckets (like for lift chart). We can change this behaviour by using buckets
argument.
modeling_cv %>% model_compare(type = 'rmse') modeling_cv %>% model_compare(type = 'rmse', buckets = 20)
If we are satisfied with the current (latest) model, we can export it and create a xlsx file. The spreadsheet will contain the charts, as well as relativities and weights for each predictor included in the model.
modeling %>% model_export('export_test.xlsx', overwrite = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.