Generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The underlying model is still linear, but it is related to the response variable via a link function.
This type of model is implemented in many software packages, including R
base glm
function. However, in insurance setting, the methodology for
building GLM models is sometimes different from the usual use. While
there’s certainly an opportunity for automation to narrow down the
predictor space, the core predictors are usually selected very carefully
with almost manual approach. It is considered important to inspect every
predictor from multiple angles and make sure it has both statistical and
business significance.
This approach is usually carried out through the use of specialized commercial software products, which offer graphical user interface and limit the possibility of any automation or customization. The insuRglm package aims to provide an opensource and transparent alternative, which can be integrated into existing R workflows, thus allowing bigger degree of automation and customization.
To illustrate the use of this package, we will use modified subset of insurance data available at http://www.acst.mq.edu.au/GLMsforInsuranceData. Our response (or target) variable will be claim severity, while the list of potential predictors includes vehicle value, vehicle age, vehicle body, area and driver’s age category. Let’s inspect the dataset first.
data('sev_train')
head(sev_train)
#> pol_nbr pol_yr exposure premium gender agecat area veh_body veh_age veh_value numclaims sev
#> 1 195816 2000 0.11498973 411.06572 F 1 D STNWG 1 (2.44,34.6] 1 200.00
#> 2 409083 2003 0.85694730 70.55145 M 6 A SEDAN 4 [0,0.9] 1 390.00
#> 3 427808 2003 0.08487337 92.67664 M 1 B HBACK 3 [0,0.9] 1 389.95
#> 4 484861 2001 0.24366872 235.18832 F 4 C STNWG 2 (2.44,34.6] 1 1997.84
#> 5 102484 2002 0.44626968 382.87836 M 4 E STNWG 3 (2.44,34.6] 1 1075.01
#> 6 282811 2000 0.62696783 609.22254 M 4 F UTE 1 (2.44,34.6] 1 28012.83
First step of using the package should always be the one-time setup
function.
setup <- setup(
data_train = sev_train,
target = 'sev',
weight = 'numclaims',
family = 'gamma',
keep_cols = c('pol_nbr', 'exposure', 'premium')
)
#> [1] "Setup - OK"
#> [1] ""
#> [1] "Train Data:"
#> [1] "Number of Observations: 3699"
#> [1] "Weighted Average Target: 1865.06"
#> [1] "Max. Target: 46868.18"
#> [1] "Min. Target: 200"
#> [1] ""
Note that if you don’t specify the simple_factors
argument, all the
dataset columns other than target
, weight
and keep_cols
will be
considered simple_factors
. These should have less than 255 unique
values and will be converted to factor
class. For working with other
target distributions, please see the documentation of setup
function.
Basic information about the target variable is already printed to the
console by the setup
function. If we wish to explore little bit
further, it is possible by using explore_target
function.
explore_target(setup)
Some target distributions may contain a lot of zeros or low values,
therefore it might be desirable to exclude them from the analysis, or
limit the visualized data to specific quantile range. This is done
through arguments exlude_zero
, lower_quantile
or upper_quantile
.
explore_target(setup, lower_quantile = 0.05, upper_quantile = 0.95)
In case we want to see the same information in textual form, we can do
it with type = 'tabular'
.
explore_target(setup, type = 'tabular', lower_quantile = 0.05, upper_quantile = 0.95)
#> $train
#> 5.0% 9.5% 14.0% 18.5% 23.0% 27.5% 32.0% 36.5% 41.0% 45.5% 50.0% 54.5% 59.0%
#> 200.0000 200.0000 200.0000 295.1551 353.7700 353.8000 389.9500 408.9500 480.0000 576.5054 674.6400 785.1999 920.4734
#> 63.5% 68.0% 72.5% 77.0% 81.5% 86.0% 90.5% 95.0%
#> 1067.6950 1261.4188 1498.5665 1773.6818 2190.2062 2798.3678 3574.3797 4755.5450
We can also explore the potential predictors in either visual or tabular form. The results will show the weighted average of target variable across the levels of the corresponding predictor.
Let’s look at the visual form for the single predictor.
explore_data(setup, factors = 'agecat')
Alternatively, let’s inspect three predictors in tabular form. Leaving
the factors
argument blank will create these results for all available
potential predictors.
explore_data(setup, type = 'tabular', factors = c('pol_yr', 'agecat', 'veh_body'))
#> $pol_yr
#> # A tibble: 5 x 4
#> pol_yr weight_sum target_sum target_avg
#> <fct> <int> <dbl> <dbl>
#> 1 2000 791 1451350. 1925.
#> 2 2001 785 1493467. 1971.
#> 3 2002 785 1266940. 1728.
#> 4 2003 821 1421615. 1831.
#> 5 2004 779 1392400. 1871.
#>
#> $agecat
#> # A tibble: 6 x 4
#> agecat weight_sum target_sum target_avg
#> <fct> <int> <dbl> <dbl>
#> 1 1 423 1078124. 2661.
#> 2 2 804 1474176. 1934.
#> 3 3 944 1574173. 1763.
#> 4 4 975 1613592. 1747.
#> 5 5 512 804438. 1631.
#> 6 6 303 481270. 1667.
#>
#> $veh_body
#> # A tibble: 13 x 4
#> veh_body weight_sum target_sum target_avg
#> <fct> <int> <dbl> <dbl>
#> 1 BUS 9 11759. 1430.
#> 2 CONVT 2 6656. 3328.
#> 3 COUPE 64 130713. 2338.
#> 4 HBACK 1047 1955833. 1945.
#> 5 HDTOP 111 193516. 1826.
#> 6 MCARA 14 9860. 736.
#> 7 MIBUS 39 79541. 2101.
#> 8 PANVN 52 102250. 2040.
#> 9 RDSTR 3 785. 456.
#> 10 SEDAN 1294 2057836. 1679.
#> 11 STNWG 1007 1785463. 1865.
#> 12 TRUCK 107 234310. 2367.
#> 13 UTE 212 457254. 2243.
We can also get a two-way view by using the by
argument.
explore_data(setup, factors = 'agecat', by = 'pol_yr')
explore_data(setup, type = 'tabular', factors = c('agecat', 'veh_body'), by = 'pol_yr')
#> $agecat
#> # A tibble: 30 x 5
#> agecat pol_yr weight_sum target_sum target_avg
#> <fct> <fct> <int> <dbl> <dbl>
#> 1 1 2000 85 255978. 3117.
#> 2 1 2001 91 285453. 3143.
#> 3 1 2002 85 160858. 2173.
#> 4 1 2003 93 184741. 2029.
#> 5 1 2004 69 191094. 2914.
#> 6 2 2000 148 235461. 1686.
#> 7 2 2001 155 327748. 2187.
#> 8 2 2002 166 243134. 1618.
#> 9 2 2003 174 377382. 2287.
#> 10 2 2004 161 290450. 1864.
#> # ... with 20 more rows
#>
#> $veh_body
#> # A tibble: 59 x 5
#> veh_body pol_yr weight_sum target_sum target_avg
#> <fct> <fct> <int> <dbl> <dbl>
#> 1 BUS 2000 2 7412. 3706.
#> 2 BUS 2001 1 372. 372.
#> 3 BUS 2002 1 462. 462.
#> 4 BUS 2003 2 1523. 761.
#> 5 BUS 2004 3 1990. 1035.
#> 6 CONVT 2001 1 530 530
#> 7 CONVT 2004 1 6126. 6126.
#> 8 COUPE 2000 19 21077. 1409.
#> 9 COUPE 2001 9 25008. 2779.
#> 10 COUPE 2002 11 20881. 2800.
#> # ... with 49 more rows
The setup
function produces an object with class setup. Such object is
usually the first argument of functions in this package. Moreover, most
of them also return this object, with modified attributes and
sub-objects. This makes the functions fully compatible with %>%
operator, as we can see in the following example.
setup %>% explore_target(type = 'tabular', n_cuts = 5)
#> $train
#> 0% 20% 40% 60% 80% 100%
#> 200.00 345.00 497.05 1052.15 2493.90 46868.18
This becomes even more useful when we start with the modeling workflow. These functions will be explained shortly.
setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit()
#> [1] "Target: sev"
#> [1] "Weight: numclaims"
#> [1] "Actual Predictors: pol_yr, agecat"
#> [1] "Available Factors: gender, area, veh_body, veh_age, veh_value"
Model formula defines the structure of the GLM model. The target is
fixed after the creation of setup object, however, we can add predictors
using factor_add
.
setup %>%
factor_add(pol_yr) %>%
factor_add(agecat)
#> [1] "Target: sev"
#> [1] "Weight: numclaims"
#> [1] "Actual Predictors: pol_yr, agecat"
#> [1] "Available Factors: gender, area, veh_body, veh_age, veh_value"
If we ever decide to remove the predictor from the current model formula, we can do so, without having to modify the preceding code.
setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit() %>%
factor_remove(agecat)
#> [1] "Target: sev"
#> [1] "Weight: numclaims"
#> [1] "Actual Predictors: pol_yr"
#> [1] "Available Factors: gender, agecat, area, veh_body, veh_age, veh_value"
We can modify the model formula at any stage of the workflow, however, the model has to be fit (or re-fit) for this change to take effect.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit()
We can visualize model predictors from the last fitted model by using
model_visualize
.
modeling %>%
model_visualize(factors = 'fitted')
We can also inspect the unfitted variables.
modeling %>%
model_visualize(factors = 'unfitted')
Since the package currently supports only categorical predictors, every
category of each predictor is fitted as a separate dummy variable. We
can use factor_modify
to decrease the model complexity by creating
custom_factor
or variate
.
Custom factor will still remain categorical, but some of the levels will be merged together, based on the mapping that user provides. This mapping has to be of the same length as number of unique levels of the corresponding predictor. Assigning the same number to two different levels will merge them together. This is usually done with categorical variables where order of levels doesn’t matter.
In this example, areas ‘A’ and ‘D’ will be merged together. Also areas ‘E’ and ‘F’ will be merged together with.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(area) %>%
factor_modify(area = custom_factor(area, mapping = c(1, 2, 3, 1, 4, 4))) %>%
model_fit()
Variate, on the other hand, will be converted to a numeric variable, simplifying the predictor to only one coefficient (supposing the polynomial degree is 1). This is usually done with originally continuous variables.
Non-proportional variate (type = 'non_prop')
is usually used when the
distances between the categorical levels of a predictor are numerically
similar. In this case a mapping vector has to be provided. These values
will be substituted instead of the original values.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>%
model_fit()
On the other hand, proportional variate (type = 'prop'
) is best used
when the distances between the categorical levels of a predictor are
significantly different. In this case, the mapping will be created
automatically.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(veh_value) %>%
factor_modify(veh_value = variate(veh_value, type = 'prop')) %>%
model_fit()
However, it is required that the names of original levels contain a numeric range, as below.
setup %>%
explore_data(type = 'tabular', factors = 'veh_value')
#> $veh_value
#> # A tibble: 5 x 4
#> veh_value weight_sum target_sum target_avg
#> <fct> <int> <dbl> <dbl>
#> 1 [0,0.9] 654 1259035. 2016.
#> 2 (0.9,1.32] 743 1295165. 1818.
#> 3 (1.32,1.71] 816 1368978. 1760.
#> 4 (1.71,2.44] 853 1548348. 1935.
#> 5 (2.44,34.6] 895 1554247. 1822.
After simplifying a model predictor, it might be useful to visualize
changes. We can compare the current (latest) model to a previous
reference model by using model_visualize
with ref_models
parameters.
We can use model_save
at any point of the workflow, to save a
reference model. Note, that each model has to be fit using model_fit
before saving.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit() %>%
model_save('model1') %>%
factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>%
model_fit()
We can compare actual values against the fitted values of the comparison models by using this code.
modeling %>%
model_visualize(ref_models = 'model1')
If we wish to compare model predictions at base levels, we can do so by
changing the y_axis
to linear
modeling %>%
model_visualize(y_axis = "linear", ref_models = 'model1')
If we ever decide to discard current model and revert to an older one
(saved by model_save
), we can do so, by using model_revert
. This
might also be useful if we want to run functions that work only on the
current (latest) model in the workflow, like model_visualize
.
Moreover, this also helps to keep the workflow documented, instead of
rewriting previous code.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit() %>%
model_save('model1') %>%
factor_modify(agecat = variate(agecat, type = 'non_prop', mapping = c(1, 2, 3, 4, 5, 6))) %>%
model_fit() %>%
model_revert(to = 'model1') # from now on the two lines above have no effect (but they stay documented)
If we wish to inspect beta coefficients of the current (latest) model,
we can do so by using model_betas
function.
modeling %>%
model_betas()
#> # A tibble: 10 x 5
#> factor actual_level estimate std_error std_error_pct
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 (Intercept) (Intercept) 7.45 0.0774 1%
#> 2 pol_yr 2000 0.0459 0.0868 189%
#> 3 pol_yr 2001 0.0595 0.0869 146%
#> 4 pol_yr 2002 -0.0527 0.0869 165%
#> 5 pol_yr 2004 0.0278 0.0871 313%
#> 6 agecat 1 0.418 0.101 24%
#> 7 agecat 2 0.104 0.0830 79%
#> 8 agecat 3 0.0147 0.0795 541%
#> 9 agecat 5 -0.0682 0.0950 139%
#> 10 agecat 6 -0.0459 0.115 249%
If we want to instead look at the differences between coefficients of
each predictor, we can do so by using setting triangles = TRUE
.
modeling %>%
model_betas(triangles = TRUE)
#> $pol_yr
#> # A tibble: 5 x 6
#> pol_yr `2003` `2000` `2001` `2002` `2004`
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2003 "" "" "" "" ""
#> 2 2000 "189%" "" "" "" ""
#> 3 2001 "146%" "647%" "" "" ""
#> 4 2002 "165%" "89%" "78%" "" ""
#> 5 2004 "313%" "485%" "278%" "109%" ""
#>
#> $agecat
#> # A tibble: 6 x 7
#> agecat `4` `1` `2` `3` `5` `6`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 4 "" "" "" "" "" ""
#> 2 1 "24%" "" "" "" "" ""
#> 3 2 "79%" "33%" "" "" "" ""
#> 4 3 "541%" "25%" "93%" "" "" ""
#> 5 5 "139%" "24%" "57%" "115%" "" ""
#> 6 6 "249%" "28%" "78%" "190%" "567%" ""
We can produce a lift chart showing the comparison between actual and predicted values of target variable across groups of ordered observations. This can be done either for the latest model or for all saved models in the workflow. Note that, by default, the predictions will be created once and using the full training dataset.
modeling <- setup %>%
factor_add(pol_yr) %>%
factor_add(agecat) %>%
model_fit() %>%
model_save('model1') %>%
factor_add(veh_value) %>%
model_fit() %>%
model_save('model2') %>%
factor_add(veh_age) %>%
model_fit()
modeling %>%
model_lift(model = 'current') # can be also 'all'
modeling %>%
model_lift(model = 'current', buckets = 5)
Crossvalidation lets us assess the model performance more realistically,
but it’s computationally more intensive. Using the model_crossval
will
trigger creation of multiple datasets and re-fitting of each model
structure on all of them. Each record will be scored by a model trained
on a dataset that didn’t include that specific record.
modeling_cv <- modeling %>%
model_crossval(cv_folds = 10, stratified = FALSE) # this is also the default
We can now look on the lift charts based on the crossvalidated predictions.
modeling_cv %>%
model_lift(data = 'crossval', model = 'current')
modeling_cv %>%
model_lift(data = 'crossval', model = 'current', buckets = 5)
We can use model_compare
to compare the performance (or significance)
of multiple models present within the workflow.
If we want to do a nested model significance test, we can do so by
specifying type = 'nested_model_test'
. Note that here we don’t really
need crossvalidated predictions.
modeling_cv %>%
model_compare(type = 'nested_model_test')
#> # A tibble: 4 x 5
#> model intercept_model model1 model2 current_model
#> <chr> <chr> <chr> <chr> <chr>
#> 1 intercept_model "" "" "" ""
#> 2 model1 "Pr(>F) = 0.00074" "" "" ""
#> 3 model2 "Pr(>F) = 0.00177" "Pr(>F) = 0.33055" "" ""
#> 4 current_model "Pr(>F) = 0.00355" "Pr(>F) = 0.40076" "Pr(>F) = 0.43922" ""
We can also look at the RMSE metric across all the models. If we have
crossvalidated predictions available, the crossvalidation performance
will also appear. By default, the predictions are ordered and grouped
into 10 buckets (like for lift chart). We can change this behaviour by
using buckets
argument.
modeling_cv %>%
model_compare(type = 'rmse')
modeling_cv %>%
model_compare(type = 'rmse', buckets = 20)
If we are satisfied with the current (latest) model, we can export it and create a xlsx file. The spreadsheet will contain the charts, as well as relativities and weights for each predictor included in the model.
modeling %>%
model_export('export_test.xlsx', overwrite = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.