```r
knitr::opts_chunk$set(collapse = TRUE, comment = '#>', cache = TRUE, cache.lazy = FALSE)
library(magrittr)
library(ggplot2)
library(distRforest)
```
The use of `distRforest` will be illustrated with the `ausprivauto0405` dataset from the package `CASdatasets`:
> Third party insurance is a compulsory insurance for vehicle owners in Australia. It insures vehicle owners against injury caused to other drivers, passengers or pedestrians, as a result of an accident. The `ausprivauto0405` dataset is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67856 policies, of which 4624 had at least one claim.
```r
library(CASdatasets)
data(ausprivauto0405)
```
The `ausprivauto0405` dataset is a `r class(ausprivauto0405)` with `r nrow(ausprivauto0405)` observations and `r ncol(ausprivauto0405)` variables (`r names(ausprivauto0405)`):
```r
str(ausprivauto0405)
```
Variables of interest are introduced when needed. For a full description see `?CASdatasets::ausprivauto0405`.
This section introduces the functions to build a random forest and make predictions from it. Afterwards, examples of binary classification, Poisson regression and gamma regression illustrate how to use them.
To build a random forest with the `distRforest` package, call the function

```r
rforest(formula, data, method, weights = NULL, parms = NULL, control = NULL,
        ncand, ntrees, subsample = 1, track_oob = FALSE,
        keep_data = FALSE, red_mem = FALSE)
```

with the following arguments:
- `formula`: object of the class `formula` with a symbolic description for the model to be fitted, of the form `response ~ var1 + var2 + var3`, without interactions. Please refrain from applying transformation functions to the response; add the transformed variable to the `data` beforehand instead. Two exceptions exist, see `method = 'poisson'` and `method = 'exp'` below.
- `data`: data frame containing the training data observations.
- `method`: string specifying the type of forest to build. Options are:
  - `'class'`: classification forest.
  - `'anova'`: standard regression forest with a squared error loss.
  - `'poisson'`: Poisson regression forest for count data. The left-hand side of `formula` can be specified as `cbind(observation_time, number_of_events)` to include time exposures.
  - `'gamma'`: gamma regression forest for strictly positive long-tailed data.
  - `'lognormal'`: lognormal regression forest for strictly positive long-tailed data.
  - `'exp'`: exponential scaling for survival data. The left-hand side of `formula` is specified as `Surv(observation_time, event_indicator)` to include time exposures.
- `weights`: optional name of the variable in `data` to use as case weights. Either a string or simply the variable name should work.
- `parms`: optional parameters for the splitting function, see `?distRforest::rpart` for the details and allowed options.
- `control`: list of options that control the fitting details of the individual `rpart` trees. Use `distRforest::rpart.control` to set this up.
- `ncand`: integer specifying the number of randomly chosen variable candidates to consider at each node to find the optimal split.
- `ntrees`: integer specifying the number of trees in the ensemble.
- `subsample`: numeric in the range [0, 1]. Each tree in the ensemble is built on randomly sampled data of size `subsample * nrow(data)`.
- `track_oob`: boolean indicating whether the out-of-bag errors should be tracked (`TRUE`) or not (`FALSE`). This option is not implemented for `method = 'exp'` or multi-class classification. For the other methods, the following errors are tracked, each evaluated in a weighted version if `weights` are supplied:
  - `class`: Matthews correlation coefficient for binary classification.
  - `anova`: mean squared error.
  - `poisson`: Poisson deviance.
  - `gamma`: gamma deviance.
  - `lognormal`: mean squared error.
- `keep_data`: boolean indicating whether the `data` should be saved with the fit. It is not advised to set this to `TRUE` for large data sets.
- `red_mem`: boolean indicating whether to reduce the memory footprint of the `rpart` trees by eliminating non-essential elements from the fits. It is advised to set this to `TRUE` for large values of `ntrees`.

The function returns an object of class `rforest`, which is a list containing the following elements:
- `trees`: list of length equal to `ntrees`, containing the individual `rpart` trees in the forest.
- `oob_error`: numeric vector of length equal to `ntrees`, containing the OOB error at each iteration (if `track_oob = TRUE`).
- `data`: the training `data` (if `keep_data = TRUE`).

Predictions from a random forest can be retrieved via the generic `predict` function, which will call `predict.rforest(object, newdata)` with the following arguments:

- `object`: fitted model object of the class `rforest`.
- `newdata`: data frame containing the observations to predict. This argument can only be missing when the random forest in `object` is trained with `keep_data = TRUE`. In that case, the original training data will be used to generate predictions.

The function returns a numeric vector containing a prediction for each observation. A majority vote among the individual trees is taken for a binary classification forest, while the predictions of the individual trees are averaged for anova, Poisson, gamma and lognormal regression forests.
Assume that you want to model which type of policyholder in the portfolio is more likely to file a claim. The variable `ClaimOcc` in the `ausprivauto0405` data has the value `1` for policyholders who filed a claim and `0` otherwise. An insurance claim is an unlikely event, as most policyholders do not file a claim:
```r
ausprivauto0405$ClaimOcc %>% table %>% prop.table
```
It is important that your binary classification response is a numeric or a factor with the values 0/1 for the negative/positive class to make sure that everything runs smoothly!
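If your response is coded differently, recode it before fitting. A minimal sketch, assuming a hypothetical data frame `df` with a factor response `y` whose positive level is `'yes'`:

```r
# Hypothetical example: recode a factor response with levels 'no'/'yes'
# to the numeric 0/1 coding expected for binary classification
df$y <- as.integer(df$y == 'yes')
```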
Let's build a binary classification forest for claim occurrence on a (naively) balanced dataset:
```r
# Balance the data
ausprivauto0405_balanced <- rbind(ausprivauto0405[ausprivauto0405$ClaimOcc == 1, ],
                                  ausprivauto0405[ausprivauto0405$ClaimOcc == 0, ][1:5000, ])
# Build the random forest
set.seed(54321)
rf_class <- rforest(formula = ClaimOcc ~ VehValue + VehAge + VehBody + Gender + DrivAge,
                    data = ausprivauto0405_balanced,
                    method = 'class',
                    control = rpart.control(minsplit = 20, cp = 0, xval = 0, maxdepth = 5),
                    ncand = 3, ntrees = 200, subsample = 0.5,
                    track_oob = TRUE, keep_data = TRUE, red_mem = TRUE)
```
The fit is of the class `rforest`, which is a list containing the individual `trees`, the `oob_error` and the `data`.
```r
class(rf_class)
names(rf_class)
rf_class[['trees']][[1]]
```
The OOB error evolution (`track_oob = TRUE` in `rforest`) shows an increasing trend in the Matthews correlation coefficient, which means that the classification is improving over the iterations:
```r
oob_df <- data.frame('iteration' = seq_len(length(rf_class[['oob_error']])),
                     'oob_error' = rf_class[['oob_error']])
ggplot(oob_df, aes(x = iteration, y = oob_error)) + geom_point()
```
Sidenote: Matthews correlation coefficient is chosen because this measure takes into account all four elements of the confusion matrix. Measures like accuracy, precision, recall or the F1 score ignore at least one of them.
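For reference, the coefficient combines all four counts into a single number between -1 and 1. A minimal sketch of the computation (`mcc` is an illustrative helper, not part of the package):

```r
# Matthews correlation coefficient from the four confusion matrix counts
mcc <- function(tp, fp, tn, fn) {
  (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}
```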
Predictions from the random forest can be compared to the true values to assess performance. A fair number of observations are classified incorrectly, but this is likely driven by the limited number of iterations and variables involved to model claim occurrence. Note that there is no need to specify `newdata` in `predict` as `keep_data = TRUE` in `rforest`. If `keep_data = FALSE`, then `newdata = ausprivauto0405_balanced` is needed.
```r
pred_df <- data.frame('true' = ausprivauto0405_balanced$ClaimOcc,
                      'pred' = predict(rf_class))
sprintf('True positives: %i', sum(pred_df$true == 1 & pred_df$pred == 1))
sprintf('False positives: %i', sum(pred_df$true == 0 & pred_df$pred == 1))
sprintf('True negatives: %i', sum(pred_df$true == 0 & pred_df$pred == 0))
sprintf('False negatives: %i', sum(pred_df$true == 1 & pred_df$pred == 0))
```
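The same four counts can also be read off a cross-tabulation in a single call:

```r
# Confusion matrix as a cross-tabulation (rows: observed, columns: predicted)
with(pred_df, table(true, pred))
```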
Although most policyholders do not file a claim in the portfolio, some of them file more than one claim. The variable `ClaimNb` in the `ausprivauto0405` data contains the number of claims filed by a specific policyholder. The variable `Exposure` contains the fraction of the year that a policyholder was covered by the policy and therefore exposed to the risk of filing a claim. This information should be taken into account, as filing a claim during one year or one month of exposure represents a different risk.
```r
ausprivauto0405$ClaimNb %>% table %>% prop.table
ausprivauto0405$Exposure %>% quantile(probs = seq(0, 1, 0.2))
```
Let's build a Poisson regression forest which takes the exposure into account via `cbind` in the `formula`:
```r
# Build the random forest
set.seed(54321)
rf_poiss <- rforest(formula = cbind(Exposure, ClaimNb) ~ VehValue + VehAge + VehBody + Gender + DrivAge,
                    data = ausprivauto0405,
                    method = 'poisson',
                    parms = list('shrink' = 10000000),
                    control = rpart.control(minsplit = 20, cp = 0, xval = 0, maxdepth = 5),
                    ncand = 3, ntrees = 200, subsample = 0.5,
                    track_oob = TRUE, keep_data = TRUE, red_mem = TRUE)
```
The fit is of the class `rforest`, which is a list containing the individual `trees`, the `oob_error` and the `data`.
```r
class(rf_poiss)
names(rf_poiss)
rf_poiss[['trees']][[1]]
```
The OOB error evolution (`track_oob = TRUE` in `rforest`) shows a decreasing trend in the Poisson deviance, which means that the predictions for the claim numbers are improving over the iterations:
```r
oob_df <- data.frame('iteration' = seq_len(length(rf_poiss[['oob_error']])),
                     'oob_error' = rf_poiss[['oob_error']])
ggplot(oob_df, aes(x = iteration, y = oob_error)) + geom_point()
```
Predictions from the random forest can be compared to the true values to assess performance. Note that predictions from a Poisson forest are given on the scale of full time exposure (i.e., setting `Exposure = 1` in our case), so you need to multiply the predictions by the observed `Exposure` values. Policyholders are split in 5 groups based on their predicted values, going from low to high risk, and the mean of the observed number of claims is calculated per group. The increasing trend shows that the Poisson forest is able to model the risk properly:
```r
pred_df <- data.frame('true' = ausprivauto0405$ClaimNb,
                      'pred' = predict(rf_poiss) * ausprivauto0405$Exposure)
split_df <- pred_df %>% split(cut(pred_df$pred,
                                  breaks = quantile(pred_df$pred, probs = seq(0, 1, 0.2)),
                                  labels = c('lowest risk', 'low risk', 'medium risk',
                                             'high risk', 'highest risk')))
lapply(split_df, function(df_sub) mean(df_sub$true))
```
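The OOB curve above tracks the Poisson deviance, and the same measure can be computed on these predictions. A minimal sketch with unit weights (`poisson_deviance` is an illustrative helper, not part of the package):

```r
# Unit-weight Poisson deviance; the y == 0 branch uses the convention
# y * log(y / mu) = 0 for y = 0
poisson_deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
}
poisson_deviance(pred_df$true, pred_df$pred)
```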
Besides estimating how frequently a policyholder will file a claim, it is also important to get an idea of the actual severity of the claims in money terms. The variable `ClaimAmount` in the `ausprivauto0405` data contains the sum of the claim payments over all claims filed by a specific policyholder. To approximate the individual claim amounts, a new variable `ClaimAvg` is defined for the average claim payment, but only for those policyholders actually filing a claim. These claim amounts are clearly long-tailed, which calls for appropriate statistical assumptions:
```r
ausprivauto0405_claims <- ausprivauto0405[ausprivauto0405$ClaimOcc == 1, ]
ausprivauto0405_claims$ClaimAvg <- with(ausprivauto0405_claims, ClaimAmount / ClaimNb)
ausprivauto0405_claims$ClaimAvg %>% quantile(probs = seq(0, 1, 0.2))
```
Let's build a gamma regression forest for the average claim amount, with the number of claims as case weights:
```r
# Build the random forest
set.seed(54321)
rf_gamma <- rforest(formula = ClaimAvg ~ VehValue + VehAge + VehBody + Gender + DrivAge,
                    data = ausprivauto0405_claims,
                    weights = ClaimNb,
                    method = 'gamma',
                    control = rpart.control(minsplit = 20, cp = 0, xval = 0, maxdepth = 5),
                    ncand = 3, ntrees = 200, subsample = 0.5,
                    track_oob = TRUE, keep_data = TRUE, red_mem = TRUE)
```
The fit is of the class `rforest`, which is a list containing the individual `trees`, the `oob_error` and the `data`.
```r
class(rf_gamma)
names(rf_gamma)
rf_gamma[['trees']][[1]]
```
The OOB error evolution (`track_oob = TRUE` in `rforest`) shows a decreasing trend in the gamma deviance, which means that the predictions for the claim amounts are improving over the iterations:
```r
oob_df <- data.frame('iteration' = seq_len(length(rf_gamma[['oob_error']])),
                     'oob_error' = rf_gamma[['oob_error']])
ggplot(oob_df, aes(x = iteration, y = oob_error)) + geom_point()
```
Predictions from the random forest can be compared to the true values to assess performance. Note that the predictions are made for the average claim amount, so you need to multiply the predictions by the observed `ClaimNb` values to get the aggregate claim cost prediction. Policyholders are split in 5 groups based on their predicted values, going from low to high risk, and the mean of the observed claim amounts is calculated per group. The increasing trend shows that the gamma forest is able to model the risk properly:
```r
pred_df <- data.frame('true' = ausprivauto0405_claims$ClaimAmount,
                      'pred' = predict(rf_gamma) * ausprivauto0405_claims$ClaimNb)
split_df <- pred_df %>% split(cut(pred_df$pred,
                                  breaks = quantile(pred_df$pred, probs = seq(0, 1, 0.2)),
                                  labels = c('lowest risk', 'low risk', 'medium risk',
                                             'high risk', 'highest risk')))
lapply(split_df, function(df_sub) mean(df_sub$true))
```
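Analogously to the Poisson case, the gamma deviance can serve as a summary measure for these predictions. A minimal sketch with unit weights (`gamma_deviance` is an illustrative helper, not part of the package):

```r
# Unit-weight gamma deviance between observed and predicted values
gamma_deviance <- function(y, mu) {
  2 * mean(-log(y / mu) + (y - mu) / mu)
}
gamma_deviance(pred_df$true, pred_df$pred)
```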
The `distRforest` package allows for an easy calculation of variable importance scores for an `rforest` object. The function `importance_rforest` takes one argument, namely the fitted `rforest` object. The result is a data frame with one row per variable and four columns:

- `variable`: name of the variable.
- `importance`: average importance score, taken over all the individual trees.
- `scale_sum`: scaled scores which sum to one.
- `scale_max`: scaled scores such that the maximum value is equal to one.

Assessing the importance of each variable in the three forests built before shows a rather uniform ranking:
```r
rf_class %>% importance_rforest
rf_poiss %>% importance_rforest
rf_gamma %>% importance_rforest
```
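The `scale_max` column lends itself to a quick visual comparison of the variables. A minimal sketch for one of the forests (the plotting code is illustrative, not part of the package):

```r
# Bar chart of the scaled importance scores of the Poisson forest
imp_df <- importance_rforest(rf_poiss)
ggplot(imp_df, aes(x = reorder(variable, scale_max), y = scale_max)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = 'scaled importance')
```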