knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages( require(scorecard) ) suppressPackageStartupMessages( require(tidyverse) ) suppressPackageStartupMessages( require(oetteR) )
Weight of evidence (WOE), Information Value (IV) and Score values are common terms when you encounter credit risk modelling in the financial industry. Basically these terms describe certain practices when modelling credit risk with logistic regression and they have been around since the 1950. These techniques do not come up in more modern statistic books when teaching logistic regression, so I wonder what their advantages are. So I read up on them.
This is basically a technique that can be applied if we have a binary response variable and any kind of predictor variable. First we perform a reasonable binning on the response variable and then decide which form of the binary response we count as positive and which as negative. Then we calculate the percentage positive cases in each bin of the total of all positive cases. For example 20 positive cases in bin A out of 100 total positive cases in all bins equals 20 %. Next we calculate the percentage of negative cases in each bin of the total of all negative cases, for example 5 negative cases in bin A out of a total of 50 negative cases in all bins equals 10%. Then we calculate the WOE by dividing the bin percentages of positive cases by the bin percentage of negative cases, and take the logarithm. For the described example log(20/10).
Rule of thump: If WOE values are negative, negative cases supersede the positive cases. If WOE values are positive, positive cases supersede the negative cases.
This serves the following purposes:
- We eliminate any none-linear relationships
- We automatically scale all variables too some extend
- We convert categorical variables to contineous variables
- Missing Data can be handled as just another factor value
- We can built a stand alone score card, that could be manually applied by a person with a pen and a printout of all relevant variables.
It has the following disadvantages:
- We always loose information via binning
- Score development along single variables is not contineous and occurs in steps
- Binning requires manual revision
- Calculating Variable importance is not as straight forward as with classical logistic regression with regularly scaled variables
By doing another sequence of calculations similar to the WOE calculation we can calculate the IV. Classically this serves as variable ranking method and allows us to perform feature selection, which is less compuationally demanding as other methods.
tribble( ~`Information Value`, ~`Predictive Power` , '< 0.02' , 'useless for prediction' , '0.02 - 0.1' , 'weak predictor' , '0.1 - 0.3' , 'medium predictor' , '0.3 - 0.5' , 'strong predictor' , '> 0.5' , 'suspicious too good to be true') %>% knitr::kable( align = c('cl') )
The score is the converted prediction of the model into a score value. The score is of a given arbitrary range with a defined slope and is normally distributed. The score reflects the increase or decrease in odds, with High Score Values reflect a low probability of a modelled event.
The score comes with a scorecard in which all contibuting variables are represented and linked to a specific score value. The sum of all score values that can be attributed to each variable is the total score. For example the credit score of an 18-year-old customer falls into the 18-25 age bin, a group with increased credit risk which reduces the overall credit score by 50 points. Such a score card is sufficient to calculate the total score if one has the values of all variables that contribute to the score.
We will use the scorecard
package and an example dataset to investigate the concepts of WOE, IV and Score and test its implementation in R
data('germancredit') data = germancredit %>% as_tibble() #replace '.' in variable names not compatible with f_train_lasso vars = names(data) %>% str_replace_all( '\\.', '_') names(data) <- vars # convert response factor variable to dummy variable data = data %>% mutate( creditability = ifelse( creditability == 'bad', 1, 0 ) , creditability = as.factor(creditability) ) summary(data)
No mising data in dataset
Amelia::missmap(data)
We can use scorecard::iv()
to calculate the information values
iv = iv(data, y = 'creditability') %>% as_tibble() %>% mutate( info_value = round(info_value, 3) ) %>% arrange( desc(info_value) ) iv %>% knitr::kable()
We can use scorecard::woebin()
to automatically create the bins and calculate the WOE values
bins = woebin(data, y = 'creditability')
scorecard::woebin()
returns a list with one element for each variable. There also is a plotting function that we can use to make meaningful plots and check the binning scorecard::woebin_plot()
.
bins$duration_in_month %>% knitr::kable() woebin_plot(bins$duration_in_month) bins$other_debtors_or_guarantors %>% knitr::kable() woebin_plot(bins$other_debtors_or_guarantors)
We can take the list with all the binning information and pass ist to scorecard::woebin_ply
in order to transform our dataset into an all WOE value dataset
data_woe = woebin_ply( data, bins ) %>% as_tibble()
We can reduce the variables that enter our feature selection process by filtering all variables with IV < 0.02
.
We would then procede to use forward or backward stepwise feature selection, which are somewhat deprecated methods. wikipedia
We prefer to use lasso for feature selection in this case, since we do not have the same computational problems as in the old days we can run the lasso algorithm on all features without prefiltering by IV.
glm
with lasso and crossvalidataionI prefer to use oetteR::f_train_lasso
because it creates a more tidy output and figures and most noteworthy already returns a usable formula for the optimal features.
A vignette for that function can be found here
However oetteR::f_train_lasso
it is simply a wrapper for glmnet::glmnet()
set.seed(1) vars = names(data_woe) vars = vars[ vars != 'creditability'] formula = as.formula( paste( 'creditability ~', paste( vars , collapse = '+') ) ) lasso = oetteR::f_train_lasso( data = data_woe , p = NULL , formula = formula , k = 50 , family = 'binomial' )
plotly::ggplotly( lasso$plot_mse ) p = lasso$plot_coef + theme( legend.position = 'none') plotly::ggplotly( p, tooltip = c('x','y','color'))
the dashed line marks the lambda with min(MSE) the solid line marks lambda for which MSE is in range of min(MSE)+SEM.
We select the highest lambda value whose MSE is in range of the min(MSE)+SEM.
r lasso$formula_str_lambda_1se
lasso$tib_all %>% filter(lambda == lambda_1se) %>% select( lambda_1se, auc, n_coeff_before_lasso, n_coeff_after_lasso) %>% knitr::kable()
We can eliminate 6 out of 20 variables applying the lasso.
Formula logistic regression :
$$ln\left(\frac{P(X)}{1-P(X)}\right) = intercept + \beta_1x + \beta_nx $$ the term on the left is called the link function its result is the logit value
formula = as.formula( lasso$formula_str_lambda_1se ) m = glm( formula, data_woe, family = 'binomial') broom::tidy( m ) %>% mutate( star = oetteR::f_stat_stars(p.value) ) %>% oetteR::f_datatable_universal( round_other_nums = 2, page_length = nrow(.) )
We can convert the logit value to odds by $$e^{logit}$$
The odds can be converted to probability P by $$P(X)=\frac{odds}{odds+1}$$
Classically we would use predict()
with type = 'response'
to directly get the probability. However here we will do the calculations manually as described above.
pred = predict(m) resp = predict(m, type = 'response') res = tibble( logit = pred , odds = exp(pred) , prob = odds / (odds + 1) , prob_ctrl = resp ) res %>% f_datatable_universal( page_length = 10, round_other_nums = 5 )
Define a target:
Target Score Value (ts): 600
Inverted Target Odds (to): 50
*Read as: at my target score 600 the ods should be 1:50 *
Define slope:
points to double the odds (pdo): 20
*Read as the odds should double every 20 points *
Scoring is a bit counter intuitive the higher the score value the lower is the probability. Thus the highest score value will be associated with the lowest odds value. So when deciding on target score ( ts ), target odds( to ) and slope( pdo ) we are actually just defining a starting value and then decrease these values gradually by the rate of the slope until we cover the whole odds range. If we decide on a low target score with a steep slope we will again end up with negative values at the lower end of the score.
$$score = offset - factor\ ln(odds)$$ $$factor = \frac{pdo}{ln(2)}$$ $$offset = ts - factor\ ln(to)$$ replace odds with logit
$$odds = e^{logit}$$
$$score = offset - factor\ logit$$
We can use scorecard::scorecard()
in order to convert the logit to a score
points0 = 600 odds0 = 20 pdo = 50
r points0
r odds0
r pdo
card = scorecard( bins , m , points0 = points0 , odds0 = 1/odds0 # scorecard wants the inverse , pdo = pdo ) sc = scorecard_ply( data, card ) res$score = sc[[1]]
The card list is a fully functional score cards of which we will just print one example
card[[2]]
As a control we are going to calculate them manually using the formulas described above.
factor = pdo / log(2) offset = points0 - factor * log( odds0 ) res$score_ctrl = offset - factor * res$logit res %>% arrange( desc(score) ) %>% f_datatable_universal( page_length = 10, round_other_nums = 5 ) summary(res)
We can see from those summaries that our manual calculations are correct. We can use the simple formula
$$score = offset - factor \ logit$$
to calculate score values for any total logit value now with
r offset
r factor
Here we will compare the distributions and relationships between odds, score, logit and probability in order to better undestand the score values.
# filter control values res = res %>% select( - ends_with('_ctrl') ) res %>% gather( key = 'key', value = 'value', - logit ) %>% ggplot( aes( logit, value, color = key) ) + geom_point() + geom_line() + facet_wrap(~key, scales = 'free_y')
We can see that the score is perfectly linearly correlated with the logit
res %>% mutate( score = score * - 1 ) %>% gather( key = 'key', value = 'value', - odds ) %>% ggplot( aes( odds, value, color = key) ) + geom_point() + geom_line() + facet_wrap(~key, scales = 'free_y') res %>% mutate( score = score * - 1 ) %>% mutate_at( vars(logit, prob, score), scale ) %>% gather( key = 'key', value = 'value', - odds ) %>% ggplot( aes( odds, value, color = key) ) + geom_point( alpha = 0.5 ) + geom_line()
We can see that the relationship between odds and score and odds and logit is identical
res %>% gather( key = 'key', value = 'value' ) %>% ggplot( aes(value) ) + geom_histogram( bins = 50 , fill = 'aquamarine3' , color = 'black' ) + geom_rug()+ facet_wrap(~key, scales = 'free') res %>% select( logit, score ) %>% mutate_all( scale, center = T ) %>% mutate_all( as.vector ) %>% gather( key = 'key', value = 'value' ) %>% ggplot( )+ geom_histogram( aes( x = value, fill = key ) , bins = 50 , position="identity" , alpha = 0.5 )
Score and Logit also have identical distributions
$$logit = intercept + \beta_1x_1 + \beta_nx_n$$
Clasically one would look at the P-values in the model`s summary statistics. However this is particularly flawed when you have colinear variables, which results in reduced P values for all affected variables. Also when having to pick one variable which maximally decreases modelling quality the variable with the lowest P Value is not always the best choice.
A better method is to scale and center the variables before fitting and then rank the variables by their absolute coefficient values $\beta$. The interpretation being that a large absolute value of $x$ is higher the higher its absolute coefficient $\beta$. Numerical variables range from 0 to 1 because of the scaling and categorical variables would be split into dummy variables so we intuitevely know their range as 1 or 0.
However this is now a bit difficult since we replaced all our variable values with the WOE. We do not intuitively know the WOE range of each variable. We simply know that it follows a logarithmic distribution and values >1 and <-1 will be sparse. Thus a high value of $\beta$ could theoretically be multiplied by any range of $x$.
The best solution that we can come up with is to simply multiply coefficient and IV. Which returns sensible results.
imp = tibble( variable = names( coef(m) ) , coef = coef(m) ) %>% mutate( variable = map_chr( variable, function(x) unlist( str_split(x, '_woe') )[[1]] ) ) %>% left_join( iv ) %>% mutate( imp = abs(coef) * info_value ) %>% arrange( desc(imp) ) knitr::kable( imp, align = 'lccc', digits = 2 )
The actually best method of ranking the variables by importance is to remove them from the model and measure the reduction im performance. This method could be applied to all type of models and allow comparisons. However this approach is computationally quite expensive.
In order to interpret the individual results for each customer we need to calculate the logit manually using the coefficients and look at the individual terms of the regression function.
First we calculate all $\beta_n x_n$ logits for all observations
data_relevant = data_woe[, names( coef(m) )[-1] ] data_mult_logit = as_tibble( data_relevant * coef(m)[-1] )
Then we calculate the score value for each logit value and the intercept.
$$BaseScore = offset - factor * intercept$$ $$SummandScore = factor * logit$$
$$TotalScore = \sum SummandScore + BaseScore $$
data_mult_score = data_mult_logit %>% mutate_all( function(x) - factor * x ) %>% mutate( intercept = coef(m)[1] , intercept = offset - factor * intercept ) score = apply( data_mult_score, 1, sum ) data_mult_score$score = score data_mult_score$score_ctrl = res$score data_mult_score %>% select( score, score_ctrl, intercept, everything() ) %>% head(10) %>% knitr::kable()
including original value and WOE tranformed value
# correct variable names new_names_score = names(data_mult_score) %>% str_replace_all( '_woe', '') new_names_data_relevant = names(data_relevant) %>% str_replace_all( '_woe', '') names(data_mult_score) <- new_names_score names(data_relevant) <- new_names_data_relevant obs1_woe = data_relevant[1,] %>% mutate( rwn = row_number() ) %>% select( rwn, everything() ) %>% f_manip_transpose_tibble() obs1_values = data[1,] %>% mutate( rwn = row_number() ) %>% select( rwn, everything() ) %>% f_manip_transpose_tibble() obs1_score = data_mult_score[1,] %>% mutate( rwn = row_number() ) %>% select( rwn, everything() ) %>% f_manip_transpose_tibble() obs1_score %>% left_join( obs1_woe , by = 'row_names' ) %>% left_join( obs1_values, by = 'row_names') %>% left_join( iv, by = c('row_names' = 'variable') ) %>% arrange( desc(info_value) ) %>% rename( variable = row_names , `score` = `1.x` , `woe` = `1.y` , `value` = `1`) %>% knitr::kable( digits = 2, align = 'lcccc' )
We will build a small modelling dataframe, which consists of two differently transformed data sets, two individual formulas and a rsample
object.
recipes
for data transformationrsample
for 10 x 10 cross validationglmnet
via oetteR
for feature selectioncaret
for modelling and performance predictionrecipes
We will create dummy variables and use Yeo Johnson (like BoxCox but is tolerant of negative values) transformation on the contineous variables.
require(recipes) rec = recipe( data, creditability ~ . ) %>% step_dummy( all_predictors(), - all_numeric() ) %>% step_scale( all_numeric() ) %>% step_center( all_numeric() ) %>% step_YeoJohnson( all_numeric() ) rec_prep = prep( rec, data) summary(rec_prep) data_trans = bake( rec_prep, data) #unfortuntatly also recipes uses '.' in the dummy variable names which are not #compatible with f_train_lasso new_names = names(data_trans) %>% str_replace_all( '\\.', '_') %>% unlist() names(data_trans) <- new_names formula = new_names[ new_names != 'creditability'] %>% paste( collapse = '+' ) formula = paste( 'creditability ~', formula ) %>% as.formula()
set.seed(1) lasso_reg = f_train_lasso(data_trans , formula , p = NULL , family = 'binomial' , k = 50)
r lasso_reg$formula_str_lambda_1se
lasso_reg$tib_all %>% filter(lambda == lambda_1se) %>% select( lambda_1se, auc, n_coeff_before_lasso, n_coeff_after_lasso) %>% knitr::kable()
We can eleminate 27 out of 48 variables, remember that we are now carrying dummy variables.
We are using rsample
for the 10 x 10 cross validation
tib = tribble( ~ data_name, ~ data , ~ formula , 'woe' , data_woe , lasso$formula_str_lambda_1se , 'regular', data_trans , lasso_reg$formula_str_lambda_1se ) %>% mutate( rs = map( data, rsample::vfold_cv, v = 10, repeats = 10) , rs = map( rs, rsample::rsample2caret ) , method = 'glm' , formula = map( formula, as.formula ) , data = map(data, mutate , creditability = ifelse(creditability == 1, 'bad', 'good') , creditability = as.factor(creditability) ) ) tib
caret
wr_car = function( formula, rsample, data, method, ... ){ grid = as.data.frame(grid) car = caret::train( formula , data , method = method , trControl = caret::trainControl(index = rsample$index , indexOut = rsample$indexOut , method = 'cv' , verboseIter = F , savePredictions = T , classProbs = T , allowParallel = F , summaryFunction = caret::twoClassSummary ) , ... ) return( car ) }
tib = tib %>% mutate( car = pmap( list(formula, rs, data, method), wr_car , family = 'binomial') , results = map(car,'results' ) ) %>% unnest( results , drop. = F )
tib_perf = tib %>% select( data_name, ROC, ROCSD ) %>% gather( key = 'metric', value = 'value', - data_name, - ends_with('SD') ) %>% gather( key = 'metric_SD', value = 'SD', - data_name, - metric, - value ) %>% select( - metric_SD ) %>% mutate( SEM = SD / sqrt(100) ) tib_perf %>% knitr::kable( digits = 3, align = c('lcccccc') ) tib_perf %>% ggplot( ) + geom_pointrange( aes( x = data_name , y = value , ymin = value - SEM , ymax = value + SEM , color = data_name ) ) + labs( x = '', y = 'AUC')
The WOE method improves modelling performance by 1.5 %, which is not much but is considerable.
We find that tranforming logit, odds or pobabilities to a score value has its benefits when it comes to judging the contribution of each variable to the total score. Further we find that using WOE instead of a more modern approach of data preparation allows us build the model as a score card for which different percentiles of predictor variables contribute a defined score towards the total credit score.
Further we find that this approach does not decrease the performance of the model quite the contrary for this sample dataset performance is actually increased by 1.5%
We have not yeat calibrated the model. Meaning we have not checked whether the calculated probabilities truly reflect the actual probabilities. Applied predictive modelling p. 249
The German Credit data set is an artificial dataset in a sense that it is most likely that occurrences of credit default have been oversampled since the default ratio is 700:300.
To fully implement a working algorithm in R for a real credit risk data set we would propably have to use SMOTE or ROSE oversampling methods. Those methods are implemented in caret
but probably are not yet compatible with recipes
. But this remains to be tested. A description can be found in Applied predictive modelling p. 419
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.