What is Credit Scoring?

This package has been developed as part of a CIFRE PhD, a special PhD contract in France which is for the most part financed by a company. This company subsequently gets to choose which subject(s) are tackled.

This research has been financed by Crédit Agricole Consumer Finance (CA CF), a subsidiary of the Crédit Agricole Group, which provides all kinds of banking and insurance services. CA CF focuses on consumer loans, ranging from luxury cars to small electronics.

In order to accept or reject loan applications more efficiently (both faster and with better-selected applicants), most financial institutions resort to Credit Scoring: given the applicant's characteristics, he or she is given a Credit Score, which has been statistically designed using previously accepted applicants, and which partly decides whether the financial institution will grant the loan or not.

Context

In practice, the statistical modeler has historical data about each customer's characteristics. For obvious reasons, only data available at the time of application can be used to build a future application scorecard. These data often take the form of a well-structured table with one line per client alongside their performance (did they pay back their loan or not?), as can be seen in the following table:

knitr::kable(data.frame(
  Job = c("Craftsman", "Technician", "Executive", "Office employee"),
  Habitation = c("Owner", "Renter", "Starter", "By family"),
  Time_in_job = c(10, 20, 5, 2),
  Children = c(0, 1, 2, 3),
  Family_status = c("Divorced", "Widower", "Single", "Married"),
  Default = c("No", "No", "Yes", "No")
))

Formulation

The construction of the variable to predict, here denoted by \emph{Default}, is an active research field and we will not discuss it here. We suppose we already have a binary random variable $Y$ from which we have $n$ observations $\mathbf{y} = (y_i)_1^n$.

The $d$ predictive features, here for example the job, habitation situation, etc., are usually socio-demographic features asked by the financial institutions at the time of application. They are denoted by the random vector $\boldsymbol{X} = (X_j)_1^d$ and as for $Y$ we have $n$ observations $\mathbf{x}=(x_i)_1^n$.

We suppose that the observations $(\mathbf{x},\mathbf{y})$ come from an unknown distribution $p(\boldsymbol{x},y)$ which is not directly of interest. Our interest lies in the conditional probability that a client with characteristics $\boldsymbol{x}$ pays back their loan, i.e. $p(y|\boldsymbol{x})$, which is also unknown.

In the context of Credit Scoring, we historically stick to logistic regression, for various reasons out of the scope of this vignette. The logistic regression model assumes the following relation between $\boldsymbol{X}$ (supposed continuous here) and $Y$: $$\ln \left( \frac{p_{\boldsymbol{\theta}}(Y=1|\boldsymbol{x})}{p_{\boldsymbol{\theta}}(Y=0|\boldsymbol{x})} \right) = (1, \boldsymbol{x})'{\boldsymbol{\theta}}$$

We would like to have the "best" model compared to the true $p(y|\boldsymbol{x})$, from which we only have samples. Had we access to the true underlying distribution, we would maximize, w.r.t. $\boldsymbol{\theta}$, the expected log-likelihood $H_{\boldsymbol{\theta}} = \mathbb{E}_{(\boldsymbol{X},Y) \sim p}[\ln p_{\boldsymbol{\theta}}(Y|\boldsymbol{X})]$. Since this is not possible, we approximate this criterion by maximizing, w.r.t. $\boldsymbol{\theta}$, the log-likelihood $\ell({\boldsymbol{\theta}};\mathbf{x},\mathbf{y}) = \sum_{i=1}^n \ln p_{\boldsymbol{\theta}}(y_i|\boldsymbol{x}_i)$.

In R, this is done by fitting a \code{glm} model to the data:

library(scoringTools)
# Logistic regression of the Default indicator on all other features
scoring_model <- glm(Default ~ ., data = lendingClub, family = binomial(link = "logit"))

We can now focus on the regression coefficients $\boldsymbol{\theta}$:

scoring_model$coefficients

and the deviance at this estimation of $\boldsymbol{\theta}$:

scoring_model$deviance

From this, it seems that Credit Scoring is pretty straightforward when the data is at hand.

Conceptual problems of current approaches to Credit Scoring

Nevertheless, the current approach suffers from a few theoretical limitations, described below.

Problems tackled in this package

Two problems have been tackled so far in the Credit Scoring framework:

  1. Reject Inference,
  2. "Quantization" of continuous (discretization) and qualitative (grouping) features.

Other packages

We released two other packages:

  1. Package glmdisc for "quantization" of continuous (discretization) and qualitative (grouping) features and interactions among covariates,
  2. Package glmtree for "segmentation" of clients into subpopulations with different scorecards: logistic regression trees.

Other packages focus on Credit Scoring; see e.g. this review paper.

Reject Inference

Context

Current acceptance system

From all applicants who receive a Credit Score, three sub-populations are of interest:

  1. the financed clients, who were granted a loan;
  2. the rejected applicants, who were turned down either by business rules (e.g. over-indebtedness) or because of a low Credit Score;
  3. the not-taking-up applicants, who were offered a loan but decided not to take it (e.g. they no longer need it or they went to a competitor).

Obviously, the performance variable $Y$ is observed only for financed clients, so that we have $n$ observations of financed clients $(\boldsymbol{x}_i,y_i)_1^n$ and $n'$ observations of not-financed applicants for whom we only have the characteristics $(\boldsymbol{x}_i)_1^{n'}$.

Mathematical formulation

Strictly speaking, denoting by $Z \in \{\text{f}, \text{nf}\}$ whether an applicant was financed, we have observations from $p(\boldsymbol{x},y,Z=\text{f})$, and by fitting a logistic regression to these data we estimate $p(y|\boldsymbol{x},Z=\text{f})$, which can be quite "different" from $p(y|\boldsymbol{x})$. Since the Credit Score is applied to the whole population to decide whether to accept or reject applicants, this can lead to a biased model, even asymptotically.

Whether the resulting model is biased hinges on three important keys, discussed below: the missingness mechanism (MAR or MNAR), whether the logistic regression model is well-specified, and whether the financing mechanism $p(f|\boldsymbol{x})$ can be estimated and put to use.

Theoretical findings

Our theoretical findings on this subject are discussed in Ehrhardt et al. (2017).

In short, using only financed clients' characteristics to learn a logistic regression model is asymptotically correct when the missingness mechanism is MAR and the model is true. We can easily show this by simulating data:

data_cont_simu <- function(n, d, k) {
  # Simulate n observations of d uniform features and a binary outcome
  # following a logistic model with true coefficients theta = (1, -1)
  set.seed(k)
  x <- matrix(runif(n * d), nrow = n, ncol = d)
  theta <- c(1, -1) # true coefficients (d = 2 expected)
  log_odd <- x %*% theta

  y <- rbinom(n, 1, 1 / (1 + exp(-log_odd)))

  return(list(x, y))
}

if (require(ggplot2, quietly = TRUE)) {
  # Small sample: visualize the two features coloured by the outcome y
  data <- data_cont_simu(100, 2, 1)
  x <- data[[1]]
  y <- data[[2]]
  df <- data.frame(x = x, y = y)
  print(ggplot(df, aes(x = x.1, y = x.2, colour = factor(y))) +
    geom_point())

  # Fit a score on all applicants, then mimic a MAR acceptance rule:
  # the decision depends on x only, through the predicted probability
  data <- data_cont_simu(1000, 2, 1)
  x <- data[[1]]
  y <- data[[2]]
  df <- data.frame(x = x, y = y)
  hat_theta <- glm(y ~ . - 1, data = df, family = binomial(link = "logit"))
  df$decision <- factor(ifelse(predict(hat_theta, df, type = "response") > 0.7, "reject", "accept"))
  print(ggplot(df, aes(x = x.1, y = x.2, colour = decision)) +
    geom_point())

  # Monte Carlo experiment: estimates on all applicants (theta_1, theta_2)
  # vs. estimates on the accepted sub-population only (theta_1_f, theta_2_f)
  theta_1 <- matrix(NA, ncol = 1, nrow = 1000)
  theta_2 <- matrix(NA, ncol = 1, nrow = 1000)
  theta_1_f <- matrix(NA, ncol = 1, nrow = 1000)
  theta_2_f <- matrix(NA, ncol = 1, nrow = 1000)
  for (k in 1:1000) {
    data <- data_cont_simu(1000, 2, k)
    x <- data[[1]]
    y <- data[[2]]
    df <- data.frame(x = x, y = y)
    hat_theta <- glm(y ~ . - 1, data = df, family = binomial(link = "logit"))

    theta_1[k] <- hat_theta$coefficients[1]
    theta_2[k] <- hat_theta$coefficients[2]

    df$decision <- factor(ifelse(predict(hat_theta, df, type = "response") > 0.6, "reject", "accept"))
    hat_theta_f <- glm(y ~ . - 1, data = df[df$decision == "accept", -ncol(df)], family = binomial(link = "logit"))

    theta_1_f[k] <- hat_theta_f$coefficients[1]
    theta_2_f[k] <- hat_theta_f$coefficients[2]
  }
  # Sampling distribution of the first coefficient; the vertical line
  # marks the true value theta_1 = 1
  print(ggplot(data.frame(theta_1), aes(x = theta_1)) +
    geom_histogram() +
    geom_vline(xintercept = 1))
}

When the missingness mechanism is MNAR, $p(y|\boldsymbol{x},f) \neq p(y|\boldsymbol{x},nf)$, so that there is no way to "unbias" the resulting model without introducing data from the financing mechanism and modeling $p(f|\boldsymbol{x},y)$.
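To see why, note that under MAR the financing decision depends on the applicant's characteristics only, i.e. $p(f|\boldsymbol{x},y) = p(f|\boldsymbol{x})$, so that Bayes' rule gives $$p(y|\boldsymbol{x},f) = \frac{p(f|\boldsymbol{x},y)\,p(y|\boldsymbol{x})}{p(f|\boldsymbol{x})} = p(y|\boldsymbol{x}),$$ whereas under MNAR the factor $p(f|\boldsymbol{x},y)$ genuinely depends on $y$ and the equality breaks down, which is why the financing mechanism must then be modeled explicitly.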

When the model is false, we could make use of not-financed clients to estimate $p(f|\boldsymbol{x})$ and treat it as an importance function in the Importance Sampling framework. However, this approach gives good results only when the importance function is known and under probabilistic assumptions that are not met in our use case. Here the importance function must be estimated separately, and simulations show that this estimation introduces additional bias and variance, so that it does not improve upon the financed clients' model.
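To fix ideas, here is a minimal sketch of this estimate-then-reweight idea, reusing df from the simulation above (a hypothetical illustration, not the package's implementation):

# Hedged sketch of inverse-probability weighting: estimate the financing
# mechanism p(f|x) on all applicants, then weight each financed client by
# the inverse of its estimated financing probability
financed <- df$decision == "accept"
p_f_model <- glm(financed ~ x.1 + x.2, data = df, family = binomial())
w <- 1 / predict(p_f_model, df[financed, ], type = "response")
# non-integer weights trigger a harmless warning in binomial glm
reweighted_model <- suppressWarnings(
  glm(y ~ . - 1,
    data = df[financed, c("x.1", "x.2", "y")],
    family = binomial(), weights = w
  )
)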

Reject Inference methods

To deal with the possible bias of fitting a logistic regression to the financed clients' data alone, Reject Inference methods have been proposed in the literature. We showed that none of them can be expected to yield a better model; we nevertheless implemented them so they can be compared numerically.

Functions

In this package, we implemented the Reject Inference methods that were supposed to enable credit risk modelers to use not-financed clients' characteristics in the logistic regression learning process; each of them is illustrated below.

The first method is Fuzzy Augmentation as described in the Appendix of my PhD thesis.

# Split the simulated applicants from the previous section into financed
# (accepted) and not-financed (rejected) sub-populations
xf <- as.matrix(df[df$decision == "accept", c("x.1", "x.2")])
xnf <- as.matrix(df[df$decision == "reject", c("x.1", "x.2")])
yf <- df[df$decision == "accept", "y"]
hat_theta_fuzzy <- fuzzy_augmentation(xf, xnf, yf)

The second method is Reclassification as described in the Appendix of my PhD thesis.

hat_theta_reclassification <- reclassification(xf, xnf, yf)

The third method is Augmentation as described in the Appendix of my PhD thesis.

hat_theta_augmentation <- augmentation(xf, xnf, yf)

The fourth method is Parcelling as described in the Appendix of my PhD thesis.

hat_theta_parcelling <- parcelling(xf, xnf, yf)

The fifth method is Twins as described in the Appendix of my PhD thesis.

hat_theta_twins <- twins(xf, xnf, yf)

Each of these functions outputs an S4 object of class "reject_infered" with slots \code{method_name}, \code{financed_model}, \code{acceptance_model} and \code{infered_model}:

hat_theta_augmentation@method_name
hat_theta_reclassification@financed_model
hat_theta_twins@acceptance_model
hat_theta_fuzzy@infered_model

Methods

To use the generated S4 objects of class "reject_infered" efficiently, several methods were implemented, which we detail here.

print method

The print method shows the method used and the coefficients of \code{infered_model}, much like you would get from printing a \code{glm} object:

print(hat_theta_reclassification)

summary method

The summary method shows the method used alongside the coefficients of the financed model, the acceptance model (when relevant) and the infered model with AIC values, much like you would get from calling summary on a \code{glm} object:

summary(hat_theta_reclassification)

Quantization

Context

Under the term "quantization", we refer both to the process of transforming a continuous feature into a categorical one whose values correspond to unique intervals of the continuous feature (discretization), and to the process of regrouping the values of a qualitative feature (grouping).
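As a toy illustration (hypothetical values, not package code), discretization of a continuous feature can be done with \code{cut}, and grouping of a qualitative feature by collapsing factor levels:

# Discretization: "age" becomes a categorical feature with 3 intervals
age <- c(23, 35, 51, 67)
cut(age, breaks = c(18, 30, 50, Inf), labels = c("18-30", "30-50", "50+"))

# Grouping: collapse the levels of a qualitative feature
job <- factor(c("Craftsman", "Technician", "Executive", "Office employee"))
levels(job) <- list(
  Manual = c("Craftsman", "Technician"),
  Office = c("Executive", "Office employee")
)
job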

Discretizing the input features has a few advantages, as well as a few drawbacks; both are explained in depth in the literature.

Despite its limitations, CA CF decided to keep developing their scorecards by discretizing the input features and fitting a logistic regression. However, with the growing number of input features in the era of Big Data, the increasing number of products and types of clients addressed, and the simultaneous aging of their previous scorecards, they decided to build an automatic tool generating production-ready scorecards by automating both the discretization process under constraints (which we develop later on) and the logistic regression fitting. They had to be confident in the tool's underlying mechanisms, mathematically speaking, which is why it became a research project. We first delve into the mathematics of the problem.

Mathematical reinterpretation

Model

We consider a random vector $\boldsymbol{X}=(X_1,\ldots,X_d)$ where $X_j$ can be either continuous or qualitative (with $o_j$ distinct values). We denote by $\boldsymbol{\mathfrak{q}}=(\mathfrak{q}_1,\ldots,\mathfrak{q}_d)$ the quantized random vector, where $\mathfrak{q}_j$ is the quantization of $X_j$, i.e. a qualitative feature with $m_j$ values corresponding either to unique intervals of $X_j$ (continuous case) or to unique groupings of $X_j$'s $o_j$ values (which implies $m_j \leq o_j$).

We suppose that by quantizing features $\boldsymbol{X}$, we preserve all information about the target feature, i.e. $p(y|\boldsymbol{x},\boldsymbol{\mathfrak{q}}) = p(y|\boldsymbol{\mathfrak{q}})$.

Hard optimization problem

Although this process seems straightforward, it is a rather complicated optimization problem in terms of both combinatorics and estimation (it is a discrete problem).
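To get a feel for the combinatorics (a hypothetical back-of-the-envelope computation): a continuous feature observed at $n$ distinct values admits $\binom{n-1}{m-1}$ discretizations into $m$ intervals, and a qualitative feature with $o$ levels admits a Stirling number of the second kind $S(o,m)$ of groupings into $m$ non-empty groups:

# Cut points for m = 5 intervals are chosen among the n - 1 = 99 gaps
choose(100 - 1, 5 - 1) # 3764376 candidate discretizations

# Stirling number of the second kind: partitions of o levels into m groups
stirling2 <- function(o, m) {
  sum((-1)^(0:m) * choose(m, 0:m) * (m - (0:m))^o) / factorial(m)
}
stirling2(10, 4) # 34105 candidate groupings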

Choosing the "right" model: which criterion?

The task is to find the optimal logistic regression $p_{\boldsymbol{\theta}}(y|\boldsymbol{\mathfrak{q}})$ where $\boldsymbol{\mathfrak{q}}$ is unknown and must be chosen in a set $\mathbf{\mathfrak{Q}}_{\boldsymbol{m}}$ which, as said in the previous section, is very large. The model selection problem can thus be expressed in terms of classical criteria, e.g. AIC, BIC, \ldots, where both $\boldsymbol{\Theta}$ and $\mathbf{\mathfrak{Q}}_{\boldsymbol{m}}$ have to be scanned:

$$ (\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\mathfrak{q}}}) = \arg \max_{(\boldsymbol{\theta},\boldsymbol{\mathfrak{q}})} \text{AIC}(p_{\boldsymbol{\theta}}(\mathbf{y}|\mathbf{\mathfrak{q}})).$$
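As a toy illustration of this criterion (hypothetical data, not package code), one can compare the AIC of logistic regressions fitted on two candidate quantizations of the same feature:

# Compare two candidate quantizations (3 vs. 5 quantile-based intervals)
set.seed(1)
x1 <- runif(1000)
y1 <- rbinom(1000, 1, plogis(2 * sin(5 * x1))) # nonlinear true effect
q3 <- cut(x1, breaks = quantile(x1, 0:3 / 3), include.lowest = TRUE)
q5 <- cut(x1, breaks = quantile(x1, 0:5 / 5), include.lowest = TRUE)
AIC(
  glm(y1 ~ q3, family = binomial()),
  glm(y1 ~ q5, family = binomial())
)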

The need for generating "clever" candidates

The criterion developed in the previous part cannot be optimized directly because the set $\mathbf{\mathfrak{Q}}_{\boldsymbol{m}}$ of candidate quantizations is too large. The idea behind most existing supervised discretization methods is to generate potentially "good" candidates (although most methods do not depend on the predictive algorithm to be applied after discretization).

By doing so, we restrict $\mathbf{\mathfrak{Q}}_{\boldsymbol{m}}$ to the generated candidates, a considerably smaller set, with most existing methods outputting only one discretization scheme. We go through a few of them in a subsequent section.

We proposed a quantization algorithm that meets these criteria; it is sketched below.

In short, the algorithm treats the quantized features $\boldsymbol{\mathfrak{q}}$ as latent variables, which are drawn from their estimated posterior distribution as part of an SEM algorithm. The algorithm alternates between:

  1. Fitting a logistic regression between $\boldsymbol{\mathfrak{q}}$ and $y$;
  2. Fitting polytomous logistic regression functions between each pair $(x_j,\mathfrak{q}_j)$;
  3. Generating new quantized features by sampling each $\mathfrak{q}_j$ from $p(\mathfrak{q}_j|\mathfrak{q}_{-j},\boldsymbol{x},y) \propto p(\mathfrak{q}_j|x_j)\, p(y|\boldsymbol{\mathfrak{q}})$.

The approach is implemented in the glmdisc package and is not part of this package. Please refer to its vignette (by typing vignette("glmdisc")).
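For intuition, here is a minimal, self-contained sketch of one such SEM iteration for a single continuous feature (an illustration under simplifying assumptions, not the glmdisc implementation; it assumes the nnet package for the polytomous regression):

# One schematic SEM iteration for a single feature (illustration only)
if (requireNamespace("nnet", quietly = TRUE)) {
  set.seed(1)
  x_j <- runif(500)
  q_j <- cut(x_j, breaks = c(0, 0.3, 0.7, 1), labels = FALSE) # current quantization
  y_sem <- rbinom(500, 1, plogis(c(-2, 0, 2)[q_j])) # toy target
  # Step 1: logistic regression of y given the quantized feature
  fit_y <- glm(y_sem ~ factor(q_j), family = binomial())
  # Step 2: polytomous regression of the quantization given the raw feature
  fit_q <- nnet::multinom(factor(q_j) ~ x_j, trace = FALSE)
  p_q_given_x <- predict(fit_q, data.frame(x_j = x_j), type = "probs")
  # Step 3: draw a new q_j with probability proportional to p(q_j|x_j) p(y|q)
  p_y_given_q <- sapply(1:3, function(h) {
    p1 <- predict(fit_y, data.frame(q_j = h), type = "response")
    ifelse(y_sem == 1, p1, 1 - p1)
  })
  w_sem <- p_q_given_x * p_y_given_q
  q_j_new <- apply(w_sem, 1, function(p) sample(1:3, 1, prob = p))
}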

Existing methods

The discretization package

In this package, we wrapped functions of the discretization package in a unified S4 class discretization.

# Simulate 3 continuous features whose effect on y is piecewise constant
# over 3 intervals, so that a discretized model is well-specified
x <- matrix(runif(300), nrow = 100, ncol = 3)
cuts <- seq(0, 1, length.out = 4)
xd <- apply(x, 2, function(col) as.numeric(cut(col, cuts)))
# theta[h, j] is the coefficient of the h-th interval of the j-th feature
theta <- t(matrix(c(0, 0, 0, 2, 2, 2, -2, -2, -2), ncol = 3, nrow = 3))
log_odd <- rowSums(t(sapply(seq_along(xd[, 1]), function(row_id) {
  sapply(
    seq_along(xd[row_id, ]),
    function(element) theta[xd[row_id, element], element]
  )
})))
y <- stats::rbinom(100, 1, 1 / (1 + exp(-log_odd)))

# Discretize the features with the chi2 criterion and fit a logistic regression
discrete_modele <- chi2_iter(x, y)

Methods

print method

The print method shows the discretization method used and the coefficients of the fitted logistic regression, much like you would get from printing a \code{glm} object:

print(discrete_modele)

summary method

The summary method shows the discretization method used alongside the fitted logistic regression and its AIC value, much like you would get from calling summary on a \code{glm} object:

summary(discrete_modele)

predict method

The predict method corresponds to the \code{glm} predict method applied to the logistic regression fitted on the discretized features:

predict(discrete_modele, x)

plot method

The plot method corresponds to the \code{glm} plot method for the fitted model and can additionally display its ROC curve:

# NOT RUN since re-building this vignette fails on some CRAN platforms (macOS and Solaris) because plot uses plotly.
plot(discrete_modele, type = "ROC")

Segmentation: logistic regression trees

Classically, Credit Scoring practitioners perform "segmentation" before learning a scorecard, i.e. they try to find homogeneous segments of clients, e.g. "young clients" or "home owners", and combine them. They then perform scoring locally, on the resulting leaves of this decision tree. We thus proposed an algorithm to fit this tree and the logistic regressions at its leaves, which we call glmtree. The resulting model mixes nonparametric vs. parametric and stepwise vs. linear approaches to achieve the best predictive results while maintaining interpretability.

The approach is implemented in the glmtree package required by this package. Please refer to its vignette (by typing vignette("glmtree")).
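For intuition only, here is a hypothetical sketch of the segment-then-score idea using a shallow rpart tree for the segmentation (this is not the glmtree algorithm, which learns the tree and the leaf regressions jointly):

# Hypothetical illustration of "segmentation then local scorecards"
# (not the glmtree algorithm): a shallow tree defines client segments,
# then one logistic regression is fitted per leaf
if (requireNamespace("rpart", quietly = TRUE)) {
  set.seed(1)
  seg_df <- data.frame(x1 = runif(500), x2 = runif(500))
  # two latent segments (x1 < 0.5 or not) with different scorecards
  seg_df$y <- rbinom(500, 1, plogis(ifelse(seg_df$x1 < 0.5,
    2 * seg_df$x2 - 3,
    2 * seg_df$x2 + 1
  )))
  tree <- rpart::rpart(factor(y) ~ x1,
    data = seg_df,
    control = rpart::rpart.control(maxdepth = 1)
  )
  seg_df$leaf <- factor(tree$where) # leaf membership of each client
  local_scorecards <- lapply(split(seg_df, seg_df$leaf), function(d) {
    glm(y ~ x2, data = d, family = binomial()) # one scorecard per segment
  })
}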


