Credit Default Example

library(knitr)
knitr::opts_chunk$set(tidy = FALSE, message=FALSE, warning=FALSE, fig.align='center', fig.width = 14, size=8)
library(randomForest)
library(predcomps)
library(ggplot2)
theme_set(theme_gray(base_size = 18))
library(dplyr)
library(gridExtra)

This example is based on the training data set found here. We build a model to predict:

The input features are:

Parameters:

These are some parameters controlling the aggregate predictive comparisons:

# Parameters controlling the predictive comparisons computation:
#  We consider transitions starting at each of 500 random rows
numForTransitionStart <- 50
#  ... going to each of 10,000 other random rows:
numForTransitionEnd <- 1000
#  ... keeping only the nearest 100 pairs for each start:
onlyIncludeNearestN = 10

And for the random forest:

# 100 trees for random forest 
ntree = 10
# Remove some outliers
credit <- read.csv("~/Downloads/cs-training.csv")[,-1]
credit <- subset(credit, !is.na(MonthlyIncome) &
                    NumberOfTime30.59DaysPastDueNotWorse < 5 & 
                    RevolvingUtilizationOfUnsecuredLines <= 2 &
                    NumberOfTime30.59DaysPastDueNotWorse <= 5 &
                    NumberOfTime60.89DaysPastDueNotWorse <= 5 &
                    NumberOfTimes90DaysLate <= 5 &
                    MonthlyIncome < 5e4 &
                    DebtRatio < 2 &
                    NumberRealEstateLoansOrLines <= 12 &
                    NumberOfDependents < 10                  
                  )

Input Distribution

The distribution of the inputs (after removing some outliers to make things more manageable):

histograms <- Map(function(colName) {
  qplot(credit[[colName]]) + 
    ggtitle(colName) +
    xlab("")},
  setdiff(names(credit), "SeriousDlqin2yrs"))
allHistograms <- do.call(arrangeGrob, c(histograms, ncol=2))
print(allHistograms)

Build a random forest model:

set.seed(1)
# Turning the response to type "factor" causes the RF to be build for classification:
credit$SeriousDlqin2yrs <- factor(credit$SeriousDlqin2yrs) 
rfFit <- randomForest(SeriousDlqin2yrs ~ ., data=credit, ntree=ntree)

Aggregate Predictive Comparisons

set.seed(1)
apcDF <- GetPredCompsDF(rfFit, credit,
                        numForTransitionStart = numForTransitionStart,
                        numForTransitionEnd = numForTransitionEnd,
                        onlyIncludeNearestN = onlyIncludeNearestN)

Hi

kable(apcDF, row.names=FALSE)

Here impact chart. Its units are changes in probability, so we can compare it across all of the inputs:

PlotPredCompsDF(apcDF)

Note that average absolute value of the change in probability associated with changes in age is much larger than the magnitude of the signed average. This indicates either an interact effect between age and other inputs, or a non-linear (non-monotonic) relationship between age an probability of default. We'll look into that a bit later.

It wouldn't make sense to chart the average predictive comparisons in this way since they don't share units, but we can chart the inputs corresponding to a number of numbers late for various periods:

PlotPredCompsDF(apcDF[grep("NumberOfTime", apcDF$Input), ],
                variant = "Apc")

As you'd expect, greater periods of lateness are worse (per additional incident).

However, NumberOfTime30.59DaysPastDueNotWorse makes more overall difference to the model, because its variation is larger (it's non-zero more often):

PlotPredCompsDF(apcDF[grep("NumberOfTime", apcDF$Input), ])

More Detailed Examination: NumberOfTime30.59DaysPastDueNotWorse

Recall that data from which the summarized predictive comparisons are computed consists of groups of rows, where with in each group only the input of interest varies (for the point we imagine transitioning to) and the rest are held constant. We can work directly with this data, visualizing it in more detail to better understand our model:

set.seed(6)
pairs <- GetComparisonDF(rfFit, credit, 
                         u="NumberOfTime30.59DaysPastDueNotWorse",
                         numForTransitionStart = 20,
                         numForTransitionEnd = numForTransitionEnd*10,
                         onlyIncludeNearestN = onlyIncludeNearestN*10)

pairsSummarized <- pairs[c("OriginalRowNumber", "NumberOfTime30.59DaysPastDueNotWorse.B", "yHat2", "Weight")] %.% 
  group_by(OriginalRowNumber, NumberOfTime30.59DaysPastDueNotWorse.B, yHat2) %.% summarise(Weight = sum(Weight))

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  scale_x_continuous(limits=c(0,2)) +
  scale_size_area()

I've made the size of the points proportional to weight that the point receives. The summarized predictive comparisons give more weight to points with more weight, and so should we.

The relationship is mostly as you'd expect, but let's examine the highlighted one in more detail:

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = n)) +
  geom_line(aes(alpha=ifelse(OriginalRowNumber == 18, 1, .3))) +
  scale_x_continuous(limits=c(0,2)) +
  scale_alpha_identity()

Why does the default probability decrease when NumberOfTime30.59DaysPastDueNotWorse increases? Values for all the other inputs are held constant, so let's see what they are:

oneOriginalRowNumber <- subset(pairs, OriginalRowNumber == 18)
oneRow <- t(oneOriginalRowNumber[1,intersect(names(oneOriginalRowNumber), names(credit))])
colnames(oneRow) <- "Input values for highlighted row"
grid.newpage()
grid.table(oneRow)

Hmm, note that NumberOfTimes90DaysLate (almost always $0$) is $1$ in this case. Looking at the definition of the target variable (SeriousDlqin2yrs), this makes a lot of sense:

SeriousDlqin2yrs: "Person experienced 90 days past due delinquency or worse."

As we'd expect, previous instances of 90-days-late are a strong indicator of future ones. Adding a previous 30-days-late (but not more) to a previous 90-days-late seems to decrease the chance of future 90-days-lates. This is sensible -- with both, we have evidence that when you're late, you at least sometimes pay back in under 60 days.

Further exploratory analysis would further improve our understanding:

More Detailed Examination: Age

For one more example, let's examine the Age input in more detail.

set.seed(3)
pairs <- GetComparisonDF(rfFit, credit, 
                         u="age",
                         numForTransitionStart = 20,
                         numForTransitionEnd = numForTransitionEnd*10,
                         onlyIncludeNearestN = onlyIncludeNearestN*10)

pairsSummarized <- pairs[c("OriginalRowNumber", "age.B", "yHat2", "Weight")] %.% 
  group_by(OriginalRowNumber, age.B, yHat2) %.% 
  summarise(Weight = sum(Weight))

ggplot(pairsSummarized, aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2)

This is a bit of a mess, but we can at least see see that interaction effects and non-monotonicity are both going on.

Further exploration would look into which other inputs age is interacting with to determine these differently shaped curves.



dchudz/predcomps documentation built on May 15, 2019, 1:48 a.m.