In dchudz/predcomps: Average Predictive Comparisons

Changes to make:

Flesh out motivating examples in much more detail
logistic regression: more explanation / focus on logistic regression equation
maybe nothing different for other example?
For ML audience: lead more on how you would use it by switching order to:
wine as is
calling API on wine
credit
THEN the math definitions
comparison with other approaches

Anthony feedback:

Say earlier why we're doing wine example
why are we looking at wine example?
picture in motivation for the equations
do the exercise live
don't need histograms or feature definitions

load(file="../examples/wine-logistic-regression.RData")
library(ggplot2)
theme_set(theme_gray(base_size = 18))
library(gridExtra)
library(knitr)
library(predcomps)
opts_chunk$set(fig.cap="", echo=FALSE, 
               fig.width=2*opts_chunk$get("fig.width"),
               fig.height=.9*opts_chunk$get("fig.height")
               )

An R Package to Help Interpret Predictive Models:

(Average) Predictive Comparisons

David Chudzicki

Related concepts

sensitivity analysis
elasticity
variable importance
partial dependence
etc.

This approach:

treats model as a black box
tries to properly account for relationships among the inputs of interest
based on Gelman & Pardoe (2007), which asked: "On average, much does the output increase per unit increase in input?"

Plan

Motivating fake example
Predictive comparisons definitions: what we want
Applying average predictive comparisons to (1)
An example with real data: credit scoring
Discussion!
(Appendix) Estimation & Computation: how to get what we want
(Appendix) Comparison with other approaches

Silly Example

We sell wine
Wine varies in: price, quality
Customers randomly do/don't buy, depending on price and quality

Logistic Regression

$P$: price ($)
$Q$: quality (score on arbitrary scale)
Model: $P(\text{wine is purchased}) = logit^{-1}(\beta_0 + \beta_1 Q + \beta_2 P)$

library(boot)
s <- seq(-4,4,by=.01)
qplot(s, 1/(1+exp(-s)), geom="line") + ggtitle("Inverse Logit Curve")

difficulty: interpreting coefficients on logit scale

True model: $P(\text{wine is purchased}) = logit^{-1}(0.1 Q - 0.12 P)$

Distribution of Inputs (variation 1):

Price and quality are (noisily) related:

myScales <- list(scale_x_continuous(limits=c(-15,125)),
                 scale_y_continuous(limits=c(0,1)))
qualityScale <- ylim(c(-20,130))
qplot(Price, Quality, alpha=I(.5), data = df1Sample) + 
  qualityScale

We don't really need a model to understand this...

(A random subset of the data. For clarity, showing only a discrete subset of prices.)

v1Plot <- ggplot(subset(df1Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v1Plot

We don't really need a model to understand this...

For each individual price, quality vs. purchase probability forms a portion of a shifted inverse logit curve:

last_plot() + geom_line(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                        data = linesDF,
                        size=.2)

how changes in price/quality affect $P(\text{wine is purchased})$ depends a lot on where you are in input space

Variation 2

In another possible world, mid-range wines are more common:

qplot(Price, Quality, alpha=I(.5), data = df2Sample) + 
  qualityScale

input distribution is changed
... but model is not changed

In this world, quality matters more....

(more precisely, cases where quality matters are more common)

v2Plot <- ggplot(subset(df2Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v2Plot

Variation 3

In a third possible world, price varies more strongly with quality:

qplot(Price, Quality, alpha=I(.5), data = df3Sample) + 
  qualityScale

Again:

input distribution is changed
... but model is not changed, still $P(\text{wine is purchased}) = logit^{-1}(\beta_0 + \beta_1 Q + \beta_2 P)$

In this world, quality matters even more...

(more precisely: quality matters in all cases that we actually see)

v3Plot <- ggplot(subset(df3Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v3Plot

Now quality matters more

... across all price ranges (for the kinds of variation that we see in the data)

v3PlotWithCurves <- last_plot() + geom_line(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                                            data = linesDF,
                                            size=.2)
v3PlotWithCurves

Lessons from this example

We want to interpret things on the scale we care about (probability in this case)
Relationships among the inputs matter

Goal is single-number summaries

These concepts are vague, but keep them in mind as we try to formalize things in the next few slides:

For each input, what is the average change in output per unit change in input? (generalizes linear regression, units depend on units for input)
How important is each input in influencing the output? (units should be consistent across inputs -- think of standardized regression coefficients)

Some notation

$u$: the variable under consideration

$v$: the vector of other variables (the "all else held equal")

$f(u,v)$: a function that makes predictions, e.g. maybe $f(u,v) = \mathcal{E}[y \mid u, v, \theta]$

We consider transitions in $u$ holding $v$ constant, e.g. $$\frac{f(u_2, v) - f(u_1, v)}{u_2-u_1}$$
To get one-point summaries we'll take an average
All of the subtlety lies in the choice of $v$, $u_1$, $u_2$

What average do we take?

The APC is defined as

$$\frac{\mathcal{E}[\Delta_f]}{\mathcal{E}[\Delta_u]}$$

where

$\Delta_f = f(u_2,v) - f(u_1,v)$
$\Delta_u = u_2 - u_1$
$\mathcal{E}$ is expectation under the following process:
sample $v$ from the (marginal) distribution of the corresponding inputs
sample $u_1$ and $u_2$ independently from the distribution of $u$ conditional on $v$

Variations

"Impact" (my idea; help me with naming!) is just the expected value of $\Delta_f = f(u_2,v) - f(u_1,v)$
Absolute versions (of both impact and APC) use $\mathcal{E}[|\Delta_f|]$ and $\mathcal{E}[|\Delta_u|]$

Computation

Once we've said what we want, computing/estimating it isn't trivial
More on this in the discussion (depending on time/interest)

Returning to the wines...

apcComparisonPlot <- ggplot(subset(apcAllVariations, Input=="Quality")) +
  geom_bar(aes(x=factor(Variation, levels=3:1), y=PerUnitInput.Signed), stat="identity", width=.5) + 
  expand_limits(y=0) +
  xlab("Variation") + 
  ggtitle("APC for Quality across Variations") +
  coord_flip()
apcComparisonPlot
grid.arrange(v1Plot + ggtitle("V1"), v2Plot + ggtitle("V2"), v3Plot + ggtitle("V3"), nrow=1)

Exercise for the reader: Make an example where APC is larger than in Variation 1 but "Impact" is much smaller.

Credit Scoring Example

load(file="../examples/loan-defaults.RData")

SeriousDlqin2yrs (target variable, 7% "yes"): Person experienced 90 days past due delinquency or worse
RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits
age: Age of borrower in years
NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years.
NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due.
DebtRatio: Monthly debt payments, alimony,living costs divided by monthy gross income
MonthlyIncome: Monthly income
NumberOfOpenCreditLinesAndLoans: Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards)
NumberRealEstateLoansOrLines: Number of mortgage and real estate loans including home equity lines of credit
NumberOfDependents: Number of dependents in family excluding themselves (spouse, children etc.)

Input Distribution

allHistograms

Note: previous lateness (esp. 90+) days is rare.

Model Building

We'll use a random forest for this example:

set.seed(1)
# Turning the response to type "factor" causes the RF to be build for classification:
credit$SeriousDlqin2yrs <- factor(credit$SeriousDlqin2yrs) 
rfFit <- randomForest(SeriousDlqin2yrs ~ ., data=credit, ntree=ntree)

Aggregate Predictive Comparisons

set.seed(1)
apcDF <- GetPredCompsDF(rfFit, credit,
                        numForTransitionStart = numForTransitionStart,
                        numForTransitionEnd = numForTransitionEnd,
                        onlyIncludeNearestN = onlyIncludeNearestN)

kable(apcDF, row.names=FALSE)

Impact Plot

(Showing +/- the absolute impact, since signed impact is bounded between those numbers)

(Showing impact rather than APC b/c the different APC units wouldn't be comparable, shouldn't go on one chart)

PlotPredCompsDF(apcDF)

Summaries like this can guide questions that push is to dig deeper, like:

What's going on with age? (To get the cancellation we see, it must have strong effects that vary in sign)
Why don't instances of previous lateness always increase your probability of a 90-days-past-due incident? (cancellation?)

Goals for sensitivity analysis

want something that properly accounts for relationships among variables of interest
want something that emphasizes the values that are plausible given the other values (two reasons: this is what we care about, and this is what our model has better estimates of)

v3PlotWithCurves + ggtitle("Wines: Variation 3")

How we'll do sensitivity analysis

choose a few random values for $v$ (the "all else held equal")
sample from $u$ conditional on $v$
plot $u$ vs. the prediction for each $v$

ggplot(pairsSummarizedAge, aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  xlab("age") +
  ylab("Prediction") + 
  guides(color = FALSE)

Zooming in...

ggplot(subset(pairsSummarizedAge, OriginalRowNumber <= 8), 
       aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  xlab("age") +
  ylab("Prediction") + 
  guides(color = FALSE) + 
  coord_cartesian(ylim=(c(-.01,.2)))

shows off some weird behavior of the model!
we should dig deeper, get more comfortable with the model
try other models

Sensitivity: Number of Time 30-35 Days Past Due

Mostly we see the increasing probability that we'd expect...

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  scale_x_continuous(limits=c(0,2)) +
  scale_size_area() + 
  xlab("NumberOfTime30.59DaysPastDueNotWorse") +
  ylab("Prediction") + 
  guides(color = FALSE)

Sensitivity: Number of Time 30-35 Days Past Due

... but in one case, probability of default decreases with the 0-to-1 transition

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(aes(alpha=ifelse(OriginalRowNumber == 18, 1, .3))) +
  scale_x_continuous(limits=c(0,2)) +
  scale_alpha_identity() +
  scale_size_area() + 
  xlab("NumberOfTime30.59DaysPastDueNotWorse") +
  ylab("Prediction") + 
  guides(color = FALSE)

Can we explain it?

oneRowWithDecreasingDefaultProbability <- oneOriginalRowNumber[1,intersect(names(oneOriginalRowNumber), names(credit))]
kable(oneRowWithDecreasingDefaultProbability[,1:5])
kable(oneRowWithDecreasingDefaultProbability[,6:8])
kable(oneRowWithDecreasingDefaultProbability[,9:10])

Alternative Approach to Sensitivity Analysis: "Partial Plots"

e.g. partialPlot function in randomForest library in R

This accounts for relationships among the "all else held equal" but not between those and the input under consideration

Make predictions on a new data set constructed as follows:

oneRow <- data.frame(v1=1, v2="a", v3=2.3, u=6)

One row:

kable(oneRow)

Repeat the row varying $u$ across its whole range:

oneRowRep <- oneRow[rep(1,20),]
oneRowRep$u <- 1:20
kable(oneRowRep)

Then average predictions grouping by $u$ (or not)

Partial Plot Method Applied to Wines

v3Plot + geom_point(aes(x=Quality, y=PurchaseProbability), stat="summary", fun.y="mean", data=linesDF) +
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                        data = linesDF,
                        size=1) +
  ggtitle("(V3)")

Comparison with other approaches

Things that vary

Units - depend on input or consistent across inputs
Sensitive to univariate distribution of inputs
Sensitive to dependence between inputs
Shows shape of non-linearity
Signed
Model
Based on holdout performance

Computing the APC

sampling a $v$ is easy (take a row from your data)
sampling $u_1$ given $v$ is easy (take the $u$ in that row)
sampling another $u$ ($u_2$) is harder
would be easy if you had a lot of data with the same $v$
so take $u$'s from rows of data where the correspinding $v$ is close to your $v$

In Practice

Gelman & Pardoe look at all pairs of rows and assign weights $$\frac{1}{1 + d}$$
I think this was giving too much weight to faraway points (volume of $n$-spheres grows much faster than linearly in $d$)
I use Gelman's weights, but only looking at a fixed number of the closest points
Computational advantage: fewer points to deal with
A few example

pairsOrdered <- pairs[order(pairs$OriginalRowNumber),]

for (i in 1:20) {
  cat("\n\n")
  kable(head(subset(pairsOrdered, OriginalRowNumber==i)[c("OriginalRowNumber", "age", "DebtRatio", "MonthlyIncome", "NumberOfOpenCreditLinesAndLoans", "NumberOfTime30.59DaysPastDueNotWorse", "NumberOfTime30.59DaysPastDueNotWorse.B","yHat1","yHat2", "Weight")]))
}

dchudz/predcomps documentation built on May 15, 2019, 1:48 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dchudz/predcomps Average Predictive Comparisons

In dchudz/predcomps: Average Predictive Comparisons

An R Package to Help Interpret Predictive Models:

(Average) Predictive Comparisons

Related concepts

This approach:

Plan

Silly Example

Logistic Regression

Distribution of Inputs (variation 1):

We don't really need a model to understand this...

We don't really need a model to understand this...

Variation 2

In this world, quality matters more....

Variation 3

In this world, quality matters even more...

Now quality matters more

Lessons from this example

Goal is single-number summaries

Some notation

What average do we take?

Variations

Computation

Returning to the wines...

Credit Scoring Example

Input Distribution

Model Building

Aggregate Predictive Comparisons

Impact Plot

Goals for sensitivity analysis

How we'll do sensitivity analysis

Zooming in...

Sensitivity: Number of Time 30-35 Days Past Due

Sensitivity: Number of Time 30-35 Days Past Due

Can we explain it?

Alternative Approach to Sensitivity Analysis: "Partial Plots"

Partial Plot Method Applied to Wines

Comparison with other approaches

Computing the APC

In Practice

R Package Documentation

Browse R Packages

We want your feedback!

dchudz/predcomps
Average Predictive Comparisons