load(file="../examples/wine-logistic-regression.RData")
library(ggplot2)
theme_set(theme_gray(base_size = 18))
library(grid)
library(gridExtra)
library(knitr)
library(predcomps)
opts_chunk$set(fig.cap="", echo=FALSE, 
               fig.width=2*opts_chunk$get("fig.width"),
               fig.height=.9*opts_chunk$get("fig.height")
               )

An R Package to Help Interpret Predictive Models:

(Average) Predictive Comparisons

David Chudzicki

Related concepts

This approach:

Plan

  1. Motivating fake example
  2. Predictive comparisons definitions: what we want
  3. Applying average predictive comparisons to (1)
  4. An example with real data: credit scoring
  5. Discussion!
  6. (Appendix) Estimation & Computation: how to get what we want
  7. (Appendix) Comparison with other approaches

Example will show:

  1. Logistic regression coefficients may not be the right summary for a logistic regression
  2. Relationships among the inputs matter for measuring influence of each variable

Silly Fake Example

Logistic Regression

library(boot)  # for inv.logit
s <- seq(-4,4,by=.01)
qplot(s, inv.logit(s), geom="line") + ggtitle("Inverse Logit Curve")

True model: $P(\text{wine is purchased}) = \text{logit}^{-1}(0.1 Q - 0.12 P)$, where $Q$ is quality and $P$ is price.

Distribution of Inputs (variation 1):

Price and quality are (noisily) related:

myScales <- list(scale_x_continuous(limits=c(-15,125)),
                 scale_y_continuous(limits=c(0,1)))
qualityScale <- ylim(c(-20,130))
qplot(Price, Quality, alpha=I(.5), data = df1Sample) + 
  qualityScale
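
As a rough sketch, data like Variation 1 could be simulated as below (df1Sample itself was generated ahead of time and loaded from the .RData file; the sample size, means, and noise levels here are assumptions, and df1Simulated is an illustrative name):

set.seed(1)
n <- 5000
Quality <- rnorm(n, mean = 50, sd = 25)
Price <- Quality + rnorm(n, sd = 20)  # price (noisily) tracks quality
PurchaseProbability <- plogis(0.1 * Quality - 0.12 * Price)  # plogis() is the inverse logit
df1Simulated <- data.frame(Price, Quality, PurchaseProbability)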

We don't really need a model to understand this...

(A random subset of the data. For clarity, showing only a discrete subset of prices.)

v1Plot <- ggplot(subset(df1Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v1Plot

We don't really need a model to understand this...

For each individual price, quality vs. purchase probability forms a portion of a shifted inverse logit curve:

last_plot() + geom_line(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                        data = linesDF,
                        size=.2)

Variation 2

In another possible world, mid-range wines are more common:

qplot(Price, Quality, alpha=I(.5), data = df2Sample) + 
  qualityScale

In this world, quality matters more...

v2Plot <- ggplot(subset(df2Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v2Plot

Variation 3

In a third possible world, price varies more strongly with quality:

qplot(Price, Quality, alpha=I(.5), data = df3Sample) + 
  qualityScale

Again, in this world quality matters even more...

v3Plot <- ggplot(subset(df3Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales +
  scale_color_discrete("Price")
v3Plot

Now quality matters more

... across all price ranges (for the kinds of variation that we see in the data)

v3PlotWithCurves <- last_plot() + geom_line(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                                            data = linesDF,
                                            size=.2)
v3PlotWithCurves

Lessons from fake example

  1. In each variation, the influence of Quality changed without the regression coefficients changing
  2. Any approach to summarizing influence should be sensitive to relationships among the inputs

Headed toward single-number summaries of average influence

apcComparisonPlot <- ggplot(subset(apcAllVariations, Input=="Quality")) +
  geom_bar(aes(x=factor(Variation, levels=3:1), y=PerUnitInput.Signed), stat="identity", width=.5) + 
  expand_limits(y=0) +
  xlab("Variation") + 
  ggtitle("APC (Per Unit) for Quality across Variations") +
  coord_flip()
apcComparisonPlot
grid.arrange(v1Plot + ggtitle("V1"), v2Plot + ggtitle("V2"), v3Plot + ggtitle("V3"), nrow=1)

Generalizing Linear Regression

nPoints <- 20
set.seed(78)
UnderlyingX <- rnorm(nPoints)
df <- data.frame(X1=UnderlyingX + rnorm(nPoints), X2=UnderlyingX + rnorm(nPoints))
df$NewX1 <- df$X1 + abs(.1*rnorm(nPoints)) # only positive transitions, for simplicity
df$Y <- df$X1 + df$X2
df$NewY <- df$NewX1 + df$X2

ggplot(df) + 
  geom_segment(aes(x=X1, y=Y, xend=NewX1, yend = NewY), arrow = arrow(length = unit(0.1,"cm"))) +
  ggtitle("In linear model, per-unit change is consistent however you measure it")

Generalizing Linear Regression


df$Y <- df$X1 * df$X2
df$NewY <- df$NewX1 * df$X2

ggplot(df) + 
  geom_segment(aes(x=X1, y=Y, xend=NewX1, yend = NewY), arrow = arrow(length = unit(0.1,"cm"))) +
  ggtitle("In non-linear model, per unit change depends on X1 values and other inputs")

Goal for single-number summaries

These concepts are vague, but keep them in mind as we try to formalize things in the next few slides:

Some notation

$u$: the variable under consideration

$v$: the vector of other variables (the "all else held equal")

$f(u,v)$: a function that makes predictions, e.g. maybe $f(u,v) = \mathbb{E}[y \mid u, v, \theta]$

What average do we take?

The APC is defined as

$$\frac{\mathbb{E}[\Delta_f]}{\mathbb{E}[\Delta_u]}$$

where $\Delta_f = f(u_2, v) - f(u_1, v)$ and $\Delta_u = u_2 - u_1$, with the expectation taken over transitions: a start $(u_1, v)$ drawn from the data and an end value $u_2$ drawn from $p(u \mid v)$.

Variations

Computation
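
The package estimates this ratio from weighted pairs of rows. Below is my own minimal sketch (not the predcomps implementation; EstimateAPC is an illustrative name), using a single nearest neighbor in $v$-space as a cheap stand-in for a draw from $p(u \mid v)$:

# Minimal sketch of per-unit APC estimation (illustrative only):
EstimateAPC <- function(predictFn, X, u, v) {
  distMat <- as.matrix(dist(scale(X[, v, drop=FALSE])))  # distances among the "other" inputs
  deltaF <- deltaU <- numeric(nrow(X))
  for (i in seq_len(nrow(X))) {
    j <- order(distMat[i, ])[2]   # nearest *other* row stands in for p(u | v)
    Xnew <- X[i, , drop=FALSE]
    Xnew[[u]] <- X[j, u]          # transition u_1 -> u_2, holding v fixed
    deltaF[i] <- predictFn(Xnew) - predictFn(X[i, , drop=FALSE])
    deltaU[i] <- X[j, u] - X[i, u]
  }
  # orient each transition by sign(deltaU), then take the ratio of averages
  sum(deltaF * sign(deltaU)) / sum(deltaU * sign(deltaU))
}

For Variation 1, EstimateAPC(function(d) plogis(0.1*d$Quality - 0.12*d$Price), df1Sample, "Quality", "Price") would approximate the per-unit APC of Quality under the true model.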

Returning to the wines...

apcComparisonPlot <- ggplot(subset(apcAllVariations, Input=="Quality")) +
  geom_bar(aes(x=factor(Variation, levels=3:1), y=PerUnitInput.Signed), stat="identity", width=.5) + 
  expand_limits(y=0) +
  xlab("Variation") + 
  ggtitle("APC (Per Unit) for Quality across Variations") +
  coord_flip()
apcComparisonPlot
grid.arrange(v1Plot + ggtitle("V1"), v2Plot + ggtitle("V2"), v3Plot + ggtitle("V3"), nrow=1)

Exercise for the reader: Make an example where APC is larger than in Variation 1 but "Impact" is much smaller.
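
(Hint, sketching one possible construction: keep the same model but shrink the spread of Quality in the input distribution. Since signed impact equals the per-unit APC times the average $\Delta_u$, the APC can stay large while the impact, which scales with how much Quality actually varies, becomes much smaller.)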

Credit Scoring Example

load(file="../examples/loan-defaults.RData")

Target:

Features:

Input Distribution

histograms <- Map(function(colName) {
  qplot(credit[[colName]]) + 
    ggtitle(colName) +
    xlab("")},
  c("age", "NumberOfTime30.59DaysPastDueNotWorse", "NumberOfTime60.89DaysPastDueNotWorse", "NumberOfTimes90DaysLate"))
allHistograms <- do.call(arrangeGrob, c(histograms, ncol=2))
allHistograms

Note: previous lateness (especially 90+ days) is rare.

Model Building

We'll use a random forest for this example:

library(randomForest)
set.seed(1)
# Converting the response to type "factor" causes the RF to be built for classification:
credit$SeriousDlqin2yrs <- factor(credit$SeriousDlqin2yrs) 
rfFit <- randomForest(SeriousDlqin2yrs ~ ., data=credit, ntree=ntree)  # ntree: presumably set in an earlier (hidden) chunk

Aggregate Predictive Comparisons

set.seed(1)
# numForTransitionStart/numForTransitionEnd subsample the rows used as transition
# starts/ends, and onlyIncludeNearestN restricts to nearby pairs, keeping the
# computation tractable:
apcDF <- GetPredCompsDF(rfFit, credit,
                        numForTransitionStart = numForTransitionStart,
                        numForTransitionEnd = numForTransitionEnd,
                        onlyIncludeNearestN = onlyIncludeNearestN)
kable(apcDF, row.names=FALSE)

Impact Plot

(Showing +/- the absolute impact, since signed impact is bounded between those numbers)

(Showing impact rather than APC because each input's APC has its own units, so the APCs wouldn't be comparable on one chart)
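
(As I read the package's convention, signed impact is $\mathbb{E}[\Delta_f]$ and absolute impact is $\mathbb{E}[|\Delta_f|]$: both drop the per-unit denominator $\mathbb{E}[\Delta_u]$, so every input's impact is in units of the predicted value and they can share one chart.)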

PlotPredCompsDF(apcDF)

Summaries like this can guide questions that push us to dig deeper, like:

Goals for sensitivity analysis

v3PlotWithCurves + ggtitle("Wines: Variation 3")

How we'll do sensitivity analysis

ggplot(pairsSummarizedAge, aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  xlab("age") +
  ylab("Prediction") + 
  guides(color = FALSE)

Zooming in...

ggplot(subset(pairsSummarizedAge, OriginalRowNumber <= 8), 
       aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  xlab("age") +
  ylab("Prediction") + 
  guides(color = FALSE) + 
  coord_cartesian(ylim=(c(-.01,.2)))

Sensitivity: Number of Times 30-59 Days Past Due

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(size=.2) +
  scale_x_continuous(limits=c(0,2)) +
  scale_size_area() + 
  xlab("NumberOfTime30.59DaysPastDueNotWorse") +
  ylab("Prediction") + 
  guides(color = FALSE)

Sensitivity: Number of Times 30-59 Days Past Due

... but in one case, probability of default decreases with the 0-to-1 transition

ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) + 
  geom_point(aes(size = Weight)) +
  geom_line(aes(alpha=ifelse(OriginalRowNumber == 18, 1, .3))) +
  scale_x_continuous(limits=c(0,2)) +
  scale_alpha_identity() +
  scale_size_area() + 
  xlab("NumberOfTime30.59DaysPastDueNotWorse") +
  ylab("Prediction") + 
  guides(color = FALSE)

Can we explain it?

oneRowWithDecreasingDefaultProbability <- oneOriginalRowNumber[1,intersect(names(oneOriginalRowNumber), names(credit))]
kable(oneRowWithDecreasingDefaultProbability[,1:5])
kable(oneRowWithDecreasingDefaultProbability[,6:8])
kable(oneRowWithDecreasingDefaultProbability[,9:10])

Alternative Approach to Sensitivity Analysis: "Partial Plots"

This accounts for relationships among the "all else held equal" variables, but not between those variables and the input under consideration.

Make predictions on a new data set constructed as follows:

oneRow <- data.frame(v1=1, v2="a", v3=2.3, u=6)

One row:

kable(oneRow)

Repeat the row varying $u$ across its whole range:

oneRowRep <- oneRow[rep(1,20),]
oneRowRep$u <- 1:20
kable(oneRowRep)
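
Doing this for every row and averaging the predictions at each value of $u$ traces out the partial-plot curve. A minimal sketch, assuming a fitted model fit whose predict() returns numeric predictions, a data frame data, and the illustrative name PartialPlotValues:

# Sketch of the partial-plot (partial dependence) computation:
PartialPlotValues <- function(fit, data, u, uGrid) {
  sapply(uGrid, function(uVal) {
    newData <- data
    newData[[u]] <- uVal         # overwrite u in every row, leaving the other inputs as observed
    mean(predict(fit, newData))  # average prediction at this value of u
  })
}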

Partial Plot Method Applied to Wines

v3Plot + geom_point(aes(x=Quality, y=PurchaseProbability), stat="summary", fun.y="mean", data=linesDF) +
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                        data = linesDF,
                        size=1) +
  ggtitle("(V3)")

Comparison with other approaches

Things that vary

Computing the APC

In Practice

pairsOrdered <- pairs[order(pairs$OriginalRowNumber),]

columnsToShow <- c("OriginalRowNumber", "age", "DebtRatio", "MonthlyIncome",
                   "NumberOfOpenCreditLinesAndLoans", "NumberOfTime30.59DaysPastDueNotWorse",
                   "NumberOfTime30.59DaysPastDueNotWorse.B", "yHat1", "yHat2", "Weight")
for (i in 1:20) {
  cat("\n\n")
  # print() is needed for kable output to render inside a loop
  print(kable(head(subset(pairsOrdered, OriginalRowNumber==i)[columnsToShow])))
}

Better?

Explicitly model the distribution $p(u|v)$?

E.g. using BART (Bayesian Additive Regression Trees)


