library(predcomps)
library(ggplot2)
library(boot) 
library(knitr)
opts_chunk$set(tidy = FALSE)
set.seed(1)

A Logistic Regression with Related Inputs

(The source code for this example is here.)

We will set up a simulated data set to use for modeling the probability a customer buys a bottle of wine, given its price and quality. We'll compare a few situation varying the joint distribution of price ($P$) and quality ($Q$). The coefficients of the logistic regression determining the relationship between the inputs and the probability of purchase will not vary.

In each variation, the probability of purchase is governed by the following logistic regression model:

$$logit(P(\text{wine is purchased})) = 0.1 Q - 0.12 P$$

The APC varies across these variations, but the logistic regression coefficients remain the same. The changes in APC in each of these variations are driven entirely by changes in the distribution of the inputs. The model relating inputs to outputs is unchanged.

Variation 1

In the first variation, quality and price are independent, with price uniformly distributed and quality set to price plus Gaussian noise:

priceCoef <- -.12
qualityCoef <- .1
qualityNoiseStdDev <- 5
nWines=50000
nRowsForPlottingSample <- 1000

numForTransitionStart <- 500
numForTransitionEnd <- 10000
onlyIncludeNearestN = 100

priceQualitySlope <- .4

df1 <- local({
  price <- sample(20:120, nWines, replace=TRUE)
  quality <- price * priceQualitySlope + 22 + rnorm(nWines, sd=qualityNoiseStdDev)
  purchaseProbability <- inv.logit(priceCoef*(price - 70) + qualityCoef*(quality - 50)  )
  purchased  <- rbinom(n = length(purchaseProbability), size=1, prob=purchaseProbability)
  data.frame(Quality = quality, 
             Price = price, 
             PurchaseProbability = purchaseProbability, 
             Purchased = purchased)
  })
print(getwd())

A scatter plot (using a random subset to avoid overplotting) shows us the relationship between price and quality:

df1Sample <- df1[sample.int(nWines, size=nRowsForPlottingSample), ]
qplot(Price, Quality, alpha=I(.5), data = df1Sample) + 
  expand_limits(y=c(0,100))

When we fit a logistic regression, the coefficients are what we'd expect from the setup above:

logitFit1 <- glm(Purchased ~ Price + Quality, data = df1, family = "binomial")
logitFit1

This plot shows the relationship between quality and probability of purchase for a few prices:

myScales <- list(scale_x_continuous(limits=c(0,100)),
                 scale_y_continuous(limits=c(0,1)))

ggplot(subset(df1Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") + myScales +
  scale_color_discrete("Price")

Each colored set of points is one portion of a shifted inverse logistic curve, determined by which Price/Quality combinations actually occur in our data.

We can also see the portion of the curve that isn't represented in our data:

linesDF <- expand.grid(Price = seq(20, 120, by=10), Quality = -20:130)
linesDF$PurchaseProbability <- with(linesDF,
                                    inv.logit(priceCoef*(Price - 70) + qualityCoef*(Quality - 50)))
last_plot() + geom_line(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
                        data = linesDF,
                        size=.2)

We can get average predictive comparisons from our fitted regression:

apc1 <- GetPredCompsDF(logitFit1, df1,
                       numForTransitionStart = numForTransitionStart,
                       numForTransitionEnd = numForTransitionEnd,
                       onlyIncludeNearestN = onlyIncludeNearestN)

The GetPredCompsDF function produces a few kinds of outputs, but for now let's just focus on the signed average predictive comparison:

apc1[c("Input", "PerUnitInput.Signed")]

This means that (on average) the probability of purchase increases by about 1.2% per 1 unit increase in quality.

Variation 2

This variation will add some additional wines to the middle range of prices:

nAdditionalWines <- nWines
supplementForDF2 <- local({
  price <- sample(55:85, nWines, replace=TRUE)
  quality <- price * .4 + 22 + rnorm(nWines, sd=qualityNoiseStdDev)
  purchaseProbability <- inv.logit(priceCoef*(price - 70) + qualityCoef*(quality - 50)  )
  purchased  <- rbinom(n = length(purchaseProbability), size=1, prob=purchaseProbability)
  data.frame(Quality = quality, 
             Price = price, 
             PurchaseProbability = purchaseProbability, 
             Purchased = purchased)
  })
df2 <- rbind(df1, supplementForDF2)

A scatter plot (again, using a random subset to avoid overplotting) shows us the relationship between price and quality:

df2Sample <- df2[sample.int(nrow(df2), size=nRowsForPlottingSample), ]
qplot(Price, Quality, alpha=I(.5), data = df2Sample) + 
  expand_limits(y=c(0,100))

When we fit a logistic regression, the coefficients are similar to before, since we haven't changed the underlying model:

logitFit2 <- glm(Purchased ~ Price + Quality, data = df2, family = "binomial")
logitFit2

In the plot showing the relationship between quality and probability of purchase, we see more points at the steep section of the inverse logit curve:

ggplot(subset(df2Sample, Price %in% seq(20, 120, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") +
  myScales + 
  scale_color_discrete("Price")

The APC for quality is correspondingly larger:

apc2 <- GetPredCompsDF(logitFit2, df2,
                       numForTransitionStart = numForTransitionStart,
                       numForTransitionEnd = numForTransitionEnd,
                       onlyIncludeNearestN = onlyIncludeNearestN)

apc2[c("Input",  "PerUnitInput.Signed")]

This means that in this variation the probability of purchase increases (on average) by about 1.5% (vs. 1.2% in Variation 1) per 1-point increase in quality. The magnitude of the APC for price is also larger.

Variation 3

This is just like Variation 1, but price increases more with quality:

priceQualitySlope <- 1.2

df3 <- local({
  price <- sample(20:120, nWines, replace=TRUE)
  quality <- price * priceQualitySlope - 30 + rnorm(nWines, sd=qualityNoiseStdDev)
  purchaseProbability <- inv.logit(priceCoef*(price - 70) + qualityCoef*(quality - 50)  )
  purchased  <- rbinom(n = length(purchaseProbability), size=1, prob=purchaseProbability)
  data.frame(Quality = quality, 
             Price = price, 
             PurchaseProbability = purchaseProbability, 
             Purchased = purchased)
  })
df3Sample <- df3[sample.int(nWines, size=nRowsForPlottingSample), ]
qplot(Price, Quality, alpha=I(.5), data = df3Sample) + 
  expand_limits(y=c(0,100))

The logistic regression still comes out the same:

logitFit3 <- glm(Purchased ~ Price + Quality, data = df3, family = "binomial")
logitFit3

In this case, purchase is less certain at the low prices and more plausible at the high prices:

ggplot(subset(df3Sample, Price %in% seq(-100, 200, by=10))) + 
  geom_point(aes(x = Quality, y = PurchaseProbability, color = factor(Price)), 
             size = 3, alpha = 1) + 
  ggtitle("Quality vs. Purchase Probability at Various Prices") + 
  myScales +
  scale_color_discrete("Price")

We can get average predictive comparisons from our fitted regression:

apc3 <- GetPredCompsDF(logitFit3, df3,
                       numForTransitionStart = numForTransitionStart,
                       numForTransitionEnd = numForTransitionEnd,
                       onlyIncludeNearestN = onlyIncludeNearestN)

As expected, the APCs are (both) larger than in Variation 1:

apc3[c("Input", "PerUnitInput.Signed")]

Comparing the Variations

Comparing all of the variation in one plot, we can see the increase in the effect of wine quality on purchase probability going from Variation 1 to Variation 3:

apc1$Variation <- 1
apc2$Variation <- 2
apc3$Variation <- 3

apcAllVariations <- do.call(rbind, list(apc1, apc2, apc3))
ggplot(subset(apcAllVariations, Input=="Quality")) +
  geom_point(aes(x=factor(Variation), y=PerUnitInput.Signed), size=3) + 
  expand_limits(y=0) +
  xlab("Variation") + 
  ggtitle("Per Unit Quality APC for Quality across Variations")
save.image(file="wine-logistic-regression.RData")


dchudz/predcomps documentation built on May 15, 2019, 1:48 a.m.