John Mount, Win-Vector LLC 2018-12-03
Monotone (or isotone) regression via the isotone
package (also give scam
and gbm
var.monotone
a look, which should have the advantage of also being low complexity).
We will use the vtreat
package custom coder interface, which will supply cross-validated significance calculations and out-of sample interpolation (allowing us to apply the monotone transforms to new data). For a more substantial application of vtreat
custom coding please see the partial pooling application.
suppressPackageStartupMessages(library("ggplot2"))
library("seplyr")
## Loading required package: wrapr
source("isotone.R")
# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5
ggplot(data=d, aes(x=x)) +
geom_line(aes(y=yIdeal), color='blue', linetype=2) +
geom_point(aes(y=yObserved, color=isTrain, shape=isTrain)) +
ylab('y') +
ggtitle("ideal and observed responses as functions of x",
subtitle = "dashed curve: ideal (pre-noise) values")
customCoders = list('n.NonDecreasingV.num' = solveNonDecreasing,
'n.NonIncreasingV.num' = solveNonIncreasing)
treatments <- vtreat::designTreatmentsN(d[d$isTrain, , drop=FALSE],
'x', 'yObserved',
customCoders = customCoders,
verbose = FALSE)
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
## varName rsq sig needsSplit
## 1 x_NonDecreasingV 0.5809673 9.905695e-18 TRUE
## 2 x_clean 0.6129088 3.320652e-19 FALSE
dTreated <- vtreat::prepare(treatments, d)
d$soln <- dTreated$x_NonDecreasingV
dTrain <- d[d$isTrain, , drop=FALSE]
# good inference on train
sum((dTrain$yIdeal - dTrain$soln)^2)
## [1] 7460.391
sum((dTrain$yIdeal - dTrain$yObserved)^2)
## [1] 60855.41
dTest <- d[!d$isTrain, , drop=FALSE]
# good performance on test
sum((dTest$yIdeal - dTest$soln)^2)
## [1] 14315.25
sum((dTest$yIdeal - dTest$yObserved)^2)
## [1] 75152.66
ggplot(data=d, aes(x=x)) +
geom_line(aes(y=yIdeal), color='blue', linetype=2) +
geom_point(aes(y=yObserved, color=isTrain, shape=isTrain)) +
geom_line(aes(x=x, y=soln), color='darkgreen') +
ylab('y') +
ggtitle("ideal and observed responses as functions of x",
subtitle = "solid path: isotone fit")
The above formulation is kind of exciting. You get one degree of freedom per data-row (a very large number), but a simple constraint system (that the produced predictions must follow the x-order constraints) is enough to produce reasonable fits. This reminiscent of the maximum entropy formulation of logistic regression, and is evidence one is working with a sort of dual-formulation of a smaller primal problem.
Some notes on smoother implementations can be found here.
We can also easily adapt to classification and to categorical inputs.
source("isotone.R")
# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- -d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5
threshold <- -50
d$yIdeal <- d$yIdeal >= threshold
d$yObserved <- d$yObserved >= threshold
# could also build link-space versions
customCoders = list('c.NonDecreasingV.num' = solveNonDecreasing,
'c.NonIncreasingV.num' = solveNonIncreasing)
treatments <- vtreat::designTreatmentsC(d[d$isTrain, , drop=FALSE],
'x', 'yObserved', TRUE,
customCoders = customCoders,
verbose = FALSE)
# examining variables
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
## varName rsq sig needsSplit
## 1 x_NonIncreasingV 0.3370146 2.856172e-10 TRUE
## 2 x_clean 0.3522417 1.138713e-10 FALSE
# copy fit over to original data frame
dTreated <- vtreat::prepare(treatments, d)
d$soln <- dTreated$x_NonIncreasingV
d$prediction <- d$soln>0.5
dTrain <- d[d$isTrain, , drop=FALSE]
# good inference on train
sigr::wrapChiSqTest(dTrain, 'soln', 'yIdeal')
## [1] "Chi-Square Test summary: pseudo-R2=0.688 (X2(1,N=87)=77.12, p<1e-05)."
table(dTrain$prediction, dTrain$yIdeal)
##
## FALSE TRUE
## FALSE 28 0
## TRUE 2 57
sigr::wrapFisherTest(dTrain, 'prediction', 'yIdeal')
## [1] "Fisher's Exact Test for Count Data: (odds.ratio=Inf, p<1e-05)."
table(dTrain$yObserved, dTrain$yIdeal)
##
## FALSE TRUE
## FALSE 25 11
## TRUE 5 46
sigr::wrapFisherTest(dTrain, 'yObserved', 'yIdeal')
## [1] "Fisher's Exact Test for Count Data: (odds.ratio=19.91, p<1e-05)."
dTest <- d[!d$isTrain, , drop=FALSE]
# good performance on test
sigr::wrapChiSqTest(dTest, 'soln', 'yIdeal')
## [1] "Chi-Square Test summary: pseudo-R2=0.7233 (X2(1,N=113)=105.3, p<1e-05)."
table(dTest$prediction, dTest$yIdeal)
##
## FALSE TRUE
## FALSE 38 0
## TRUE 1 74
sigr::wrapFisherTest(dTest, 'prediction', 'yIdeal')
## [1] "Fisher's Exact Test for Count Data: (odds.ratio=Inf, p<1e-05)."
table(dTest$yObserved, dTest$yIdeal)
##
## FALSE TRUE
## FALSE 24 11
## TRUE 15 63
sigr::wrapFisherTest(dTest, 'yObserved', 'yIdeal')
## [1] "Fisher's Exact Test for Count Data: (odds.ratio=8.936, p<1e-05)."
One application we have used the monotone methodology with good success is: calibrating regressions and classifiers.
This is an idea that came up in discussion with Jeremy Howard. Jeremy, while President and Chief Scientist of Kaggle, decided AUC was a good "early to see if you have something" metric for running contests. That leaves the question: what is a principled mechanical way to convert a score with a good AUC to a score with correct probabilities (a good deviance). Kaggle participants are famous for good solutions. But here is our solution: a arbitrary isotone transform.
That is: we take a model that does well on the AUC
measure (meaning it is good at ranking or reproducing order relations) and build the best model with the same order structure with respect to a more stringent measure (such as sum of squared errors, or deviance). Often this step is ignored or done by binning or some other method- but for systems that are not natively in probability units (such as margin based systems such as support vector machines) this isotone calibration or polish step can be an improvement (assuming one is careful about nested model bias issues).
We can try- that. Suppose we forgot to set type="response"
on a logistic regression and we didn't know the link function is the sigmoid (so we can't directly apply the correction).
source("isotone.R")
# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- -d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5
threshold <- -50
d$yIdeal <- d$yIdeal >= threshold
d$yObserved <- d$yObserved >= threshold
model <- glm(yObserved ~ x, data = d[d$isTrain, , drop=FALSE], family = binomial)
sigr::wrapChiSqTest(model)
## [1] "Chi-Square Test summary: pseudo-R2=0.3522 (X2(1,N=87)=41.57, p<1e-05)."
d$rawScore <- predict(model, newdata = d) # oops forgot type='response'
head(d)
## x yIdeal yObserved isTrain rawScore
## 1 3.615347 TRUE TRUE FALSE 2.0180394
## 2 6.811326 TRUE TRUE TRUE -0.1834170
## 3 7.173432 FALSE TRUE FALSE -0.4328434
## 4 9.732597 FALSE FALSE TRUE -2.1956494
## 5 3.726201 TRUE TRUE FALSE 1.9416808
## 6 2.197222 TRUE TRUE FALSE 2.9948735
# fix it with isotone regression
customCoders = list('c.NonDecreasingV.num' = solveNonDecreasing)
treatments <- vtreat::designTreatmentsC(d[d$isTrain, , drop=FALSE],
'rawScore', 'yObserved', TRUE,
customCoders = customCoders,
verbose = FALSE)
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
## varName rsq sig needsSplit
## 1 rawScore_NonDecreasingV 0.3370146 2.856172e-10 TRUE
## 2 rawScore_clean 0.3522417 1.138713e-10 FALSE
predictionCollar <- 2/(sum(d$isTrain)-1)
Notice in the score frame vtreat
's cross validation based scoring correctly indentifies that the isotone encoding is over-fitting and not in fact better than the sigmoid link function.
# copy fit over to original data frame
dTreated <- vtreat::prepare(treatments, d)
d$adjScore <- dTreated$rawScore_NonDecreasingV
d$adjScore <- pmax(predictionCollar, pmin(1-predictionCollar, d$adjScore))
d$adjPred <- d$adjScore>=0.5
# and the correct link
d$linkScore <- predict(model, newdata = d, type = 'response')
d$linkPred <- d$linkScore>=0.5
head(d)
## x yIdeal yObserved isTrain rawScore adjScore adjPred
## 1 3.615347 TRUE TRUE FALSE 2.0180394 0.91666667 TRUE
## 2 6.811326 TRUE TRUE TRUE -0.1834170 0.61111111 TRUE
## 3 7.173432 FALSE TRUE FALSE -0.4328434 0.28571429 FALSE
## 4 9.732597 FALSE FALSE TRUE -2.1956494 0.02325581 FALSE
## 5 3.726201 TRUE TRUE FALSE 1.9416808 0.91666667 TRUE
## 6 2.197222 TRUE TRUE FALSE 2.9948735 0.91666667 TRUE
## linkScore linkPred
## 1 0.8826781 TRUE
## 2 0.4542739 FALSE
## 3 0.3934475 FALSE
## 4 0.1001419 FALSE
## 5 0.8745367 TRUE
## 6 0.9523420 TRUE
dTest <- d[!d$isTrain, , drop=FALSE]
sigr::wrapChiSqTest(dTest, 'adjScore', 'yObserved')
## [1] "Chi-Square Test summary: pseudo-R2=0.1356 (X2(1,N=113)=18.96, p=1.335e-05)."
dTest %.>%
group_by_se(., "yObserved") %.>%
summarize_nse(., minAdj := min(adjScore), maxAdj := max(adjScore))
## # A tibble: 2 x 3
## yObserved minAdj maxAdj
## <lgl> <dbl> <dbl>
## 1 FALSE 0.0233 0.977
## 2 TRUE 0.0233 0.977
WVPlots::DoubleDensityPlot(dTest, 'adjScore', 'yObserved',
"adjusted prediction against observations")
sigr::wrapChiSqTest(dTest, 'linkScore', 'yObserved')
## [1] "Chi-Square Test summary: pseudo-R2=0.2027 (X2(1,N=113)=28.36, p<1e-05)."
dTest %.>%
group_by_se(., "yObserved") %.>%
summarize_nse(., minAdj := min(linkScore), maxAdj := max(linkScore))
## # A tibble: 2 x 3
## yObserved minAdj maxAdj
## <lgl> <dbl> <dbl>
## 1 FALSE 0.0860 0.989
## 2 TRUE 0.132 0.989
WVPlots::DoubleDensityPlot(dTest, 'linkScore', 'yObserved',
"link prediction against observations")
sigr::wrapChiSqTest(dTest, 'adjScore', 'yIdeal')
## [1] "Chi-Square Test summary: pseudo-R2=0.7135 (X2(1,N=113)=103.9, p<1e-05)."
dTest %.>%
group_by_se(., "yIdeal") %.>%
summarize_nse(., minAdj := min(adjScore), maxAdj := max(adjScore))
## # A tibble: 2 x 3
## yIdeal minAdj maxAdj
## <lgl> <dbl> <dbl>
## 1 FALSE 0.0233 0.521
## 2 TRUE 0.611 0.977
WVPlots::DoubleDensityPlot(dTest, 'adjScore', 'yIdeal',
"adjusted prediction against ideal (unobserved) concept")
sigr::wrapChiSqTest(dTest, 'linkScore', 'yIdeal')
## [1] "Chi-Square Test summary: pseudo-R2=0.6388 (X2(1,N=113)=93.02, p<1e-05)."
dTest %.>%
group_by_se(., "yIdeal") %.>%
summarize_nse(., minAdj := min(linkScore), maxAdj := max(linkScore))
## # A tibble: 2 x 3
## yIdeal minAdj maxAdj
## <lgl> <dbl> <dbl>
## 1 FALSE 0.0860 0.401
## 2 TRUE 0.418 0.989
WVPlots::DoubleDensityPlot(dTest, 'linkScore', 'yIdeal',
"link prediction against ideal (unobserved) concept")
And we see the adjusted prediction is pretty good, even with the nested model bias issue. In fact even though it shows sings of over-fit on the testing set, it outperforms the original links score in recovering the (unobserved) original concept. This is because the inductive bias we introduced (monotone solution) was something true for the concept (the mapping of link scores to probabilities) but not a property of the noise model; so the transform prefers signal.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.