Monotone (or isotone) regression via the isotone package (the scam package and gbm's var.monotone option are also worth a look, and have the additional advantage of being low complexity).
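For orientation, here is a minimal self-contained sketch of the core operation, using base R's stats::isoreg() (the isotone package's gpava() generalizes this to other losses, weights, and tie handling):

# minimal sketch of isotone regression itself, in base R
x <- sort(10*runif(50))
y <- x^2 + 25*rnorm(50)
fit <- isoreg(x, y)   # pool-adjacent-violators algorithm
head(fit$yf)          # non-decreasing fitted values, one per (sorted) x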

We will use the vtreat package custom coder interface, which supplies cross-validated significance calculations and out-of-sample interpolation (allowing us to apply the monotone transforms to new data). For a more substantial application of vtreat custom coding, please see the partial pooling application.
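The file isotone.R (sourced below) supplies the solveNonDecreasing and solveNonIncreasing coders; its exact contents are not shown here. As a rough sketch only, assuming vtreat's custom coder calling convention (variable name, numeric input column, outcome, optional weights) and leaning on base R's isoreg() in place of the isotone package, such a coder might look like:

# sketch only -- not the actual isotone.R; assumes vtreat's custom
# coder convention of f(varName, xcol, y, weights) returning the
# coded training column (vtreat handles interpolation to new data)
solveNonDecreasingSketch <- function(varName, xcol, y, weights = NULL) {
  ord <- order(xcol)                      # sort rows by the input variable
  fit <- stats::isoreg(xcol[ord], y[ord]) # pool-adjacent-violators fit
  soln <- numeric(length(y))
  soln[ord] <- fit$yf                     # fitted values, back in row order
  soln                                    # note: isoreg() ignores weights
}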

Regression

suppressPackageStartupMessages(library("ggplot2"))
library("seplyr")
source("isotone.R")

# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5

ggplot(data=d, aes(x=x)) + 
  geom_line(aes(y=yIdeal), color='blue', linetype=2) + 
  geom_point(aes(y=yObserved, color=isTrain, shape=isTrain)) +
  ylab('y') +
  ggtitle("ideal and observed responses as functions of x",
          subtitle = "dashed curve: ideal (pre-noise) values")

customCoders = list('n.NonDecreasingV.num' = solveNonDecreasing,
                    'n.NonIncreasingV.num' = solveNonIncreasing)
treatments <- vtreat::designTreatmentsN(d[d$isTrain, , drop=FALSE], 
                                        'x', 'yObserved', 
                                        customCoders = customCoders,
                                        verbose = FALSE)
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
dTreated <- vtreat::prepare(treatments, d)
d$soln <- dTreated$x_NonDecreasingV

dTrain <- d[d$isTrain, , drop=FALSE]

# good inference on train
sum((dTrain$yIdeal - dTrain$soln)^2)
sum((dTrain$yIdeal - dTrain$yObserved)^2)

dTest <- d[!d$isTrain, , drop=FALSE]

# good performance on test
sum((dTest$yIdeal - dTest$soln)^2)
sum((dTest$yIdeal - dTest$yObserved)^2)

ggplot(data=d, aes(x=x)) + 
  geom_line(aes(y=yIdeal), color='blue', linetype=2) + 
  geom_point(aes(y=yObserved, color=isTrain, shape=isTrain)) +
  geom_line(aes(x=x, y=soln), color='darkgreen') +
  ylab('y') +
  ggtitle("ideal and observed responses as functions of x",
          subtitle = "solid path: isotone fit")

The above formulation is kind of exciting. You get one degree of freedom per data row (a very large number), yet a simple constraint system (that the produced predictions must respect the x-order constraints) is enough to produce reasonable fits. This is reminiscent of the maximum entropy formulation of logistic regression, and is evidence that one is working with a sort of dual formulation of a smaller primal problem.
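Written out, the non-decreasing fit is the standard squared-error isotone problem, with one free parameter f_i per training row:

\min_{f_1,\dots,f_n} \sum_{i=1}^{n} w_i (y_i - f_i)^2 \quad \text{subject to} \quad f_i \le f_j \text{ whenever } x_i \le x_j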

Some notes on smoother implementations can be found here.

Classification

We can also easily adapt to classification and to categorical inputs.

source("isotone.R")

# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- -d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5
threshold <- -50
d$yIdeal <- d$yIdeal >= threshold
d$yObserved <- d$yObserved >= threshold


# could also build link-space versions
customCoders = list('c.NonDecreasingV.num' = solveNonDecreasing,
                    'c.NonIncreasingV.num' = solveNonIncreasing)
treatments <- vtreat::designTreatmentsC(d[d$isTrain, , drop=FALSE], 
                                        'x', 'yObserved', TRUE,
                                        customCoders = customCoders,
                                        verbose = FALSE)
# examining variables
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
# copy fit over to original data frame
dTreated <- vtreat::prepare(treatments, d)
d$soln <- dTreated$x_NonIncreasingV
d$prediction <- d$soln>0.5

dTrain <- d[d$isTrain, , drop=FALSE]

# good inference on train
sigr::wrapChiSqTest(dTrain, 'soln', 'yIdeal')
table(dTrain$prediction, dTrain$yIdeal)
sigr::wrapFisherTest(dTrain, 'prediction', 'yIdeal')
table(dTrain$yObserved, dTrain$yIdeal)
sigr::wrapFisherTest(dTrain, 'yObserved', 'yIdeal')

dTest <- d[!d$isTrain, , drop=FALSE]

# good performance on test
sigr::wrapChiSqTest(dTest, 'soln', 'yIdeal')
table(dTest$prediction, dTest$yIdeal)
sigr::wrapFisherTest(dTest, 'prediction', 'yIdeal')
table(dTest$yObserved, dTest$yIdeal)
sigr::wrapFisherTest(dTest, 'yObserved', 'yIdeal')

Model calibration/polish

One application where we have used the monotone methodology with good success is calibrating regressions and classifiers.

This is an idea that came up in discussion with Jeremy Howard. Jeremy, while President and Chief Scientist of Kaggle, decided AUC was a good "early way to see if you have something" metric for running contests. That leaves the question: what is a principled mechanical way to convert a score with a good AUC into a score with correct probabilities (a good deviance)? Kaggle participants are famous for good solutions. But here is our solution: an arbitrary isotone transform.

That is: we take a model that does well on the AUC measure (meaning it is good at ranking, or reproducing order relations) and build the best model with the same order structure with respect to a more stringent measure (such as sum of squared errors, or deviance). Often this step is skipped or done by binning or some other ad hoc method; but for systems whose scores are not natively in probability units (such as margin-based systems like support vector machines) this isotone calibration or polish step can be an improvement (assuming one is careful about nested model bias issues).
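One way to be careful about nested model bias is to fit the calibration on cross-validated scores rather than on the same rows the calibration saw. A sketch using vtreat's cross-frame machinery (mkCrossFrameCExperiment is real vtreat API; dCal and the column names are placeholders for this illustration, and we assume it accepts customCoders just as designTreatmentsC() does):

# sketch: isotone calibration without nested model bias via a cross frame
cfe <- vtreat::mkCrossFrameCExperiment(
  dCal,                   # placeholder: calibration rows with a rawScore column
  'rawScore', 'yObserved', TRUE,
  customCoders = list('c.NonDecreasingV.num' = solveNonDecreasing))
calibrated <- cfe$crossFrame$rawScore_NonDecreasingV  # cross-validated scores
treatments <- cfe$treatments  # use vtreat::prepare(treatments, newData) later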

We can try that. Suppose we forgot to set type = "response" on a logistic regression, and we didn't know the link function is the sigmoid (so we can't directly apply the correction).

source("isotone.R")

# set up example data
set.seed(23525)
d <- data.frame(x = 10*runif(200))
d$yIdeal <- -d$x^2
d$yObserved <- d$yIdeal + 25*rnorm(nrow(d))
d$isTrain <- runif(nrow(d))<=0.5
threshold <- -50
d$yIdeal <- d$yIdeal >= threshold
d$yObserved <- d$yObserved >= threshold



model <- glm(yObserved ~ x, data = d[d$isTrain, , drop=FALSE], family = binomial)
sigr::wrapChiSqTest(model)

d$rawScore <- predict(model, newdata = d)  # oops forgot type='response'
head(d)
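For comparison, the correction we are pretending not to know is just base R's inverse logit:

# the fix we are pretending not to know: base R's inverse logit,
# equivalent to predict(model, newdata = d, type = 'response')
head(plogis(d$rawScore))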

# fix it with isotone regression
customCoders = list('c.NonDecreasingV.num' = solveNonDecreasing)
treatments <- vtreat::designTreatmentsC(d[d$isTrain, , drop=FALSE], 
                                        'rawScore', 'yObserved', TRUE,
                                        customCoders = customCoders,
                                        verbose = FALSE)
print(treatments$scoreFrame[, c('varName', 'rsq', 'sig', 'needsSplit'), drop=FALSE])
# collar: later used to clamp predictions away from exactly 0 and 1
predictionCollar <- 2/(sum(d$isTrain)-1)

Notice in the score frame that vtreat's cross-validation based scoring correctly identifies that the isotone encoding is over-fitting and is not in fact better than the sigmoid link function.

# copy fit over to original data frame
dTreated <- vtreat::prepare(treatments, d)
d$adjScore <- dTreated$rawScore_NonDecreasingV
# apply the collar: clamp scores into [collar, 1-collar]
d$adjScore <- pmax(predictionCollar, pmin(1-predictionCollar, d$adjScore))
d$adjPred <- d$adjScore>=0.5

# and the correct link
d$linkScore <-  predict(model, newdata = d, type = 'response')
d$linkPred <- d$linkScore>=0.5
head(d)

dTest <- d[!d$isTrain, , drop=FALSE]

Adjusted Score versus observed outcomes

sigr::wrapChiSqTest(dTest, 'adjScore', 'yObserved')
dTest %.>% 
  group_by_se(., "yObserved") %.>%
  summarize_nse(., minAdj := min(adjScore), maxAdj := max(adjScore))
WVPlots::DoubleDensityPlot(dTest, 'adjScore', 'yObserved',
                           "adjusted prediction against observations")

Link Score versus observed outcomes

sigr::wrapChiSqTest(dTest, 'linkScore', 'yObserved')
dTest %.>% 
  group_by_se(., "yObserved") %.>%
  summarize_nse(., minAdj := min(linkScore), maxAdj := max(linkScore))
WVPlots::DoubleDensityPlot(dTest, 'linkScore', 'yObserved',
                           "link prediction against observations")

Adjusted Score versus Ideal (Unobserved) Concept

sigr::wrapChiSqTest(dTest, 'adjScore', 'yIdeal')
dTest %.>% 
  group_by_se(., "yIdeal") %.>%
  summarize_nse(., minAdj := min(adjScore), maxAdj := max(adjScore))
WVPlots::DoubleDensityPlot(dTest, 'adjScore', 'yIdeal',
                           "adjusted prediction against ideal (unobserved) concept")

Link Score versus Ideal (Unobserved) Concept

sigr::wrapChiSqTest(dTest, 'linkScore', 'yIdeal')
dTest %.>% 
  group_by_se(., "yIdeal") %.>%
  summarize_nse(., minAdj := min(linkScore), maxAdj := max(linkScore))
WVPlots::DoubleDensityPlot(dTest, 'linkScore', 'yIdeal',
                           "link prediction against ideal (unobserved) concept")

Conclusion

And we see the adjusted prediction is pretty good, even with the nested model bias issue. In fact, even though it shows signs of over-fit on the test set, it outperforms the original link score in recovering the (unobserved) original concept. This is because the inductive bias we introduced (a monotone solution) is true of the concept (the mapping of link scores to probabilities) but is not a property of the noise model; so the transform prefers signal.


