evThreshold: Classification Probability Threshold Optimization Based on...
In etlundquist/eRic: Eric's R functions developed while a summer analytics intern at Enova

Description Usage Arguments Details Value Examples

Calculate an optimal probability threshold for a classification problem based on the actual class distribution, the model predicted probabilities, and the costs/benefits associated with each cell in the confusion matrix.

evThreshold(response, pprob, crMatrix, plot.points = 100)

`response`	(numeric) vector of actual class values (as a binary numeric variable)
`pprob`	(numeric) vector of model predicted probabilities
`crMatrix`	(numeric) matrix containing the net revenue associated with each cell in the confusion matrix (see details)
`plot.points`	(integer) number of threshold probabilities to evaluate when making the performance metric plot (see details)

Given actual class values and model-predicted probabilities, you can optimize the classification threshold with respect to the costs and benefits of correctly identifying positive and negative cases and of making Type I and Type II errors. This is useful when overall classification accuracy does not align with the goals of the modeling process and you can reasonably estimate the costs and benefits associated with each cell in the confusion matrix. The optimization problem then becomes one of maximizing the expected value of a new case based on:

a. the predictive power of the model;
b. the actual class distribution (i.e. the proportion of positive cases);
c. the benefits and costs associated with the new case ending up in each cell of the confusion matrix.

The expected value of a new case can be expressed as:

EV = pr(P) * [TPR*R(TP) - FNR*C(FN)] + pr(N) * [TNR*R(TN) - FPR*C(FP)]

Where:

pr(P) - proportion of positive cases
pr(N) - proportion of negative cases
TPR - True Positive Rate
R(TP) - Revenue/Utility associated with a True Positive
FNR - False Negative Rate
C(FN) - Cost/Utility associated with a False Negative
TNR - True Negative Rate
R(TN) - Revenue/Utility associated with a True Negative
FPR - False Positive Rate
C(FP) - Cost/Utility associated with a False Positive
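The formula can be checked by hand on a small data set. The following is a minimal sketch, not the eRic source: `ev_at_threshold` is a hypothetical helper, and classifying a case as positive when `pprob >= threshold` is an assumption. Because crMatrix stores costs as negative numbers, the sketch adds the off-diagonal entries, which is equivalent to subtracting the cost magnitudes C(FN) and C(FP) in the formula.

```r
# Hypothetical helper (not part of eRic): expected value per case at one threshold.
ev_at_threshold <- function(response, pprob, crMatrix, threshold) {
  # Assumption: a case is predicted positive when pprob >= threshold
  predicted <- as.numeric(pprob >= threshold)

  pr_p <- mean(response == 1)                  # pr(P)
  pr_n <- mean(response == 0)                  # pr(N)

  tpr <- mean(predicted[response == 1] == 1)   # True Positive Rate
  fnr <- 1 - tpr                               # False Negative Rate
  tnr <- mean(predicted[response == 0] == 0)   # True Negative Rate
  fpr <- 1 - tnr                               # False Positive Rate

  # Rows of crMatrix = actual class, columns = predicted class (negative first).
  # Costs are negative entries, so they are added rather than subtracted.
  pr_p * (tpr * crMatrix[2, 2] + fnr * crMatrix[2, 1]) +
    pr_n * (tnr * crMatrix[1, 1] + fpr * crMatrix[1, 2])
}

# Toy data: three positive cases and two negative cases
response <- c(1, 1, 1, 0, 0)
pprob    <- c(0.9, 0.7, 0.3, 0.6, 0.2)
crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)

ev <- ev_at_threshold(response, pprob, crMatrix, threshold = 0.5)
print(ev)
```

At a threshold of 0.5 the toy data yield two true positives, one false negative, one true negative, and one false positive, so the result can be verified by summing the payoffs of the five cases and dividing by five.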

Specify the benefit or cost associated with each confusion matrix cell in crMatrix, where rows correspond to actual class values and columns correspond to predicted class values, with the negative class first. Thus crMatrix[1,1] is the benefit of a true negative, crMatrix[2,1] is the cost of a false negative (a positive case classified as negative), crMatrix[1,2] is the cost of a false positive (a negative case classified as positive), and crMatrix[2,2] is the benefit of a true positive. Diagonal entries (benefits) will typically be greater than or equal to zero, and off-diagonal entries (costs) will typically be less than or equal to zero, since costs are expressed as negative numbers.
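To make the cell layout easier to read, the matrix can be written row by row with labels. This is only an illustrative sketch: the values are the ones used in the Examples, and the dimnames are an addition for readability that the documentation does not mention, so they are assumed to be ignored by positional indexing.

```r
# Rows = actual class, columns = predicted class, negative class first.
crMatrix <- matrix(
  c( 0, -4,    # actual negative: r(TN), c(FP)
    -2,  2),   # actual positive: c(FN), r(TP)
  nrow = 2, byrow = TRUE,
  dimnames = list(actual    = c("negative", "positive"),
                  predicted = c("negative", "positive"))  # labels are illustrative only
)

print(crMatrix)
crMatrix[2, 2]   # benefit of a true positive
crMatrix[1, 2]   # cost of a false positive
```

Filling with `byrow = TRUE` produces the same matrix as the column-wise `matrix(c(0, -2, -4, 2), nrow = 2)` in the Examples.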

The function will return the optimal classification threshold, the unit expected value given that threshold, and a ggplot2 object containing series for Sensitivity, Specificity, Accuracy, and Normalized Expected Value with respect to different probability thresholds. The expected value series is normalized to [0,1] so that it can be displayed on the same plot as the other metrics.
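The threshold search and the normalization described above can be sketched as follows. This is an assumption-laden illustration, not the package's implementation: the evenly spaced grid, the `pprob >= threshold` rule, min-max scaling, and all variable names are hypothetical choices. The unit expected value is computed as the sum of cell counts times cell payoffs divided by the number of cases, which is algebraically the same as the EV formula.

```r
# Hypothetical sketch (not the eRic source): grid search over thresholds.
response <- c(1, 1, 1, 0, 0, 1, 0, 1)
pprob    <- c(0.9, 0.7, 0.3, 0.6, 0.2, 0.8, 0.4, 0.55)
crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)

# Assumed grid: 100 evenly spaced thresholds, mirroring plot.points = 100
thresholds <- seq(0, 1, length.out = 100)

unit_ev <- sapply(thresholds, function(th) {
  predicted <- as.numeric(pprob >= th)
  # Fixed factor levels keep the table 2 x 2 even when one class is never predicted
  counts <- table(factor(response,  levels = 0:1),
                  factor(predicted, levels = 0:1))
  sum(counts * crMatrix) / length(response)
})

best           <- which.max(unit_ev)   # first maximum if several thresholds tie
best_threshold <- thresholds[best]
best_ev        <- unit_ev[best]

# Min-max scaling to [0,1] so the series can share an axis with rates such as Sensitivity
ev_norm <- (unit_ev - min(unit_ev)) / (max(unit_ev) - min(unit_ev))

print(c(best_threshold = best_threshold, best_ev = best_ev))
```

With so few cases several thresholds tie for the maximum, and `which.max` reports the lowest one; a real application might prefer a different tie-breaking rule.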

The function returns a list containing the following elements:

best.threshold - optimal probability threshold
best.ev - unit expected value given the optimal probability threshold
plot.metrics - a plot showing various performance metrics with respect to cutoff threshold

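library(gbm)
library(caret)
library(eRic)   # provides evThreshold()
data(GermanCredit, package = 'caret')

# recode the outcome as a binary numeric variable (1 = 'Good') and drop near-zero-variance predictors
credit <- GermanCredit
credit$Class <- as.numeric(credit$Class == 'Good')
credit  <- credit[,-nearZeroVar(credit)]

# fit a boosted model and predict at the number of trees selected by cross-validation
gbm.fit <- gbm(Class ~ ., data = credit, n.trees = 100, shrinkage = 0.1, cv.folds = 5, distribution = 'bernoulli')
pprob   <- predict(gbm.fit, n.trees = gbm.perf(gbm.fit), type = 'response')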

crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)
# matrix() fills column-wise, so the cells are [1,1] = r(TN), [2,1] = c(FN), [1,2] = c(FP), [2,2] = r(TP)
# true negatives yield no benefit, false negatives are lost potential good customers,
# false positives are approved bad customers, and true positives are approved good customers

res <- evThreshold(credit$Class, pprob, crMatrix)
res$plot.metrics
res$best.threshold
res$best.ev

etlundquist/eRic documentation built on May 16, 2019, 9:07 a.m.
