evThreshold: Classification Probability Threshold Optimization Based on...

Description

Calculate an optimal probability threshold for a classification problem based on the actual class distribution, the model predicted probabilities, and the costs/benefits associated with each cell in the confusion matrix.

Usage

evThreshold(response, pprob, crMatrix, plot.points = 100)

Arguments

response

(numeric) vector of actual class values (as a binary numeric variable)

pprob

(numeric) vector of model predicted probabilities

crMatrix

(numeric) matrix containing the net revenue associated with each cell in the confusion matrix (see details)

plot.points

(integer) number of threshold probabilities to evaluate when making the performance metric plot (see details)

Details

Given actual class values and model predicted probabilities, one can optimize the classification threshold with respect to the costs and benefits associated with correctly identifying positive and negative cases and with making Type I and Type II errors. This is useful when overall classification accuracy doesn't align with the goals of the modeling process and you can reasonably estimate the costs and benefits associated with each cell in the confusion matrix. Given this information, the optimization problem becomes one of maximizing the expected value of a new case based on:

a. the predictive power of the model;
b. the actual class distribution (i.e. the proportion of positive cases);
c. the benefits and costs associated with the new case ending up in each cell of the confusion matrix.

The expected value of a new case can be expressed as:

EV = pr(P) * [TPR*R(TP) - FNR*C(FN)] + pr(N) * [TNR*R(TN) - FPR*C(FP)]

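Where:

pr(P), pr(N) = proportions of actual positive and negative cases
TPR, FNR = true positive rate and false negative rate (among actual positives)
TNR, FPR = true negative rate and false positive rate (among actual negatives)
R(TP), R(TN) = benefits of correctly classified positive and negative cases
C(FN), C(FP) = costs of false negatives and false positives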

You need to specify the benefits/costs associated with each confusion matrix cell in crMatrix where the rows correspond to actual class values and the columns correspond to predicted class values. This implies crMatrix[2,2] is the benefit associated with correctly identifying positive cases, and crMatrix[1,2] is the cost associated with mistakenly classifying a negative case as a positive one. Diagonal entries (benefits) will typically be greater than or equal to zero, and off-diagonal entries (costs) will typically be less than or equal to zero (costs expressed as negative numbers).
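To make the crMatrix layout concrete, the following is a minimal sketch of the expected value calculation at a single threshold. It is not the package's implementation: the helper name ev_at_threshold, the toy data, and the rule "predict positive when pprob >= threshold" are all assumptions made for illustration. Because costs are entered in crMatrix as negative numbers, summing cell proportion times cell value should agree with the formula above when C(FN) and C(FP) are read as positive magnitudes.

```r
# Hypothetical helper (not part of eRic): expected value per case at one threshold.
ev_at_threshold <- function(response, pprob, crMatrix, threshold) {
  # Assumption: a case is classified as positive when pprob >= threshold.
  predicted <- as.numeric(pprob >= threshold)
  # Joint proportions: rows = actual class (0, 1), columns = predicted class (0, 1).
  cells <- unclass(table(factor(response,  levels = c(0, 1)),
                         factor(predicted, levels = c(0, 1)))) / length(response)
  # Weight each confusion matrix cell by its net revenue and add them up.
  sum(cells * crMatrix)
}

# Toy data, made up for illustration only.
response <- c(0, 0, 0, 1, 1, 1, 1, 1)
pprob    <- c(0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9, 0.55)
crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)  # [r(TN), c(FN), c(FP), r(TP)]

ev_at_threshold(response, pprob, crMatrix, threshold = 0.5)
# 2 TN, 1 FP, 1 FN, 4 TP out of 8 cases: (0*2 - 4*1 - 2*1 + 2*4) / 8 = 0.25
```

Here table() with explicit factor levels keeps the 2x2 shape even when a threshold produces no predictions in one class, so the element-wise product with crMatrix always lines up.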

The function will return the optimal classification threshold, the unit expected value given that threshold, and a ggplot2 object containing series for Sensitivity, Specificity, Accuracy, and Normalized Expected Value with respect to different probability thresholds. The expected value series is normalized to [0,1] so that it can be displayed on the same plot as the other metrics.
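The threshold search and the [0,1] normalization described above can be sketched as follows. This is an illustration under stated assumptions rather than the function's actual code: the helper ev_at, the evenly spaced grid on [0, 1], the ">=" classification rule, min-max rescaling as the normalization, and the toy data are all hypothetical.

```r
# Hypothetical helper (not part of eRic): expected value per case at one threshold.
ev_at <- function(threshold, response, pprob, crMatrix) {
  predicted <- as.numeric(pprob >= threshold)  # assumed classification rule
  cells <- unclass(table(factor(response,  levels = c(0, 1)),
                         factor(predicted, levels = c(0, 1)))) / length(response)
  sum(cells * crMatrix)
}

# Toy data, made up for illustration only.
response <- c(0, 0, 0, 1, 1, 1, 1, 1)
pprob    <- c(0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9, 0.55)
crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)  # [r(TN), c(FN), c(FP), r(TP)]

# Evaluate a grid of candidate thresholds (100 points, echoing plot.points = 100).
thresholds <- seq(0, 1, length.out = 100)
ev <- vapply(thresholds, ev_at, numeric(1),
             response = response, pprob = pprob, crMatrix = crMatrix)

# Threshold with the highest expected value (first one if there are ties).
best.threshold <- thresholds[which.max(ev)]
best.ev        <- max(ev)

# Min-max rescaling so the series shares a [0, 1] axis with rates such as accuracy.
# Note: this divides by zero if the expected value is constant across thresholds.
ev.norm <- (ev - min(ev)) / (max(ev) - min(ev))
```

Normalization only changes the scale of the plotted series; the location of the maximum, and therefore the chosen threshold, is the same before and after rescaling.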

Value

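a list containing the following elements:

best.threshold

the optimal classification threshold

best.ev

the unit expected value at the optimal threshold

plot.metrics

a ggplot2 object containing series for Sensitivity, Specificity, Accuracy, and Normalized Expected Value across probability thresholds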

Examples

library(gbm)
library(caret)
data(GermanCredit, package = 'caret')

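# recode the outcome as binary numeric (1 = 'Good') and drop near-zero-variance columns
credit <- GermanCredit
credit$Class <- as.numeric(credit$Class == 'Good')
credit  <- credit[,-nearZeroVar(credit)]

# fit a boosted model, then predict probabilities using the number of trees chosen by gbm.perf
gbm.fit <- gbm(Class ~ ., data = credit, n.trees = 100, shrinkage = 0.1,
               cv.folds = 5, distribution = 'bernoulli')
pprob   <- predict(gbm.fit, n.trees = gbm.perf(gbm.fit), type = 'response')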

crMatrix <- matrix(c(0, -2, -4, 2), nrow = 2)
# matrix cells: [r(TN), c(FN), c(FP), r(TP)]
# true negatives yield no benefit, false negatives are lost potential good customers,
# false positives are approved bad customers, and true positives are approved good customers

res <- evThreshold(credit$Class, pprob, crMatrix)
res$plot.metrics
res$best.threshold
res$best.ev

etlundquist/eRic documentation built on May 16, 2019, 9:07 a.m.