This package offers methods to perform asymptotically bias-corrected regularized linear discriminant analysis (ABC_RLDA) for cost-sensitive binary classification. The bias-correction is an estimate of the bias term added to regularized discriminant analysis (RLDA) that minimizes the overall risk. The default magnitude of misclassification costs are equal and set to 0.5; however, the package also offers the options to set them to some predetermined values or, alternatively, take them as hyperparameters to tune.
For some application cost of making mistake
For example, in case of some diseases it is more important to
decrease number of false negatives (illness is not detected) even at expence
The purpose of this section is to give examples of basic usage in order to give users a general sense of the package. For illustration purposes iris dataset will be used. First, we load the abcrlda package and iris dataset.
library(abcrlda) data(iris)
Abcrlda is designed for binary classification therefore we will limit training data to two species (virginica and versicolor), 50 observation for each.
train_data <- iris[which(iris[, ncol(iris)] == "versicolor" | iris[, ncol(iris)] == "virginica"), 1:4] train_label <- factor(iris[which(iris[, ncol(iris)] == "versicolor" | iris[, ncol(iris)] == "virginica"), 5])
After that we are ready to fit model and make predictions using resubstitution.
model <- abcrlda(train_data, train_label) predict(model, train_data)
Accuracy can be evaluated by this function:
risk_calculate(model, train_data, train_label)
Cost is one of the most important hyperparameters for the abcrlda. It controls the overall misclassification risk for classes during model fitting. Risk rate for more important class tends to be smaller.
R = e_0 * C_10 + e_1 * C_01
model <- abcrlda(train_data, train_label, cost = c(0.75, 0.25)) test_versicolor <- iris[which(iris[, ncol(iris)] == "versicolor"), 1:4] test_virginica <- iris[which(iris[, ncol(iris)] == "virginica"), 1:4] r <- predict(model, test_versicolor) sum(r != "versicolor") / nrow(test_virginica) # resubstitution error for first class r <- predict(model, test_virginica) sum(r != "virginica") / nrow(test_virginica) # resubstitution error for second class
The effect of costs switching can be seen on error rates:
model <- abcrlda(train_data, train_label, cost = c(0.25, 0.75)) test_versicolor <- iris[which(iris[, ncol(iris)] == "versicolor"), 1:4] test_virginica <- iris[which(iris[, ncol(iris)] == "virginica"), 1:4] r <- predict(model, test_versicolor) sum(r != "versicolor") / nrow(test_virginica) # resubstitution error for first class r <- predict(model, test_virginica) sum(r != "virginica") / nrow(test_virginica) # resubstitution error for second class
Cost values can be provided in different ways. There is short form for normalized cost values:
model <- abcrlda(train_data, train_label, gamma = 0.5, cost = 0.75) a <- predict(model, train_data)
Which is equivalent to:
# same params but more explicit model <- abcrlda(train_data, train_label, gamma = 0.5, cost = c(0.75, 0.25)) b <- predict(model, train_data)
Scaling of a cost do not change prediction that will be made by model:
# same class costs ratio model <- abcrlda(train_data, train_label, gamma = 0.5, cost = c(3, 1)) c <- predict(model, train_data) # all this model will give the same predictions all(a == b & a == c & b == c)
This package contain function to performs grid search to estimate the optimal hyperparameters (gamma and cost) within specified space based on double asymptotic risk estimation or cross validation. Double asymptotic risk estimation is more efficient to compute because it uses closed form for risk estimation. For further details, refer to the article in the reference section.
cost_range <- seq(0.1, 0.9, by = 0.2) gamma_range <- c(0.1, 1, 10, 100, 1000) gs <- grid_search(train_data, train_label, range_gamma = gamma_range, range_cost = cost_range, method = "estimator") model <- abcrlda(train_data, train_label, gamma = gs$gamma, cost = gs$cost)
cost_range <- matrix(1:10, ncol = 2) gamma_range <- c(0.1, 1, 10, 100, 1000) gs <- grid_search(train_data, train_label, range_gamma = gamma_range, range_cost = cost_range, method = "cross") model <- abcrlda(train_data, train_label, gamma = gs$gamma, cost = gs$cost)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.