In dongyuanwu/RSBID: Resampling Strategies for Binary Imbalanced Datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  fig.align="center"
)

The RSBID (short for Resampling Strategies for Binary Imbalanced Dadasets) package contains functions of resampling strategies to make the binary imbalanced datasets be more balanced. It is important for an imbalanced dataset before applying a classification algorithm, for the reason that class imbalance will lead to a bad performance of classifiers.

library(RSBID)
# prepare for the display of classification performance 
display <- function(prediction, reference, positiveclass) {
    cm <- caret::confusionMatrix(data=prediction, reference=reference,
                                 mode = "sens_spec", positive=positiveclass)
    print(cm$table)
    data.frame(Accuracy=cm$overall[1], Sensitivity=cm$byClass[1], Specificity=cm$byClass[2],
               row.names=NULL)
}

1 Data Import

data(abalone)
abalone$Class <- as.factor(gsub(" ", "", abalone$Class))
str(abalone)

The abalone dataset is available in the RSBID package. It has r nrow(abalone) observations and r ncol(abalone) variables. In the outcome Class, the possitive examples belong to class 18 and the negative examples belong to class 9. We would like to find the difference between these two kinds of abalone. Therefore, a good performance of a classifier to classify them correctly is expected.

table(abalone$Class)

As we can see, there are r table(abalone$Class)[1] negative examples and r table(abalone$Class)[2] positive examples. The imbalanced ratio (IR) is about r round(table(abalone$Class)[1]/table(abalone$Class)[2], 2).

2 Model Building for a Imbalanced Dataset

First of all, we need to separate this dataset into a training set and a testing set. 0.7 would be used as the cutoff.

set.seed(2019)
inTrain <- caret::createDataPartition(abalone$Class, p = 0.7, list = FALSE)
abalone_train <- abalone[inTrain, ]
abalone_test <- abalone[-inTrain, ]
table(abalone_train$Class)
table(abalone_test$Class)

The training set and the testing set are both imbalanced.

Now we apply Random Forests on this imbalanced dataset to check the classification performance.

set.seed(2019)
fit_orig <- randomForest::randomForest(Class ~ ., data = abalone_train)
pred_class <- predict(fit_orig, abalone_test, type="response")
(perf_orig <- display(pred_class, abalone_test$Class, "positive"))
knitr::kable(perf_orig, row.names=FALSE)

According to the result, the accuracy of this model is actually very high. But all positive examples (i.e., the minority samples) were classified incorrectly so that sensitivity is 0. This performance is very bad actually.

3 Model Building for a Balanced Dataset

Now we used the Under-Sampling Based on Clustering (SBC) algorithm as an example to balance the dataset.

set.seed(2019)
train_sbc <- SBC(abalone_train, "Class")
table(train_sbc$Class)

The dataset has been balanced now. Then build a new Random Forest model for this balanced dataset.

set.seed(2019)
fit_orig <- randomForest::randomForest(Class ~ ., data = train_sbc)
pred_class <- predict(fit_orig, abalone_test, type="response")
perf_sbc <- display(pred_class, abalone_test$Class, "positive")
knitr::kable(perf_sbc, row.names=FALSE)

This time the classifier greatly predict most of the minority samples in testing set, so the sensitivity is r round(perf_sbc[1, 2], 4). In the mean time, the specificity is r round(perf_sbc[1, 3], 4), which means this classifier can also correctly classify most of the majority samples. The improvement of performance is significant!

4 Comparison among Four Resampling Strategies

The RSBID package provides several resampling strategies: random over-sampling algorithm (ROS), random under-sampling algorithm (RUS), under-sampling based on clustering algorithm (SBC), and synthetic minority over-sampling technique (SMOTE) and synthetic minority over-sampling technique-Nominal Continuous (SMOTE-NC). For the reason that abalone is a dataset that its all predictors are continuous, ROS, RUS, SBC and SMOTE could be used to balance it.

The classification performance of Random Forests on four balanced datasets by respectively using these four strategies would be shown below.

set.seed(2019)
train_ros <- ROS(abalone_train, "Class")
train_rus <- RUS(abalone_train, "Class")
train_smote <- SMOTE(abalone_train, "Class")
fit_ros <- randomForest::randomForest(Class ~ ., data=train_ros)
fit_rus <- randomForest::randomForest(Class ~ ., data=train_rus)
fit_smote <- randomForest::randomForest(Class ~ ., data=train_smote)
pred_class_ros <- predict(fit_ros, abalone_test, type="response")
perf_ros <- display(pred_class_ros, abalone_test$Class, "positive")  
pred_class_rus <- predict(fit_rus, abalone_test, type="response")
perf_rus <- display(pred_class_rus, abalone_test$Class, "positive")
pred_class_smote <- predict(fit_smote, abalone_test, type="response")
perf_smote <- display(pred_class_smote, abalone_test$Class, "positive")
output <- rbind(perf_orig, perf_ros, perf_rus, perf_sbc, perf_smote)
rownames(output) <- c("Imbalanced", "Balanced by ROS", "Balanced by RUS",
                      "Balanced by SBC", "Balanced by SMOTE")
knitr::kable(output)

As we can see, all of these four algorithms helped to improve the performance of random forest more or less. On the premise of ensuring a good accuracy of a classifier, the sensitivity, which is highly related to the minority class that we are more concerned, is improved by our resampling strategies.

5 Another Quick Example

RSBID provides another dataset named bank with continuous and categorical predictors.

data(bank)
str(bank)

The bank dataset has r nrow(bank) observations and r ncol(bank) variables. The data is related with direct marketing campaigns of a Portuguese banking institution. Except the outcome deposit, there are 3 categorical predictors and 7 continuous predictors.

table(bank$deposit)

In the outcome deposit, "yes" means that the clients have subscribed a term deposit while "no" means that the clients have not subscribed a term deposit. There are r table(bank$deposit)[1] "no" examples and r table(bank$deposit)[2] "yes" examples. The imbalanced ratio (IR) is about r round(table(bank$deposit)[1]/table(bank$deposit)[2], 2).

We would like to find the difference between these two kinds of desired targets. Therefore, a good performance of a classifier to classify them correctly is expected.

First of all, we need to separate this dataset into a training set and a testing set. 0.7 would be used as the cutoff.

set.seed(2019)
inTrain <- caret::createDataPartition(bank$deposit, p = 0.7, list = FALSE)
bank_train <- bank[inTrain, ]
bank_test <- bank[-inTrain, ]
table(bank_train$deposit)
table(bank_test$deposit)

The training set and the testing set are both imbalanced and their IRs are roughly the same as the whole dataset.

Now we apply Support Vector Machine (SVM) with Gaussian kernel on the imbalanced training set and then predict the deposit status in the testing set to check the classification performance.

set.seed(2019)
fit_orig <- e1071::svm(deposit ~ ., data = bank_train, kernel = "radial",
                       probability=TRUE)
pred_class <- predict(fit_orig, bank_test)
perf_orig <- display(pred_class, bank_test$deposit, positive="yes")
knitr::kable(perf_orig, row.names=FALSE)

According to the result, the accuracy of this model is realy high. But all examples with "yes" (i.e., the minority samples) were classified incorrectly so that sensitivity is 0. This performance is very bad. There is a large space to improve it.

Then we use ROS, RUS, SBC and SMOTE-NC to balance it.

set.seed(2019)
train_ros <- ROS(bank_train, "deposit")
train_rus <- RUS(bank_train, "deposit")
train_sbc <- SBC(bank_train, "deposit")
train_smotenc <- SMOTE_NC(bank_train, "deposit")

fit_ros <- e1071::svm(deposit ~ ., data=train_ros, kernel="radial", probability=TRUE)
fit_rus <- e1071::svm(deposit ~ ., data=train_rus, kernel="radial", probability=TRUE)
fit_sbc <- e1071::svm(deposit ~ ., data=train_sbc, kernel="radial", probability=TRUE)
fit_smotenc <- e1071::svm(deposit ~ ., data=train_smotenc, kernel="radial", probability=TRUE)
pred_class_ros <- predict(fit_ros, bank_test)
perf_ros <- display(pred_class_ros, bank_test$deposit, "yes")  
pred_class_rus <- predict(fit_rus, bank_test)
perf_rus <- display(pred_class_rus, bank_test$deposit, "yes")
pred_class_sbc <- predict(fit_sbc, bank_test)
perf_sbc <- display(pred_class_sbc, bank_test$deposit, "yes")
pred_class_smotenc <- predict(fit_smotenc, bank_test)
perf_smotenc <- display(pred_class_smotenc, bank_test$deposit, "yes")
output <- rbind(perf_orig, perf_ros, perf_rus, perf_sbc, perf_smotenc)
rownames(output) <- c("Imbalanced", "Balanced by ROS", "Balanced by RUS",
                      "Balanced by SBC", "Balanced by SMOTE-NC")
knitr::kable(output)