smote: SMOTE algorithm for unbalanced classification problems

Description Usage Arguments Details Value Author(s) References Examples

Description

This function handles unbalanced classification problems using the SMOTE method. Namely, it can generate a new "SMOTEd" data set that addresses the class unbalance problem.

Usage

1
smote(form, data, perc.over = 2, k = 5, perc.under = 2)

Arguments

form

A formula describing the prediction problem

data

A data frame containing the original (unbalanced) data set

perc.over

A number that drives the decision of how many extra cases from the minority class are generated (known as over-sampling).

k

A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class.

perc.under

A number that drives the decision of how many extra cases from the majority classes are selected for each case generated from the minority class (known as under-sampling)

Details

Unbalanced classification problems cause problems to many learning algorithms. These problems are characterized by the uneven proportion of cases that are available for each class of the problem.

SMOTE (Chawla et. al. 2002) is a well-known algorithm to fight this problem. The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of these cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced dataset.

The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively. perc.over will tipically be a number above 1. With this type of values, for each case in the orginal data set belonging to the minority class, perc.over new examples of that class will be created. If perc.over is a value below 1 than a single case will be generated for a randomly selected proportion (given by perc.over) of the cases belonging to the minority class on the original data set. The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases. For instance, if 200 new examples were generated for the minority class, a value of perc.under of 1 will randomly select exactly 200 cases belonging to the majority classes from the original data set to belong to the final data set. Values above 1 will select more examples from the majority classes.

The parameter k controls the way the new examples are created. For each currently existing minority class example X new examples will be created (this is controlled by the parameter perc.over as mentioned above). These examples will be generated by using the information from the k nearest neighbours of each example of the minority class. The parameter k controls how many of these neighbours are used.

Value

The function returns a data frame with the new data set resulting from the application of the SMOTE algorithm.

Author(s)

Luis Torgo ltorgo@dcc.fc.up.pt

References

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.

Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187). http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR

Torgo, L. (2014) An Infra-Structure for Performance Estimation and Experimental Comparison of Predictive Models in R. arXiv:1412.0436 [cs.MS] http://arxiv.org/abs/1412.0436

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
## A small example with a data set created artificially from the IRIS
## data 
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa","rare","common")) 
## checking the class distribution of this artificial data set
table(data$Species)

## now using SMOTE to create a more "balanced problem"
newData <- smote(Species ~ ., data, perc.over = 6,perc.under=1)
table(newData$Species)

## Checking visually the created data
## Not run: 
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
     main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[,3]),
     main = "SMOTE'd Data")

## End(Not run)

performanceEstimation documentation built on May 2, 2019, 6:01 a.m.