Description
This function handles unbalanced classification problems using the SMOTE method. Namely, it can generate a new "SMOTEd" data set that addresses the class unbalance problem. Alternatively, it can also run a classification algorithm on this new data set and return the resulting model.
Usage

SMOTE(form, data, perc.over = 200, k = 5, perc.under = 200,
      learner = NULL, ...)

Arguments
form
    A formula describing the prediction problem.

data
    A data frame containing the original (unbalanced) data set.

perc.over
    A number that controls how many extra cases from the minority class are generated (over-sampling).

k
    The number of nearest neighbours that are used to generate the new examples of the minority class.

perc.under
    A number that controls how many extra cases from the majority classes are selected for each case generated from the minority class (under-sampling).

learner
    Optionally, a string with the name of a function that implements a classification algorithm, to be applied to the resulting SMOTEd data set (defaults to NULL).

...
    In case you specify a learner (parameter learner), further arguments to be passed to that learner function.
Details

Unbalanced classification problems cause difficulties for many learning algorithms. These problems are characterized by the uneven proportion of cases that are available for each class of the problem.

SMOTE (Chawla et al., 2002) is a well-known algorithm for fighting this problem. The general idea of the method is to artificially generate new examples of the minority class using the nearest neighbours of the existing minority cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced data set.
The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively. perc.over will typically be a number above 100. With this type of value, for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100, then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class in the original data set. The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases. For instance, if 200 new examples were generated for the minority class, a value of perc.under of 100 will randomly select exactly 200 cases belonging to the majority classes from the original data set to include in the final data set. Values above 100 will select more examples from the majority classes.
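The sampling arithmetic described above can be sketched in a few lines. This is a hypothetical helper for illustration only (not part of the DMwR package, and its rounding behaviour is an assumption); it computes how many synthetic minority cases and how many majority cases the percentages imply:

```python
def smote_counts(n_minority, perc_over, perc_under):
    """Return (n_new_minority, n_majority_kept) implied by the percentages.

    Illustrative only: exact rounding in the real implementation may differ.
    """
    if perc_over >= 100:
        # perc.over/100 synthetic cases per original minority case
        n_new = n_minority * (perc_over // 100)
    else:
        # a single synthetic case for a perc.over/100 fraction of the minority
        n_new = int(n_minority * perc_over / 100)
    # perc.under percent of the *newly generated* cases, drawn from the majority
    n_majority = int(n_new * perc_under / 100)
    return n_new, n_majority

# 50 rare cases with perc.over = 600, perc.under = 100 (as in the iris
# example below): 300 new rare cases, 300 majority cases kept.
print(smote_counts(50, 600, 100))  # (300, 300)
```

With these numbers the final data set has 50 + 300 = 350 minority cases and 300 majority cases, matching the class distribution shown in the example output below.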
The parameter k controls the way the new examples are created. For each currently existing minority class example, X new examples will be created (this is controlled by the parameter perc.over, as mentioned above). These examples are generated using the information from the k nearest neighbours of each example of the minority class; the parameter k sets how many of these neighbours are used.
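The generation step itself is a simple interpolation. The following is a minimal, language-agnostic sketch of that idea (not DMwR's actual code): a new case is placed at a random point on the segment between a minority example and one of its k nearest minority-class neighbours.

```python
import random

def synthesize(case, neighbours):
    """Build one synthetic example from a minority case.

    case: list of numeric feature values.
    neighbours: that case's k nearest minority-class neighbours.
    Illustrative sketch only; categorical attributes need separate handling.
    """
    nn = random.choice(neighbours)   # pick one of the k neighbours at random
    gap = random.random()            # interpolation factor in [0, 1)
    # new value lies on the segment between the case and its neighbour
    return [x + gap * (n - x) for x, n in zip(case, nn)]

random.seed(0)
minority_case = [5.0, 3.4]
k_neighbours = [[5.1, 3.5], [4.9, 3.1], [5.0, 3.6]]  # k = 3
print(synthesize(minority_case, k_neighbours))
```

Because every synthetic case lies between two genuine minority cases, the new examples stay inside the region of the feature space already occupied by that class.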
The function can also be used to obtain directly the classification model from the resulting balanced data set. This can be done by giving the name of the R function that implements the classifier in the parameter learner. You may also include other parameters that will be forwarded to this learning function. If the learner parameter is not NULL (the default), the return value of the function will be the learned model and not the balanced data set. The function that learns the model should take the formula describing the classification problem as its first parameter and the training set as its second.
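The dispatch just described amounts to a small conditional. Here is an illustrative sketch of it (the helper names are hypothetical stand-ins, not DMwR internals):

```python
def balance_with_smote(formula, data):
    # stand-in for the actual SMOTE balancing step
    return data

def smote(formula, data, learner=None, **learner_args):
    """Return the balanced data set, or a model trained on it."""
    balanced = balance_with_smote(formula, data)
    if learner is None:
        return balanced  # default: the new, balanced data set itself
    # otherwise call learner(formula, training_set, ...) --
    # formula first, training data second, extra arguments forwarded
    return learner(formula, balanced, **learner_args)

# With no learner the (stand-in) data comes back; with one, its model does:
model = smote("Species ~ .", [], learner=lambda f, d, se=1.0: ("model", se),
              se=0.5)
print(model)  # ('model', 0.5)
```

This mirrors the example below, where passing learner='rpartXse' together with se=0.5 returns a fitted classification tree instead of a data frame.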
Value

The function can return two different types of values depending on the value of the parameter learner. If this parameter is NULL (the default), the function returns a data frame with the new data set resulting from the application of the SMOTE algorithm. Otherwise, the function returns the classification model obtained by the learner specified in the parameter learner.
Author(s)

Luis Torgo ltorgo@dcc.fc.up.pt
References

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).
http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
Examples

library(DMwR)  # provides SMOTE() and rpartXse()

## A small example with a data set created artificially from the IRIS
## data
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(data$Species)
## now using SMOTE to create a more "balanced problem"
newData <- SMOTE(Species ~ ., data, perc.over = 600, perc.under = 100)
table(newData$Species)
## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
     main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[, 3]),
     main = "SMOTE'd Data")
## End(Not run)
## Now an example where we obtain a model with the "balanced" data
classTree <- SMOTE(Species ~ ., data, perc.over = 600, perc.under = 100,
                   learner = 'rpartXse', se = 0.5)
## check the resulting classification tree
classTree
## The tree with the unbalanced data set would be
rpartXse(Species ~ ., data, se = 0.5)
## Output of table(data$Species):
common   rare
   100     50

## Output of table(newData$Species):
common   rare
   300    350
## The tree learned on the SMOTEd data (classTree):
n= 650
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 650 300 rare (0.461538462 0.538461538)
2) Sepal.Length>=5.499355 296 20 common (0.932432432 0.067567568)
4) Sepal.Width< 3.801671 280 4 common (0.985714286 0.014285714)
8) Sepal.Width< 3.45 270 0 common (1.000000000 0.000000000) *
9) Sepal.Width>=3.45 10 4 common (0.600000000 0.400000000)
18) Sepal.Length>=6.45 6 0 common (1.000000000 0.000000000) *
19) Sepal.Length< 6.45 4 0 rare (0.000000000 1.000000000) *
5) Sepal.Width>=3.801671 16 0 rare (0.000000000 1.000000000) *
3) Sepal.Length< 5.499355 354 24 rare (0.067796610 0.932203390)
6) Sepal.Width< 2.710803 25 3 common (0.880000000 0.120000000)
12) Sepal.Length>=4.7 22 0 common (1.000000000 0.000000000) *
13) Sepal.Length< 4.7 3 0 rare (0.000000000 1.000000000) *
7) Sepal.Width>=2.710803 329 2 rare (0.006079027 0.993920973)
14) Sepal.Width< 3.000371 32 2 rare (0.062500000 0.937500000)
28) Sepal.Length>=5.2 2 0 common (1.000000000 0.000000000) *
29) Sepal.Length< 5.2 30 0 rare (0.000000000 1.000000000) *
15) Sepal.Width>=3.000371 297 0 rare (0.000000000 1.000000000) *
## The tree learned on the original, unbalanced data:
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 50 common (0.66666667 0.33333333)
2) Sepal.Length>=5.45 98 5 common (0.94897959 0.05102041)
4) Sepal.Width< 3.45 90 0 common (1.00000000 0.00000000) *
5) Sepal.Width>=3.45 8 3 rare (0.37500000 0.62500000)
10) Sepal.Length>=6.5 3 0 common (1.00000000 0.00000000) *
11) Sepal.Length< 6.5 5 0 rare (0.00000000 1.00000000) *
3) Sepal.Length< 5.45 52 7 rare (0.13461538 0.86538462)
6) Sepal.Width< 2.8 7 1 common (0.85714286 0.14285714) *
7) Sepal.Width>=2.8 45 1 rare (0.02222222 0.97777778) *