Description
This function handles unbalanced classification problems using the SMOTE method. Namely, it can generate a new "SMOTEd" data set that addresses the class unbalance problem. Alternatively, it can also run a classification algorithm on this new data set and return the resulting model.
Usage

SMOTE(form, data, perc.over = 200, k = 5, perc.under = 200,
      learner = NULL, ...)

Arguments
form
    A formula describing the prediction problem.

data
    A data frame containing the original (unbalanced) data set.

perc.over
    A number that controls how many extra cases from the minority class are generated (over-sampling).

k
    The number of nearest neighbours that are used to generate the new examples of the minority class.

perc.under
    A number that controls how many extra cases from the majority classes are selected for each case generated from the minority class (under-sampling).

learner
    Optionally, a string with the name of a function that implements a classification algorithm, to be applied to the resulting SMOTEd data set (defaults to NULL).

...
    In case you specify a learner (parameter learner), further arguments to be passed to that learner function.
Details

Unbalanced classification problems cause difficulties for many learning algorithms. These problems are characterized by the uneven proportion of cases that are available for each class of the problem.

SMOTE (Chawla et al., 2002) is a well-known algorithm for fighting this problem. The general idea of the method is to artificially generate new examples of the minority class using the nearest neighbours of the existing minority cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced data set.
The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively. perc.over will typically be a number above 100. With this type of value, for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100, then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class in the original data set. The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases. For instance, if 200 new examples were generated for the minority class, a value of perc.under of 100 will randomly select exactly 200 cases belonging to the majority classes from the original data set to include in the final data set. Values above 100 will select more examples from the majority classes.
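The sampling arithmetic described above can be sketched in a few lines. This is a hypothetical helper for illustration only (not part of the DMwR package, and its rounding behaviour is an assumption); it computes how many synthetic minority cases and how many majority cases the percentages imply:

```python
def smote_counts(n_minority, perc_over, perc_under):
    """Return (n_new_minority, n_majority_kept) implied by the percentages.

    Illustrative only: exact rounding in the real implementation may differ.
    """
    if perc_over >= 100:
        # perc.over/100 synthetic cases per original minority case
        n_new = n_minority * (perc_over // 100)
    else:
        # a single synthetic case for a perc.over/100 fraction of the minority
        n_new = int(n_minority * perc_over / 100)
    # perc.under percent of the *newly generated* cases, drawn from the majority
    n_majority = int(n_new * perc_under / 100)
    return n_new, n_majority

# 50 rare cases with perc.over = 600, perc.under = 100 (as in the iris
# example below): 300 new rare cases, 300 majority cases kept.
print(smote_counts(50, 600, 100))  # (300, 300)
```

With these numbers the final data set has 50 + 300 = 350 minority cases and 300 majority cases, matching the class distribution shown in the example output below.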
The parameter k controls the way the new examples are created. For each currently existing minority class example, X new examples will be created (this is controlled by the parameter perc.over, as mentioned above). These examples are generated using the information from the k nearest neighbours of each example of the minority class; the parameter k sets how many of these neighbours are used.
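The generation step itself is a simple interpolation. The following is a minimal, language-agnostic sketch of that idea (not DMwR's actual code): a new case is placed at a random point on the segment between a minority example and one of its k nearest minority-class neighbours.

```python
import random

def synthesize(case, neighbours):
    """Build one synthetic example from a minority case.

    case: list of numeric feature values.
    neighbours: that case's k nearest minority-class neighbours.
    Illustrative sketch only; categorical attributes need separate handling.
    """
    nn = random.choice(neighbours)   # pick one of the k neighbours at random
    gap = random.random()            # interpolation factor in [0, 1)
    # new value lies on the segment between the case and its neighbour
    return [x + gap * (n - x) for x, n in zip(case, nn)]

random.seed(0)
minority_case = [5.0, 3.4]
k_neighbours = [[5.1, 3.5], [4.9, 3.1], [5.0, 3.6]]  # k = 3
print(synthesize(minority_case, k_neighbours))
```

Because every synthetic case lies between two genuine minority cases, the new examples stay inside the region of the feature space already occupied by that class.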
The function can also be used to obtain directly the classification model from the resulting balanced data set. This can be done by giving the name of the R function that implements the classifier in the parameter learner. You may also include other parameters that will be forwarded to this learning function. If the learner parameter is not NULL (the default), the return value of the function will be the learned model and not the balanced data set. The function that learns the model should take the formula describing the classification problem as its first parameter and the training set as its second.
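The dispatch just described amounts to a small conditional. Here is an illustrative sketch of it (the helper names are hypothetical stand-ins, not DMwR internals):

```python
def balance_with_smote(formula, data):
    # stand-in for the actual SMOTE balancing step
    return data

def smote(formula, data, learner=None, **learner_args):
    """Return the balanced data set, or a model trained on it."""
    balanced = balance_with_smote(formula, data)
    if learner is None:
        return balanced  # default: the new, balanced data set itself
    # otherwise call learner(formula, training_set, ...) --
    # formula first, training data second, extra arguments forwarded
    return learner(formula, balanced, **learner_args)

# With no learner the (stand-in) data comes back; with one, its model does:
model = smote("Species ~ .", [], learner=lambda f, d, se=1.0: ("model", se),
              se=0.5)
print(model)  # ('model', 0.5)
```

This mirrors the example below, where passing learner='rpartXse' together with se=0.5 returns a fitted classification tree instead of a data frame.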
Value

The function can return two different types of values depending on the value of the parameter learner. If this parameter is NULL (the default), the function returns a data frame with the new data set resulting from the application of the SMOTE algorithm. Otherwise, the function returns the classification model obtained by the learner specified in the parameter learner.
Author(s)

Luis Torgo ltorgo@dcc.fc.up.pt
References

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).
http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
Examples

library(DMwR)  # provides SMOTE() and rpartXse()

## A small example with a data set created artificially from the IRIS
## data
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(data$Species)
## now using SMOTE to create a more "balanced problem"
newData <- SMOTE(Species ~ ., data, perc.over = 600, perc.under = 100)
table(newData$Species)
## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
     main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[, 3]),
     main = "SMOTE'd Data")
## End(Not run)
## Now an example where we obtain a model with the "balanced" data
classTree <- SMOTE(Species ~ ., data, perc.over = 600, perc.under = 100,
                   learner = 'rpartXse', se = 0.5)
## check the resulting classification tree
classTree
## The tree with the unbalanced data set would be
rpartXse(Species ~ ., data, se = 0.5)
## Output of table(data$Species):
common   rare
   100     50

## Output of table(newData$Species):
common   rare
   300    350
## The tree learned on the SMOTEd data (classTree):
n= 650
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 650 300 rare (0.461538462 0.538461538)
2) Sepal.Length>=5.499355 296 20 common (0.932432432 0.067567568)
4) Sepal.Width< 3.801671 280 4 common (0.985714286 0.014285714)
8) Sepal.Width< 3.45 270 0 common (1.000000000 0.000000000) *
9) Sepal.Width>=3.45 10 4 common (0.600000000 0.400000000)
18) Sepal.Length>=6.45 6 0 common (1.000000000 0.000000000) *
19) Sepal.Length< 6.45 4 0 rare (0.000000000 1.000000000) *
5) Sepal.Width>=3.801671 16 0 rare (0.000000000 1.000000000) *
3) Sepal.Length< 5.499355 354 24 rare (0.067796610 0.932203390)
6) Sepal.Width< 2.710803 25 3 common (0.880000000 0.120000000)
12) Sepal.Length>=4.7 22 0 common (1.000000000 0.000000000) *
13) Sepal.Length< 4.7 3 0 rare (0.000000000 1.000000000) *
7) Sepal.Width>=2.710803 329 2 rare (0.006079027 0.993920973)
14) Sepal.Width< 3.000371 32 2 rare (0.062500000 0.937500000)
28) Sepal.Length>=5.2 2 0 common (1.000000000 0.000000000) *
29) Sepal.Length< 5.2 30 0 rare (0.000000000 1.000000000) *
15) Sepal.Width>=3.000371 297 0 rare (0.000000000 1.000000000) *
## The tree learned on the original, unbalanced data:
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 50 common (0.66666667 0.33333333)
2) Sepal.Length>=5.45 98 5 common (0.94897959 0.05102041)
4) Sepal.Width< 3.45 90 0 common (1.00000000 0.00000000) *
5) Sepal.Width>=3.45 8 3 rare (0.37500000 0.62500000)
10) Sepal.Length>=6.5 3 0 common (1.00000000 0.00000000) *
11) Sepal.Length< 6.5 5 0 rare (0.00000000 1.00000000) *
3) Sepal.Length< 5.45 52 7 rare (0.13461538 0.86538462)
6) Sepal.Width< 2.8 7 1 common (0.85714286 0.14285714) *
7) Sepal.Width>=2.8 45 1 rare (0.02222222 0.97777778) *