This function applies the Condensed Nearest Neighbors (CNN) strategy for imbalanced multiclass problems. It constructs a subset of examples which are able to correctly classify the original data set using a one nearest neighbor rule.
A formula describing the prediction problem.
A data frame containing the original imbalanced data set.
A character string indicating which distance metric to use when determining the k nearest neighbors. See the details. Defaults to "Euclidean".
A number indicating the value of p if the "p-norm" distance is chosen. It only needs to be defined if "p-norm" is chosen in the dist argument.
A character vector indicating which classes are the most important. Defaults to "smaller", in which case the important classes are determined automatically as the smaller classes, i.e. those with a frequency below #examples/#classes.
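The "smaller" rule can be illustrated in base R. This is only a minimal sketch of the threshold described above (class frequency below #examples/#classes); the helper name smaller_classes is hypothetical and is not part of UBL.

```r
# Hypothetical helper (not part of UBL): return the classes whose frequency
# is below the average class size, i.e. #examples / #classes.
smaller_classes <- function(y) {
  counts <- table(factor(y))               # factor() drops unused levels
  threshold <- length(y) / length(counts)  # #examples / #classes
  names(counts)[counts < threshold]
}

# Toy target with 10 examples and 3 classes: threshold is 10 / 3.
y <- factor(c(rep("a", 6), rep("b", 3), rep("c", 1)))
smaller_classes(y)
```

In this toy target, classes "b" (3 examples) and "c" (1 example) fall below the threshold of 10/3 and would be treated as the important ones.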
dist allows the user to define the distance metric to be used in the neighbors computation. Although the default is the Euclidean distance, other metrics are available. This allows the computation of distances in data sets with, for instance, both nominal and numeric features. The options available for the distance functions are as follows:
- for data with only numeric features: "Manhattan", "Euclidean", "Canberra", "Chebyshev", "p-norm";
- for data with only nominal features: "Overlap";
- for dealing with both nominal and numeric features: "HEOM", "HVDM".
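To make the mixed-feature case concrete, the following base R sketch computes HEOM between two rows, assuming its usual definition: an overlap distance (0 or 1) for nominal features, a range-normalised absolute difference for numeric features, a distance of 1 when a value is missing, and the square root of the sum of squared per-feature distances. The helper heom and the toy data are hypothetical; the implementation in UBL may differ in details.

```r
# Hypothetical helper (not UBL's implementation): HEOM between two
# one-row data frames x and y, using `data` to obtain the numeric ranges.
heom <- function(x, y, data) {
  per_feature <- vapply(names(data), function(col) {
    a <- x[[col]]
    b <- y[[col]]
    if (is.na(a) || is.na(b)) return(1)        # missing value: maximal distance
    if (is.numeric(data[[col]])) {
      rng <- diff(range(data[[col]], na.rm = TRUE))
      if (rng == 0) return(0)                  # constant feature contributes nothing
      abs(a - b) / rng                         # range-normalised difference
    } else {
      as.numeric(as.character(a) != as.character(b))  # overlap: 0 if equal, 1 otherwise
    }
  }, numeric(1))
  sqrt(sum(per_feature^2))
}

toy <- data.frame(size = c(1, 3, 5),
                  colour = factor(c("red", "blue", "red")))
heom(toy[1, ], toy[2, ], toy)  # numeric part 2/4, nominal part 1
heom(toy[1, ], toy[3, ], toy)  # numeric part 4/4, nominal part 0
```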
When the "p-norm" is selected for the dist parameter, it is also necessary to define the value of parameter p. The value of parameter p sets which "p-norm" will be used. For instance, if p is set to 1, the "1-norm" (or Manhattan distance) is used, and if p is set to 2, the "2-norm" (or Euclidean distance) is applied.
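The relation between p and the named distances can be checked with a short base R sketch. The helper p_norm is hypothetical and only illustrates the formula; it is not how UBL computes distances internally.

```r
# Hypothetical helper: p-norm distance between two numeric vectors.
p_norm <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

x <- c(1, 2, 3)
y <- c(4, 0, 3)

# p = 1 matches the Manhattan distance, p = 2 the Euclidean distance.
all.equal(p_norm(x, y, 1), as.numeric(dist(rbind(x, y), method = "manhattan")))
all.equal(p_norm(x, y, 2), as.numeric(dist(rbind(x, y), method = "euclidean")))
```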
For more details regarding the distance functions implemented in the UBL package, please see the package vignettes.
This function applies the Condensed Nearest Neighbors (CNN) strategy for dealing with imbalanced multiclass problems. The classes selected in Cl are considered the most important ones and all the others are under-sampled. The CNN under-sampling strategy starts with a set composed of all the examples from the important classes and one randomly selected example from the other classes. Then, examples from the other classes are added to the set until it forms a subset of examples which correctly classifies the original data set using a one nearest neighbor rule.
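The procedure described above can be sketched in base R. This is a minimal illustration, not the UBL implementation: it assumes numeric features only, Euclidean distance, a single random seed example drawn from the pool of unimportant examples, and repeated passes that add every example misclassified by the 1-NN rule until a full pass adds nothing. The function cnn_sketch and its arguments are hypothetical names.

```r
# Hypothetical sketch of CNN under-sampling (not UBL's code).
# X: numeric features, y: class labels, important: classes kept in full.
# Returns the row indices of the condensed subset.
cnn_sketch <- function(X, y, important) {
  X <- as.matrix(X)
  y <- as.character(y)
  others <- which(!(y %in% important))
  if (length(others) == 0) return(seq_along(y))
  # Start with every important-class example plus one random other example.
  store <- c(which(y %in% important), others[sample.int(length(others), 1)])
  repeat {
    added <- FALSE
    for (i in setdiff(others, store)) {
      # Squared Euclidean distance from example i to every stored example.
      d <- colSums((t(X[store, , drop = FALSE]) - X[i, ])^2)
      if (y[store[which.min(d)]] != y[i]) {
        store <- c(store, i)  # misclassified by the 1-NN rule: keep it
        added <- TRUE
      }
    }
    if (!added) break         # a full pass without additions: done
  }
  sort(store)
}

set.seed(1)
keep <- cnn_sketch(iris[, 1:4], iris$Species, important = "setosa")
table(iris$Species[keep])
```

Every "setosa" example is retained, while the other two species are reduced to the examples needed for the one nearest neighbor rule to classify the discarded ones correctly.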
The function returns a list containing a data frame with the new data set resulting from the application of the CNN strategy, a character vector with the important classes, and a character vector with the unimportant classes.
Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14, 515-516.
library(DMwR)
data(algae)
# Keep only the complete cases
clean.algae <- algae[complete.cases(algae), ]
# Treat three seasons as important; the remaining class is under-sampled
myCNN <- CNNClassif(season~., clean.algae,
                    Cl = c("summer", "spring", "winter"), dist = "HEOM")
# Let the function determine the important (smaller) classes
CNN1 <- CNNClassif(season~., clean.algae, Cl = "smaller", dist = "HEOM")
CNN2 <- CNNClassif(season~., clean.algae, Cl = "summer", dist = "HVDM")
# The first element of the returned list is the new data set
summary(myCNN[[1]]$season)
summary(CNN1[[1]]$season)
summary(CNN2[[1]]$season)

library(MASS)
data(cats)
CNN.catsF <- CNNClassif(Sex~., cats, Cl = "F")
CNN.cats <- CNNClassif(Sex~., cats, Cl = "smaller")
Loading required package: MBA
Loading required package: gstat
Loading required package: automap
Loading required package: sp
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: lattice
Loading required package: grid
autumn spring summer winter
    35     48     43     57
autumn spring summer winter
    36     46     43     57
autumn spring summer winter
    35     46     43     57