
bimba

Overview

The bimba package implements a variety of sampling algorithms to reduce the class imbalance present in many real-world data sets. Although multi-class imbalanced data sets are common, bimba has been designed to work only with two-class imbalanced data sets.

bimba's main goal is flexibility, as it is intended primarily for research. In addition, since many over-sampling and under-sampling algorithms share a similar structure and several common hyper-parameters, care was taken to keep the interfaces of the different functions consistent, making bimba intuitive to use.
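A minimal sketch of that shared interface, reusing the calls that appear in the Quick Tour below (and assuming RUS is exported as a standalone function, like SMOTE):

library(bimba)
sample_data <- generate_imbalanced_data(num_examples = 200L,
                                        imbalance_ratio = 10,
                                        noise_maj = 0,
                                        noise_min = 0.04,
                                        seed = 42)

# Same convention for over- and under-sampling: the data set first,
# then the hyper-parameters controlling the amount of sampling.
# SMOTE grows the minority class to perc_min percent of the data;
# RUS (assumed export) shrinks the majority class to perc_maj percent.
over_sampled  <- SMOTE(sample_data, perc_min = 50, k = 5)
under_sampled <- RUS(sample_data, perc_maj = 50)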

Installation

bimba is under active development and not yet available on CRAN. The development version can be installed as follows:

# install.packages("devtools")
devtools::install_github("RomeroBarata/bimba")

Quick Tour

## Some nice setup for the graphics
library(ggplot2)
clean_theme <- theme_minimal() + theme(axis.title.x = element_blank(),
                                       axis.title.y = element_blank())

## bimba in action
library(bimba)
sample_data <- generate_imbalanced_data(num_examples = 200L,
                                        imbalance_ratio = 10,
                                        noise_maj = 0,
                                        noise_min = 0.04,
                                        seed = 42)
ggplot(sample_data, aes(x = V1, y = V2, colour = target)) + 
  geom_point(size = 2) + clean_theme

# Balance the distribution of examples using SMOTE
smoted_data <- SMOTE(sample_data, perc_min = 50, k = 5)
# Sanity check. Did it really balance?
table(smoted_data$target)
ggplot(smoted_data, aes(x = V1, y = V2, colour = target)) + 
  geom_point(size = 2) + clean_theme

# SMOTE is not robust to noisy minority examples. Let's add a cleaning step
# to the minority class before applying SMOTE.
ssed_data <- sampling_sequence(sample_data, algorithms = c("NRAS", "SMOTE"))
ggplot(ssed_data, aes(x = V1, y = V2, colour = target)) + 
  geom_point(size = 2) + clean_theme

# Clean using ENN, double the size of the minority class using SMOTE, and 
# balance the distribution using RUS.
algorithms <- c("ENN", "SMOTE", "RUS")
parameters <- list(
  ENN = list(remove_class = "Minority", k = 3),
  SMOTE = list(perc_over = 100, k = 5),
  RUS = list(perc_maj = 50)
)

ssed2_data <- sampling_sequence(sample_data, algorithms = algorithms, 
                                parameters = parameters)
ggplot(ssed2_data, aes(x = V1, y = V2, colour = target)) + 
  geom_point(size = 2) + clean_theme
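As with the earlier sanity check, the class distribution produced by the sequence can be verified directly:

table(ssed2_data$target)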

Available Algorithms

Many over-sampling, under-sampling, and hybrid algorithms are available. In addition, the algorithms can be easily chained using the sampling_sequence function. A complete list of the algorithms, broken down by their type, is available below.
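When the parameters argument is omitted, as in the Quick Tour's first call to sampling_sequence, each algorithm in the chain presumably falls back on its default hyper-parameters; a minimal sketch:

# Chain three algorithms, relying on each one's defaults
# (the variable name ssed3_data is illustrative).
ssed3_data <- sampling_sequence(sample_data,
                                algorithms = c("ENN", "SMOTE", "RUS"))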

Over-Sampling

- ADASYN [7]
- Borderline-SMOTE [6]
- MWMOTE [10]
- RWO (Random Walk Over-Sampling) [11]
- Safe-Level-SMOTE [8]
- SMOTE [5]

Under-Sampling

- ENN (Edited Nearest Neighbours) [1]
- NCL (Neighbourhood Cleaning Rule) [4]
- OSS (One-Sided Selection) [3]
- RUS (Random Under-Sampling)
- SBC (cluster-based under-sampling) [9]
- Tomek Links [2]

Cleaning

- NRAS (cleaning step only) [12]

To make NRAS [12] more general, its cleaning step has been decoupled from the over-sampling step.
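This means the cleaning step can also be run on its own, e.g. as a one-element sequence; a sketch, assuming (as the Quick Tour suggests) that "NRAS" in a sequence denotes the decoupled cleaning step:

# Apply only the NRAS cleaning step; no synthetic examples are generated.
cleaned_data <- sampling_sequence(sample_data, algorithms = c("NRAS"))
table(cleaned_data$target)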

Misc

- generate_imbalanced_data
- sampling_sequence

Related Packages

Although several other packages implement sampling algorithms, they differ from bimba in a few ways. Below is a non-exhaustive list of related packages, broken down by language.

Python

- imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn)

R

- ROSE
- smotefamily
- unbalanced

References

[1] Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2(3), 408-421.

[2] Tomek, I. (1976). An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(6), 448-452.

[3] Kubat, M., & Matwin, S. (1997, July). Addressing the curse of imbalanced training sets: one-sided selection. In ICML (Vol. 97, pp. 179-186).

[4] Laurikkala, J. (2001, July). Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe (pp. 63-66). Springer, Berlin, Heidelberg.

[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

[6] Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.

[7] He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.

[8] Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Advances in Knowledge Discovery and Data Mining (pp. 475-482). Springer, Berlin, Heidelberg.

[9] Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718-5727.

[10] Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.

[11] Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 20, 99-116.

[12] Rivera, W. A. (2017). Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets. Information Sciences, 408, 146-161.


