The bimba package implements a variety of sampling algorithms to reduce the imbalance present in many real-world data sets. Although multi-class imbalanced data sets are common, bimba has been designed to work only with two-class imbalanced data sets.
bimba's main goal is to be flexible, as its main use is for research purposes. In addition, because many over-sampling and under-sampling algorithms share a similar structure and several hyperparameters, a lot of care was taken to keep the different functions consistent, making bimba intuitive to use.
bimba is under active development and not yet available on CRAN. The
development version can be installed as follows:
# install.packages("devtools")
devtools::install_github("RomeroBarata/bimba")
## Some nice setup for the graphics
library(ggplot2)
clean_theme <- theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

## bimba in action
library(bimba)
sample_data <- generate_imbalanced_data(num_examples = 200L,
                                        imbalance_ratio = 10,
                                        noise_maj = 0,
                                        noise_min = 0.04,
                                        seed = 42)
ggplot(sample_data, aes(x = V1, y = V2, colour = target)) +
  geom_point(size = 2) +
  clean_theme

# Balance the distribution of examples using SMOTE
smoted_data <- SMOTE(sample_data, perc_min = 50, k = 5)
# Sanity check. Did it really balance?
table(smoted_data$target)
ggplot(smoted_data, aes(x = V1, y = V2, colour = target)) +
  geom_point(size = 2) +
  clean_theme

# SMOTE is not robust to noisy minority examples. Let's add a cleaning step
# to the minority class before using SMOTE.
ssed_data <- sampling_sequence(sample_data, algorithms = c("NRAS", "SMOTE"))
ggplot(ssed_data, aes(x = V1, y = V2, colour = target)) +
  geom_point(size = 2) +
  clean_theme

# Clean using ENN, double the size of the minority class using SMOTE, and
# balance the distribution using RUS.
algorithms <- c("ENN", "SMOTE", "RUS")
parameters <- list(
  ENN = list(remove_class = "Minority", k = 3),
  SMOTE = list(perc_over = 100, k = 5),
  RUS = list(perc_maj = 50)
)
ssed2_data <- sampling_sequence(sample_data, algorithms = algorithms,
                                parameters = parameters)
ggplot(ssed2_data, aes(x = V1, y = V2, colour = target)) +
  geom_point(size = 2) +
  clean_theme
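For readers unfamiliar with what SMOTE does to the minority class, the following is a minimal base-R sketch of the interpolation idea described by Chawla et al.: each synthetic example is placed at a random point on the segment between a minority example and one of its k nearest minority neighbours. It is not bimba's implementation. The helper name smote_sketch and its arguments are hypothetical, and it simplifies the original algorithm by picking base examples at random rather than generating a fixed number of synthetic examples per minority example.

```r
# Illustrative sketch only: smote_sketch is a hypothetical helper, not part of
# bimba, and it assumes all columns of `minority` are numeric.
smote_sketch <- function(minority, n_new, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.matrix(minority)
  n <- nrow(x)
  stopifnot(n >= 2)                      # need at least one neighbour
  k <- min(k, n - 1L)                    # cannot have more neighbours than points
  d <- as.matrix(dist(x))                # pairwise Euclidean distances
  diag(d) <- Inf                         # a point is not its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                      dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1L)            # pick a minority example at random
    nn <- order(d[base, ])[seq_len(k)]   # its k nearest minority neighbours
    neighbour <- nn[sample.int(k, 1L)]   # pick one of those neighbours
    gap <- runif(1)                      # random position along the segment
    synthetic[i, ] <- x[base, ] + gap * (x[neighbour, ] - x[base, ])
  }
  as.data.frame(synthetic)
}

# Toy usage: 10 minority examples, 20 synthetic ones.
set.seed(1)
toy_minority <- data.frame(V1 = rnorm(10), V2 = rnorm(10))
new_points <- smote_sketch(toy_minority, n_new = 20, k = 3, seed = 42)
nrow(new_points)
```

Because every synthetic point is a convex combination of two existing minority examples, a noisy minority example far from its class pulls new points towards it, which is why a cleaning step before SMOTE can help.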
Many over-sampling, under-sampling, and hybrid algorithms are available. In addition, the algorithms can be easily chained using the sampling_sequence function. A complete list of the algorithms, broken down by type, is available below.
Over-sampling:
ADASYN: Adaptive Synthetic Sampling
BDLSMOTE: borderline-SMOTE1 and borderline-SMOTE2
MWMOTE: Majority Weighted Minority Over-Sampling TEchnique
ROS: Random Over-Sampling
RWO: Random Walk Over-Sampling
SLSMOTE: Safe-Level-SMOTE
SMOTE: Synthetic Minority Over-Sampling TEchnique

Under-sampling:
ENN: Edited Nearest Neighbours
KMUS: k-Means Under-Sampling
NCL: Neighbourhood Cleaning Rule
OSS: One-Sided Selection
RUS: Random Under-Sampling
SBC: Under-Sampling Based on Clustering
TL: Tomek Links

Hybrid:
NRAS: Noise Reduction A Priori Synthetic Over-Sampling
To make NRAS more general, its cleaning step has been decoupled from the over-sampling step.
sampling_sequence: Convenience function to chain sampling algorithms together
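As a rough illustration of what chaining sampling steps means, the sketch below folds a data set through a list of functions with base R's Reduce, so each step receives the output of the previous one. This is only a conceptual sketch, not how sampling_sequence is implemented. The helpers chain_sketch, ros_double_sketch and rus_sketch are hypothetical, are not part of bimba, and assume a two-class data frame with a factor column named target.

```r
# Hypothetical helper: apply a list of sampling steps in order.
chain_sketch <- function(data, steps) {
  Reduce(function(current, step) step(current), steps, data)
}

# Toy over-sampling step: double the minority class by duplicating
# randomly chosen minority rows (with replacement).
ros_double_sketch <- function(data, target = "target") {
  counts <- table(data[[target]])
  minority <- names(counts)[which.min(counts)]
  min_rows <- which(data[[target]] == minority)
  extra <- min_rows[sample.int(length(min_rows), length(min_rows),
                               replace = TRUE)]
  rbind(data, data[extra, , drop = FALSE])
}

# Toy under-sampling step: keep a random subset of the majority class
# so that both classes end up the same size.
rus_sketch <- function(data, target = "target") {
  counts <- table(data[[target]])
  minority <- names(counts)[which.min(counts)]
  is_min <- data[[target]] == minority
  maj_rows <- which(!is_min)
  keep <- maj_rows[sample.int(length(maj_rows), sum(is_min))]
  data[sort(c(which(is_min), keep)), , drop = FALSE]
}

# Toy data: 90 majority and 10 minority examples.
set.seed(42)
toy <- data.frame(
  V1 = rnorm(100),
  V2 = rnorm(100),
  target = factor(rep(c("majority", "minority"), times = c(90, 10)))
)

# Double the minority class, then under-sample the majority class.
balanced <- chain_sketch(toy, list(ros_double_sketch, rus_sketch))
table(balanced$target)
```

The order of the steps matters: running the under-sampling step first would shrink the majority class to 10 examples before the minority class is doubled, leaving the two classes unbalanced again.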
Although several other packages implement sampling algorithms, they differ from bimba in a few ways. Below is a non-exhaustive list of related packages, broken down by language.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2(3), 408-421.
Tomek, I. (1976). An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, (6), 448-452.
Kubat, M., & Matwin, S. (1997, July). Addressing the curse of imbalanced training sets: one-sided selection. In ICML (Vol. 97, pp. 179-186).
Laurikkala, J. (2001, July). Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe (pp. 63-66). Springer, Berlin, Heidelberg.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. Advances in Knowledge Discovery and Data Mining, 475-482.
Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718-5727.
Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 20, 99-116.
Rivera, W. A. (2017). Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets. Information Sciences, 408, 146-161.