SBC: The Under-Sampling Based on Clustering algorithm.


Description

SBC under-samples the input data using the Under-Sampling Based on Clustering algorithm.

Usage

SBC(data, perc_maj = 50, perc_under = NULL, k = 3, max_iter = 100L,
  nstart = 10L, classes = NULL)

Arguments

data

A data frame containing the predictors and the outcome. The predictors must be numeric, and the outcome must be a binary factor stored in the last column of data.
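
The requirements on data can be checked before calling the function. The following is a minimal sketch in base R; check_sbc_input and toy are hypothetical names introduced for illustration and are not part of the package.

```r
# Hypothetical helper (not part of the package): verifies that a data frame
# has numeric predictors and a two-level factor outcome in its last column.
check_sbc_input <- function(data) {
  stopifnot(is.data.frame(data))
  outcome <- data[[ncol(data)]]
  predictors <- data[-ncol(data)]
  stopifnot(
    is.factor(outcome),
    nlevels(outcome) == 2L,
    all(vapply(predictors, is.numeric, logical(1)))
  )
  invisible(TRUE)
}

# Toy imbalanced data set: 90 "neg" rows and 10 "pos" rows, outcome last.
set.seed(1)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y  = factor(rep(c("neg", "pos"), times = c(90, 10)))
)
check_sbc_input(toy)
```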

perc_maj

The desired size of the majority class, as a percentage of the whole data set after under-sampling. For instance, perc_maj = 50 returns a balanced version of the input data set. perc_maj is ignored if perc_under is specified.

perc_under

Percentage of majority class examples to select. If specified, perc_maj is ignored.
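
To make the arithmetic behind perc_maj and perc_under concrete, here is a small sketch of how the two arguments could translate into a number of majority examples to keep. It assumes that the minority class is kept whole and that results are rounded to the nearest integer; both are assumptions, and n_majority_to_keep is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper (not part of the package).
# Assumption: every minority example is retained, so only the majority
# class shrinks.
n_majority_to_keep <- function(n_maj, n_min, perc_maj = 50, perc_under = NULL) {
  if (!is.null(perc_under)) {
    # perc_under takes precedence: keep a fixed share of the majority class.
    return(round(n_maj * perc_under / 100))
  }
  # Solve m / (m + n_min) = perc_maj / 100 for m.
  round(n_min * perc_maj / (100 - perc_maj))
}

n_majority_to_keep(n_maj = 900, n_min = 100)                   # 100
n_majority_to_keep(n_maj = 900, n_min = 100, perc_maj = 75)    # 300
n_majority_to_keep(n_maj = 900, n_min = 100, perc_under = 20)  # 180
```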

k

Number of clusters for the k-Means algorithm.

max_iter

Maximum number of iterations of the k-Means algorithm.

nstart

Number of random restarts of the k-Means algorithm.

classes

A named vector identifying the majority and the minority classes. The names must be "Majority" and "Minority". This argument is only useful if the function is called inside another sampling function.

Details

Under-Sampling Based on Clustering clusters the input data into k clusters and randomly selects a number of majority examples from each cluster based on the imbalance ratio of the cluster.
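
The procedure can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: sbc_sketch and its n_keep and minority arguments are hypothetical, the allocation rule (majority examples shared across clusters in proportion to each cluster's majority-to-minority ratio, with an empty minority count treated as 1) is one reading of Yen and Lee's description, and rounding means the total may differ slightly from n_keep.

```r
# Illustrative sketch only; sbc_sketch is a hypothetical name, not SBC().
sbc_sketch <- function(data, n_keep, k = 3, max_iter = 100L, nstart = 10L,
                       minority = NULL) {
  p <- ncol(data)
  y <- data[[p]]
  # Assumption: the minority class is the less frequent level.
  if (is.null(minority)) minority <- names(which.min(table(y)))
  is_min <- y == minority

  # Cluster all examples on the numeric predictors.
  cl <- stats::kmeans(data[-p], centers = k, iter.max = max_iter,
                      nstart = nstart)$cluster

  # Per-cluster class counts; a cluster with no minority example counts as 1
  # so that the ratio stays finite (assumption).
  n_maj <- tabulate(cl[!is_min], nbins = k)
  n_min <- pmax(tabulate(cl[is_min], nbins = k), 1L)
  ratio <- n_maj / n_min

  # Share the n_keep majority examples across clusters by their ratio.
  per_cluster <- round(n_keep * ratio / sum(ratio))

  keep <- unlist(lapply(seq_len(k), function(i) {
    idx <- which(cl == i & !is_min)
    if (per_cluster[i] == 0 || length(idx) == 0L) return(integer(0))
    # Sample with replacement; indexing through sample.int() avoids
    # sample()'s special handling of a length-one vector.
    idx[sample.int(length(idx), per_cluster[i], replace = TRUE)]
  }))

  # Selected majority rows followed by all minority rows.
  rbind(data[keep, , drop = FALSE], data[is_min, , drop = FALSE])
}

set.seed(42)
toy <- data.frame(
  x1 = c(rnorm(90), rnorm(10, mean = 3)),
  x2 = c(rnorm(90), rnorm(10, mean = 3)),
  y  = factor(rep(c("neg", "pos"), times = c(90, 10)))
)
balanced <- sbc_sketch(toy, n_keep = 10)
table(balanced$y)
```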

The authors did not specify whether majority examples should be sampled with or without replacement. In many cases, however, the algorithm tries to sample more examples than are available in a cluster, so sampling is always performed with replacement here.
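
The short base R example below illustrates why sampling without replacement can fail in this situation. The variable names are illustrative only, and the numbers are made up for the demonstration.

```r
cluster_majority <- 1:4  # row indices of four majority examples in one cluster
requested <- 6           # more examples than the cluster holds

# Without replacement the request cannot be met and sample.int() errors.
res <- try(sample.int(length(cluster_majority), requested), silent = TRUE)
inherits(res, "try-error")  # TRUE

# With replacement some rows are simply drawn more than once.
picked <- cluster_majority[
  sample.int(length(cluster_majority), requested, replace = TRUE)
]
length(picked)  # 6
```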

Value

A data frame containing a more balanced version of the input data after under-sampling with the Under-Sampling Based on Clustering algorithm.

References

Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718-5727.


RomeroBarata/bimba documentation built on May 17, 2019, 8:03 a.m.