SBC: Under-Sampling Based on Clustering Algorithm

SBC (R Documentation)

Under-Sampling Based on Clustering Algorithm

Description

Returns a balanced dataset produced by the under-sampling based on clustering (SBC) algorithm.

Usage

SBC(data, outcome, perc_min = 100, k = 3, iter_max = 100, nstart = 1, ...)

Arguments

data

A dataset containing the predictors and the outcome. The predictors can be continuous (numeric or integer) or categorical (character or factor). The outcome must be binary.

outcome

The column number or the name of the outcome variable in the dataset.

perc_min

The desired number of majority samples in the new dataset, expressed as a percentage of the number of minority samples. The default is 100.

k

The number of clusters for the clustering algorithm. The default is 3.

iter_max

The maximum number of iterations of the clustering algorithm. The default is 100.

nstart

The number of random initial sets to try. Used only for k-means and k-prototypes. The default is 1.

...

Not used.
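As a rough illustration of how perc_min translates into class sizes, the sketch below computes the number of majority samples one would expect in the result. The helper target_majority is hypothetical and not part of the package; it assumes that all minority samples are kept and that the target is rounded to the nearest integer, neither of which is stated above.

```r
# Hypothetical helper, not part of the package.
# Assumes every minority sample is kept and the target is rounded.
target_majority <- function(n_minority, perc_min = 100) {
  round(n_minority * perc_min / 100)
}

target_majority(500)       # perc_min = 100: as many majority as minority samples
target_majority(500, 200)  # perc_min = 200: twice as many majority samples
```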

Details

The under-sampling based on clustering algorithm first groups all samples into k clusters. Within each cluster, it then randomly selects majority samples; the number selected depends on the ratio of majority samples to minority samples in that cluster.

If more majority samples are needed than a cluster contains, sampling with replacement is used. Otherwise, sampling without replacement is used.
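The selection step can be sketched in base R as follows. This is an illustration only, not the package's code: the function sbc_pick and the toy data are hypothetical, the proportional allocation follows the scheme described by Yen and Lee (2009), and both the rounding and the guard for clusters without minority samples are assumptions.

```r
# Hypothetical sketch of the selection step; cluster labels are taken as given.
sbc_pick <- function(cluster, is_major, perc_min = 100) {
  ids   <- sort(unique(cluster))
  n_maj <- sapply(ids, function(k) sum(is_major[cluster == k]))
  n_min <- sapply(ids, function(k) sum(!is_major[cluster == k]))

  # Majority-to-minority ratio per cluster; a cluster with no minority
  # samples is treated as having one (assumption).
  ratio <- n_maj / pmax(n_min, 1)

  # Total majority samples wanted, shared out in proportion to the ratios.
  n_target <- sum(!is_major) * perc_min / 100
  n_pick   <- round(n_target * ratio / sum(ratio))

  picked <- unlist(lapply(seq_along(ids), function(i) {
    idx <- which(cluster == ids[i] & is_major)
    if (length(idx) == 0) return(integer(0))
    # Sample with replacement only when the cluster is too small.
    idx[sample.int(length(idx), n_pick[i], replace = n_pick[i] > length(idx))]
  }))

  # Row indices of the balanced data: selected majority plus all minority.
  c(picked, which(!is_major))
}

# Toy data (made up): 11 majority and 4 minority samples in 3 clusters.
cluster  <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
is_major <- c(TRUE, TRUE, TRUE, TRUE, FALSE,
              TRUE, TRUE, FALSE, FALSE,
              TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

set.seed(1)
idx_100 <- sbc_pick(cluster, is_major, perc_min = 100)
idx_300 <- sbc_pick(cluster, is_major, perc_min = 300)
table(is_major[idx_100])
table(is_major[idx_300])
```

With perc_min = 300 the first cluster is asked for five majority samples but holds only four, so the sketch falls back to sampling with replacement there and the result contains repeated rows.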

For a dataset whose predictors are all continuous (numeric or integer), k-means is used for clustering. For a dataset whose predictors are all categorical (character or factor), k-modes is used. For a dataset with a mix of continuous and categorical predictors, k-prototypes is used.
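The rule above amounts to a check on column types. The sketch below shows one way to express it in base R; the function choose_clusterer is hypothetical, and which clustering functions the package calls internally is not stated here. Treating any other column type (for example logical) as an error is an assumption.

```r
# Hypothetical helper: name the clustering method implied by the predictor types.
choose_clusterer <- function(predictors) {
  is_num <- vapply(predictors, is.numeric, logical(1))  # numeric or integer
  is_cat <- vapply(predictors,
                   function(x) is.character(x) || is.factor(x),
                   logical(1))
  if (all(is_num)) {
    "k-means"
  } else if (all(is_cat)) {
    "k-modes"
  } else if (all(is_num | is_cat)) {
    "k-prototypes"
  } else {
    stop("unsupported predictor type")  # assumption: other types are rejected
  }
}

# Made-up predictors for illustration.
num_df <- data.frame(length = c(0.4, 0.5), rings = c(7L, 9L))
cat_df <- data.frame(job = c("admin", "services"), marital = factor(c("single", "married")))
mix_df <- cbind(num_df, cat_df)

choose_clusterer(num_df)
choose_clusterer(cat_df)
choose_clusterer(mix_df)
```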

Value

A new, balanced dataset.

References

Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718-5727.

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100-108.

Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD, 3(8), 34-39.

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283-304.

Examples

library(RSBID)

data(abalone)
data(bank)
# class distributions before balancing
table(abalone$Class)
table(bank$deposit)

# predictors are a mix of continuous and categorical
newdata1 <- SBC(bank, 'deposit')
table(newdata1$deposit)

# request twice as many majority samples as minority samples
newdata2 <- SBC(bank, 'deposit', perc_min = 200)
table(newdata2$deposit)

# predictors are all continuous
newdata3 <- SBC(abalone, 'Class')
table(newdata3$Class)

# predictors are all categorical
bank1 <- bank[, c(2, 3, 5, 11)]
newdata4 <- SBC(bank1, 'deposit')
table(newdata4$deposit)

dongyuanwu/RSBID documentation built on May 20, 2024, 7:53 a.m.