SBC | R Documentation |
Returns a balanced dataset by applying the under-sampling based on clustering (SBC) algorithm.
SBC(data, outcome, perc_min = 100, k = 3, iter_max = 100, nstart = 1, ...)
data |
A dataset containing the predictors and the outcome. The predictors
can be continuous (numeric or integer) or categorical (character or factor). |
outcome |
The column number or the name of the outcome variable in the dataset. |
perc_min |
The desired number of majority samples in the new dataset, expressed as a percentage of the number of minority samples. The default is 100. |
k |
The number of clusters for the clustering algorithm. The default is 3. |
iter_max |
The maximum number of iterations of the clustering algorithm. The default is 100. |
nstart |
The number of random initial sets to choose. Only
used for |
... |
Not used. |
The under-sampling based on clustering algorithm clusters all samples into k clusters. It then randomly selects majority samples from each cluster according to the ratio of the number of majority samples to the number of minority samples in that cluster.
If more majority samples are needed from a cluster than it contains, sampling with replacement is used. Otherwise, sampling without replacement is used.
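The steps above can be sketched in base R. This is an illustration only and not the package's source: the synthetic data, the variable names, the proportional quota rule (modelled on Yen and Lee, 2009), and the choice to count an empty minority group as 1 are all assumptions made for the example.

```r
# Illustrative sketch only; not the implementation behind SBC().
set.seed(42)

# Hypothetical imbalanced data: 200 majority rows and 20 minority rows.
x <- data.frame(
  x1 = c(rnorm(200, 0), rnorm(20, 2)),
  x2 = c(rnorm(200, 0), rnorm(20, 2))
)
y <- factor(c(rep("maj", 200), rep("min", 20)))

k <- 3
perc_min <- 100

# Step 1: cluster all samples, ignoring the class label.
cl <- kmeans(x, centers = k, iter.max = 100, nstart = 1)$cluster

# Step 2: ratio of majority to minority counts in each cluster.
n_maj <- tabulate(cl[y == "maj"], nbins = k)
n_min <- tabulate(cl[y == "min"], nbins = k)
ratio <- n_maj / pmax(n_min, 1)  # assumption: an empty minority group counts as 1

# Step 3: split the majority target across clusters in proportion to the ratio.
# Rounding means the quotas may differ slightly from the target in total.
target <- round(perc_min / 100 * sum(y == "min"))
quota <- round(target * ratio / sum(ratio))

# Step 4: sample majority rows per cluster, with replacement only when
# the quota exceeds the number of majority rows available.
maj_idx <- unlist(lapply(seq_len(k), function(i) {
  pool <- which(cl == i & y == "maj")
  if (quota[i] == 0 || length(pool) == 0) return(integer(0))
  pool[sample.int(length(pool), quota[i], replace = quota[i] > length(pool))]
}))

# Combine the sampled majority rows with all minority rows.
balanced <- rbind(
  cbind(x[maj_idx, ], class = y[maj_idx]),
  cbind(x[y == "min", ], class = y[y == "min"])
)
table(balanced$class)
```

With perc_min = 100 the majority target equals the minority count; setting perc_min = 200 in this sketch would double the target.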
For a dataset whose predictors are all continuous (numeric or integer), k-means is used for clustering. For a dataset whose predictors are all categorical (character or factor), k-modes is used. For a dataset with a mix of continuous and categorical predictors, k-prototypes is used.
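The selection rule above can be expressed as a small helper. The function name choose_clustering is hypothetical and not part of the package; it only reports which method the rule would pick for a given set of predictor columns.

```r
# Hypothetical helper, not part of the package: name the clustering
# method that the rule above would select for these predictors.
choose_clustering <- function(predictors) {
  # is.numeric() is TRUE for numeric and integer columns, FALSE for factors.
  is_cont <- vapply(predictors, is.numeric, logical(1))
  is_cat <- vapply(
    predictors,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  if (!all(is_cont | is_cat)) stop("unsupported predictor type")
  if (all(is_cont)) {
    "k-means"
  } else if (all(is_cat)) {
    "k-modes"
  } else {
    "k-prototypes"
  }
}

choose_clustering(data.frame(a = c(1.5, 2), b = 1:2))           # "k-means"
choose_clustering(data.frame(a = factor(c("u", "v"))))          # "k-modes"
choose_clustering(data.frame(a = 1:2, b = factor(c("u", "v")))) # "k-prototypes"
```

The outcome column should be dropped before calling the helper, since the rule refers to the predictors only.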
A new, balanced dataset.
Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718-5727.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100-108.
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD, 3(8), 34-39.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283-304.
data(abalone)
data(bank)
table(abalone$Class)
table(bank$deposit)
# predictors are a mix of continuous and categorical
newdata1 <- SBC(bank, 'deposit')
table(newdata1$deposit)
newdata2 <- SBC(bank, 'deposit', perc_min=200)
table(newdata2$deposit)
# predictors are all continuous
newdata3 <- SBC(abalone, 'Class')
table(newdata3$Class)
# predictors are all categorical
bank1 <- bank[, c(2, 3, 5, 11)]
newdata4 <- SBC(bank1, 'deposit')
table(newdata4$deposit)