findk: Estimate the Number of Clusters in a Data Set
In kpeaks: Determination of K Using Peak Counts of Features for Clustering

Using several descriptive statistics of the peak counts in the frequency polygons of the features, this function proposes a list of estimates of the number of clusters in a data set.

findk(x, binrule, nbins, tcmethod, tc, trmethod, tv, rms=FALSE, rcs=FALSE, tpc=1)

`x`	a numeric data frame or matrix.
`binrule`	a string specifying the binning rule to compute the number of classes of a frequency polygon.
`nbins`	an integer specifying the number of classes (bins). It is computed internally from the selected binning rule, unless `binrule` is `usr`, in which case it is supplied by the user. See all available options in `genpolygon`.
`tcmethod`	a string specifying the threshold method used to compute the threshold frequency value for discarding the small or empty bins of a frequency polygon. See all available options in `findpolypeaks`.
`tc`	an integer threshold frequency value, as determined by `tcmethod`.
`trmethod`	a string used to specify a removal method to discard the shoulders around the main peaks in a frequency polygon. See all available options in `rmshoulders`.
`tv`	a numeric threshold distance value, as determined by `trmethod`.
`rms`	a logical value indicating whether shoulders removal is applied. Default value is `FALSE`.
`rcs`	a logical value indicating whether the estimates of `k` are computed on the reduced counts set instead of the full set. Default value is `FALSE`; set it to `TRUE` to use the reduced counts set.
`tpc`	an integer threshold value for creating the reduced set of the peak counts. Default value is 1.

The function findk returns a list of k values proposed as estimates of the number of clusters in a given data set. The estimation is based on various descriptive statistics of the peak counts in the frequency polygons of the features. First, the classes of the frequency polygon of each feature are generated with the function genpolygon. Then, the main peaks in the frequency polygons are determined with the function findpolypeaks. Optionally, the function rmshoulders removes the shoulder peaks from the peaks matrix returned by findpolypeaks. Finally, the peaks in the resulting peaks matrix are counted for each feature, and a list of estimates of k is produced from various descriptive statistics of these peak counts.
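The idea of counting peaks per feature can be illustrated with a minimal base R sketch. It does not call genpolygon, findpolypeaks or rmshoulders and is not the package's implementation: the helper `count_peaks` is a hypothetical name, the bins are equal-width classes built with `hist`, and a peak is taken, as a simplifying assumption, to be any bin whose frequency strictly rises and then strictly falls. No thresholding or shoulder removal is applied.

```r
# Hypothetical helper (not part of kpeaks): count local maxima in the
# binned frequencies of a single numeric feature.
count_peaks <- function(v, nbins = nclass.Sturges(v)) {
  # Equal-width classes spanning the range of the feature
  breaks <- seq(min(v), max(v), length.out = nbins + 1)
  freqs <- hist(v, breaks = breaks, plot = FALSE)$counts
  # Pad with zeros so that the first and last bins can qualify as peaks
  d <- diff(c(0, freqs, 0))
  # A peak is a strict rise immediately followed by a strict fall;
  # flat-topped peaks (plateaus) are ignored in this simplified rule
  sum(d[-length(d)] > 0 & d[-1] < 0)
}

# Peak counts for each feature of the iris data set
pcounts <- sapply(iris[, 1:4], count_peaks)
print(pcounts)
```

Because the binning and peak rules here are simplified, the counts may differ from the `pcounts` element returned by findk.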

A list of estimates of k, consisting of the following items computed from the peak counts of the features in the given data set:

`am`	arithmetic mean of peak counts.
`med`	median of peak counts.
`mod`	mode of peak counts.
`cr`	center of the range of peak counts.
`ciqr`	center of the interquartile range (IQR) of peak counts.
`mppc`	overall mean of the pairwise means of peak counts.
`mq3m`	mean of the third quartile (Q3) and maximum of peak counts.
`mtl`	mean of the two largest values of peak counts.
`avgk`	proposed `k` as the mean of all the estimates.
`modk`	proposed `k` as the mode of all the estimates.
`mtlk`	proposed `k` as the mean of two largest estimates.
`dst`	a string representing the type of counts set which is used in computations.
`pcounts`	an integer vector containing the peak counts of the features.
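As a rough illustration of the statistics listed above, the following base R sketch computes them for the peak counts `1 4 2 3 2`. It follows the verbal definitions only and is not the package's code: the variable names are illustrative, the quartiles use R's default `quantile` method, and the values are left unrounded, since the rounding that findk applies to produce whole-number estimates is not described here.

```r
# Peak counts of five features (illustrative input)
pcounts <- c(1, 4, 2, 3, 2)

# Quartiles with R's default method (an assumption about the definition)
q <- quantile(pcounts, c(0.25, 0.75), names = FALSE)

stats <- c(
  am   = mean(pcounts),                                 # arithmetic mean
  med  = median(pcounts),                               # median
  mod  = as.numeric(names(which.max(table(pcounts)))),  # most frequent value
  cr   = mean(range(pcounts)),                          # center of the range
  ciqr = mean(q),                                       # center of the IQR
  mppc = mean(combn(pcounts, 2, FUN = mean)),           # mean of pairwise means
  mq3m = mean(c(q[2], max(pcounts))),                   # mean of Q3 and maximum
  mtl  = mean(sort(pcounts, decreasing = TRUE)[1:2])    # mean of two largest
)
print(stats)
```

Summaries such as `avgk` and `mtlk` would then be taken over these estimates; how findk rounds them to integers is left out of this sketch.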

The input arguments of the function findk are usually the outputs of the functions findpolypeaks and rmshoulders.

Zeynel Cebeci, Cagatay Cebeci

Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.

Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.

findpolypeaks, rmshoulders

# Estimate the number of clusters in x5p4c data set
data(x5p4c)
estk <- findk(x5p4c, binrule="sturges")
print(estk)
summary(estk$pcounts)
cat("Estimated the number of clusters as the mean of Q3 and max peak count:", estk$mq3m, fill=TRUE)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

# Estimate the number of clusters in x5p4c data set by using a user-defined number of bins,
# a user-defined threshold frequency (tc=1) and shoulders removal method 'avg'
estk <- findk(x5p4c, binrule="usr", nbins=15, tcmethod="usr", tc=1, trmethod="avg", rms=TRUE)
print(estk)
summary(estk$pcounts)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

# Estimate the number of clusters in iris data set
data(iris)
estk <- findk(iris[,1:4], binrule="bc", rcs=FALSE)
print(estk)
summary(estk$pcounts)
cat("Proposed number of clusters based on the mean of estimates:", estk$avgk, fill=TRUE)
cat("Proposed number of clusters based on the mode of estimates:", estk$modk, fill=TRUE)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

$am
[1] 3

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 4

$mtl
[1] 4

$avgk
[1] 3

$modk
[1] 2

$mtlk
[1] 4

$dst
[1] "Full"

$pcounts
[1] 1 4 2 3 2

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     2.0     2.4     3.0     4.0 
Estimated the number of clusters as the mean of Q3 and max peak count: 4
Proposed number of clusters based on the mean of two largest estimates: 4
$am
[1] 3

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 3

$mtl
[1] 3

$avgk
[1] 2

$modk
[1] 2

$mtlk
[1] 3

$dst
[1] "Full"

$pcounts
[1] 2 1 2 3 3

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     2.0     2.2     3.0     3.0 
Proposed number of clusters based on the mean of two largest estimates: 3
$am
[1] 2

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 3

$mtl
[1] 2

$avgk
[1] 2

$modk
[1] 2

$mtlk
[1] 2

$dst
[1] "Full"

$pcounts
[1] 2 1 2 3

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.75    2.00    2.00    2.25    3.00 
Proposed number of clusters based on the mean of estimates: 2
Proposed number of clusters based on the mode of estimates: 2
Proposed number of clusters based on the mean of two largest estimates: 2