findk: Estimate the Number of Clusters in a Data Set



Description

Based on several descriptive statistics of the peak counts in the frequency polygons of the features, this function proposes a list of estimates of the number of clusters in a data set.

Usage

findk(x, binrule, nbins, tcmethod, tc, trmethod, tv, rms=FALSE, rcs=FALSE, tpc=1)

Arguments

x

a numeric data frame or matrix.

binrule

a string specifying the binning rule to compute the number of classes of a frequency polygon.

nbins

an integer specifying the number of classes (bins). It is computed internally according to the selected binning rule, except when binrule is usr, in which case it must be supplied. See all available options in genpolygon.

tcmethod

a string representing a threshold method to compute a threshold frequency value for discarding the small or empty bins of a frequency polygon. See all available options in findpolypeaks.

tc

an integer threshold frequency value assigned by tcmethod.

trmethod

a string used to specify a removal method to discard the shoulders around the main peaks in a frequency polygon. See all available options in rmshoulders.

tv

a numeric threshold distance value assigned by trmethod.

rms

a logical value indicating whether shoulders removal is applied. Default value is FALSE.

rcs

a logical value indicating whether the estimates of k are computed on the reduced counts set instead of the full set. Default value is FALSE; set it to TRUE to use the reduced counts set.

tpc

an integer threshold value for creating the reduced set of the peak counts. Default value is 1.
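
As an illustration of what a binning rule such as sturges computes for nbins, the following minimal sketch applies Sturges' rule in base R. It assumes the standard formula ceiling(log2(n) + 1) and uses grDevices::nclass.Sturges for comparison; whether genpolygon computes the value in exactly this way is an assumption, not something stated on this page.

```r
# Sturges' rule: number of classes = ceiling(log2(n) + 1),
# where n is the number of observations.
n <- nrow(iris)
k_manual <- ceiling(log2(n) + 1)

# Base R ships the same rule in grDevices; it only uses the length of the vector.
k_base <- grDevices::nclass.Sturges(iris$Sepal.Length)

cat("Sturges bins (manual):", k_manual, fill = TRUE)
cat("Sturges bins (nclass.Sturges):", k_base, fill = TRUE)
```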

Details

The function findk returns a list of k values which are proposed as the estimates of the number of clusters in a given data set. The estimation is based on various descriptive statistics of the peak counts in the frequency polygon of the features. Firstly, the classes of frequency polygons of the features are generated by using the function genpolygon. Then, the main peaks in frequency polygons are determined by using the function findpolypeaks. If desired, with the function rmshoulders the shoulder peaks are removed from the peaks matrix returned by the function findpolypeaks. In the returned peaks matrix, the peaks are counted for each feature, and a list of estimates of k is produced by using various descriptive statistics of the peak counts.
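
The idea of counting peaks per feature can be sketched in base R without the package. This is a simplified illustration only: count_peaks is a hypothetical helper, it uses equal-width bins from hist, counts local maxima of the bin frequencies, and applies neither threshold filtering nor shoulders removal, so it is not the algorithm implemented by genpolygon, findpolypeaks or rmshoulders.

```r
# Hypothetical helper (not part of kpeaks): count the local maxima
# in the bin frequencies of a single numeric feature.
count_peaks <- function(v, nbins = 10) {
  # Equal-width bins spanning the range of the feature
  breaks <- seq(min(v), max(v), length.out = nbins + 1)
  freqs <- hist(v, breaks = breaks, plot = FALSE)$counts
  # Pad with zeros so peaks in the first or last bin are found,
  # and collapse runs of equal frequencies so a plateau counts once
  f <- rle(c(0, freqs, 0))$values
  i <- 2:(length(f) - 1)
  sum(f[i] > f[i - 1] & f[i] > f[i + 1])
}

# Toy data: feature a is bimodal, feature b is unimodal
x <- data.frame(
  a = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3),
  b = c(1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3)
)

# One peak count per feature, analogous in spirit to the pcounts item
pcounts <- sapply(x, count_peaks, nbins = 3)
print(pcounts)
```

Descriptive statistics of such a vector of peak counts are then what the estimates of k are derived from.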

Value

a list of the estimates of k, consisting of the following items, which are computed from the peak counts of the features in a given data set:

am

arithmetic mean of peak counts.

med

median of peak counts.

mod

mode of peak counts.

cr

center of the range of peak counts.

ciqr

center of the interquartile range (IQR) of peak counts.

mppc

overall mean of the pairwise means of peak counts.

mq3m

mean of the third quartile (Q3) and maximum of peak counts.

mtl

mean of the two largest values of peak counts.

avgk

proposed k as the mean of all the estimates.

modk

proposed k as the mode of all the estimates.

mtlk

proposed k as the mean of the two largest estimates.

dst

a string representing the type of counts set which is used in computations.

pcounts

an integer vector containing the peak counts of the features.
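
The descriptive statistics listed above can be approximated in base R from a vector of peak counts. The following is a minimal sketch, not the package's code: stat_mode and peak_stats are hypothetical helpers, ties in the mode are resolved by taking the smallest value, quartiles use the default type of quantile, and no rounding is applied because this page does not state how findk rounds its estimates.

```r
# Hypothetical helper: most frequent value; the smallest one wins on ties
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

# Hypothetical helper: unrounded descriptive statistics of the peak counts
# (needs at least two features for the pairwise means)
peak_stats <- function(pcounts) {
  q <- quantile(pcounts, c(0.25, 0.75), names = FALSE)
  top2 <- sort(pcounts, decreasing = TRUE)[1:2]
  list(
    am   = mean(pcounts),                         # arithmetic mean
    med  = median(pcounts),                       # median
    mod  = stat_mode(pcounts),                    # mode
    cr   = mean(range(pcounts)),                  # center of the range
    ciqr = mean(q),                               # center of the IQR
    mppc = mean(combn(pcounts, 2, FUN = mean)),   # mean of pairwise means
    mq3m = mean(c(q[2], max(pcounts))),           # mean of Q3 and maximum
    mtl  = mean(top2)                             # mean of two largest values
  )
}

# Peak counts of the form shown in the pcounts item
s <- peak_stats(c(1, 4, 2, 3, 2))
str(s)
```

Because the values are left unrounded, they will not necessarily equal the integers that findk prints for the same peak counts.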

Note

The input arguments of the function findk are usually the outputs of the functions findpolypeaks and rmshoulders.

Author(s)

Zeynel Cebeci, Cagatay Cebeci

References

Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.

Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.

See Also

findpolypeaks, rmshoulders

Examples

# Load the package that provides findk and the x5p4c data set
library(kpeaks)

# Estimate the number of clusters in the x5p4c data set
data(x5p4c)
estk <- findk(x5p4c, binrule="sturges")
print(estk)
summary(estk$pcounts)
cat("Estimated the number of clusters as the mean of Q3 and max peak count:", estk$mq3m, fill=TRUE)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

# Estimate the number of clusters in the x5p4c data set by using a user-defined number of bins (15),
# a user-defined threshold frequency (tc=1) and the shoulders removal method 'avg'
estk <- findk(x5p4c, binrule="usr", nbins=15, tcmethod="usr", tc=1, trmethod="avg", rms=TRUE)
print(estk)
summary(estk$pcounts)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

# Estimate the number of clusters in the iris data set (numeric columns only)
data(iris)
estk <- findk(iris[,1:4], binrule="bc", rcs=FALSE)
print(estk)
summary(estk$pcounts)
cat("Proposed number of clusters based on the mean of estimates:", estk$avgk, fill=TRUE)
cat("Proposed number of clusters based on the mode of estimates:", estk$modk, fill=TRUE)
cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)

Example output

$am
[1] 3

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 4

$mtl
[1] 4

$avgk
[1] 3

$modk
[1] 2

$mtlk
[1] 4

$dst
[1] "Full"

$pcounts
[1] 1 4 2 3 2

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     2.0     2.4     3.0     4.0 
Estimated the number of clusters as the mean of Q3 and max peak count: 4
Proposed number of clusters based on the mean of two largest estimates: 4
$am
[1] 3

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 3

$mtl
[1] 3

$avgk
[1] 2

$modk
[1] 2

$mtlk
[1] 3

$dst
[1] "Full"

$pcounts
[1] 2 1 2 3 3

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     2.0     2.2     3.0     3.0 
Proposed number of clusters based on the mean of two largest estimates: 3
$am
[1] 2

$med
[1] 2

$mod
[1] 2

$mppc
[1] 2

$cr
[1] 2

$ciqr
[1] 2

$mq3m
[1] 3

$mtl
[1] 2

$avgk
[1] 2

$modk
[1] 2

$mtlk
[1] 2

$dst
[1] "Full"

$pcounts
[1] 2 1 2 3

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.75    2.00    2.00    2.25    3.00 
Proposed number of clusters based on the mean of estimates: 2
Proposed number of clusters based on the mode of estimates: 2
Proposed number of clusters based on the mean of two largest estimates: 2

kpeaks documentation built on April 14, 2020, 7:37 p.m.