scseek: Initialization of cluster prototypes using SCS algorithm

View source: R/inaparc.R

scseekR Documentation

Initialization of cluster prototypes using SCS algorithm

Description

Initializes the cluster prototypes matrix with the Simple Cluster Seeking (SCS) algorithm (Tou & Gonzales, 1974).

Usage

scseek(x, k, tv)

Arguments

x

a numeric vector, data frame or matrix.

k

an integer for the number of clusters.

tv

a number to be used as the threshold distance which is directly input by the user. Also it is possible to compute T, a threshold distance value with the following options of tv argument:

  • T is the mean of differences between the consecutive pairs of objects with the option cd1.

  • T is the minimum of differences between the consecutive pairs of objects with the option cd2.

  • T is the mean of Euclidean distances between the consecutive pairs of objects divided into k with the option md. This is the default if tv is not supplied by the user.

  • T is the range of maximum and minimum of Euclidean distances between the consecutive pairs of objects divided into k with the option mm.

Details

The algorithm Simple Cluster Seeking (SCS) (Tou & Gonzales, 1974) is similar to Ball and Hall's algorithm (Ball & Hall, 1967) with an exception for selection of the first object (Celebi et al, 2013). In SCS, the first object in the data set is selected as the prototype of the first cluster. Then, the next object whose distance to the first prototype is greater than T, a threshold distance value is seeked and assigned as the second cluster prototype, if found. Afterwards, the next object whose distance to already determined prototypes is greater than T is searched and assigned as the third cluster prototype. The selection process is repeated for determining the prototypes of remaining clusters in similar way.

Because SCS is sensitive to the order of the data (Celebi et al, 2013), it may not yield good initializations with the sorted data. On the other hand, the distance between the cluster prototypes can be controlled T, which is an arbitrary number specified by the user. But the problem is that how the user decides on this threshold value. As a solution to this problem in the function scseek, some internally computed distance measures can be used. (See the section‘Arguments’ above for the available options.)

Value

an object of class ‘inaparc’, which is a list consists of the following items:

v

a numeric matrix of the initial cluster prototypes.

ctype

a string representing the type of centroid, which used to build prototype matrix. Its value is ‘obj’ with this function because the cluster prototype matrix contains the objects.

call

a string containing the matched function call that generates this ‘inaparc’ object.

Author(s)

Zeynel Cebeci, Cagatay Cebeci

References

Ball, G.H. & Hall, D.J. (1967). A clustering technique for summarizing multivariate data, Systems Res. & Behavioral Sci., 12 (2): 153-155.

Tou, J.T. & Gonzalez,R.C. (1974). Pattern Recognition Principles. Addison-Wesley, Reading, MA. <ISBN:9780201075861>

Celebi, M.E., Kingravi, H.A. & Vela, P.A. (2013). A comparative study of efficient initialization methods for the K-means clustering algorithm, Expert Systems with Applications, 40 (1): 200-210. arXiv:https://arxiv.org/pdf/1209.1960.pdf

See Also

aldaoud, ballhall, crsamp, firstk, forgy, hartiganwong, inofrep, inscsf, insdev, kkz, kmpp, ksegments, ksteps, lastk, lhsmaximin, lhsrandom, maximin, mscseek, rsamp, rsegment, scseek2, spaeth, ssamp, topbottom, uniquek, ursamp

Examples

data(iris)
# Run with the threshold value of 0.5
res <- scseek(x=iris[,1:4], k=5, tv=0.5)
v1 <- res$v
print(v1)

# Run with the internally computed default threshold value 
res <- scseek(x=iris[,1:4], k=5)
v2 <- res$v
print(v2)

inaparc documentation built on June 16, 2022, 5:09 p.m.