inaparc-package: Initialization Algorithms for Partitioning Cluster Analysis

Description Details Author(s) References See Also

Description

Partitioning clustering algorithms divide data sets into k subsets or partitions which are so-called clusters. They require some initialization procedures for starting to partition the data sets. Initialization of cluster prototypes is one of such kind of procedures for most of the partitioning algorithms. Cluster prototypes are the data elements, i.e. centroids or medoids, representing the clusters in a data set. In order to initialize cluster prototypes, the package ‘inaparc’ contains a set of the functions that are the implementations of widely-used algorithms in addition to some novel techniques. Initialization of fuzzy membership degrees matrices is another important task for starting the probabilistic and possibilistic partitioning algorithms. In order to initialize membership degrees matrices required by these algorithms, the package ‘inaparc’ contains a number of functions for most of the data independent and dependent initialization techniques (Borgelt, 2005) which are categorized as the linear time-complexity and loglinear time complexity-initialization methods in Celebi et al (2013).

Details

Clustering is one of the most widely used exploratory statistical analysis in data mining. Its goal is to explore the groups of objects that are similar to each other within the group but different from the objects in other groups. According to a common taxonomy, the existing clustering algorithms are classified in two groups: Hierarchical and Non-hierarchical (or flat) algorithms (Rokah & Maimon, 2005). As a dominant subfamily of non-hierarchical algorithms, the partitioning clustering algorithms divide data objects into a pre-defined number of clusters, which are the non-overlapping subsets of data. Although the choice of an appropriate algorithm for any clustering task depends on many criteria or purposes. When data size and dimensions are the concerned criteria, the non-hierarchical algorithms may be more practical way of clustering the large size and high dimensional data sets because they quickly process the large data sets when compared to the hierarchical clustering algorithms.

As the most crowded group of the partitioning clustering tools, the prototype-based algorithms partition data objects into clusters in which each data object is more similar to its prototype than the prototypes of other clusters. On clustering context, a prototype is a typical data item that represents or characterizes a cluster (Tan et al. 2006). Usually, it can be regarded as the most central data point in a data subspace so-called cluster. The prototype of a cluster is so often a centroid, i.e., the mean of all the objects in a cluster. On the other hand, centroids can not be computed for non-numeric data, i.e., on nominal or ordinal data. In such case, medoids can be used as the prototypes of clusters (Tan et al, 2006).

Initialization or seeding is a process for selecting the starting values of cluster prototypes matrix which serves the initial representatives of clusters. It is an important task in partitioning cluster analysis because it is known that the final clustering result is to be highly sensitive to the initial prototypes of the clusters (Khan, 2012). When the prototypes are chosen to be equal or close to the actual centers of clusters in a data set, the partitioning converges quickly and yields quality results. Contrarily, poor initializations of prototype matrix may result with no-good quality of final partitions.

In fuzzy and possibilistic clustering, an object is a member of all clusters in varying degrees of membership instead of being a member of only one cluster. A membership degrees matrix is required by the fuzzy clustering algorithms, i.e., Fuzzy C-means (FCM) (Bezdek, 1981). Initialization of membership degrees for starting FCM and its various variants must satisfy the following constraints:

u_{ij}\in[0,1]; 1≤ i ≤ n, 1≤ j ≤ k

∑\limits_{j=1}^k u_{ij}=1; 1≤ i ≤ n

0<∑\limits_{i=1}^n u_{ij} < n ; 1≤ j ≤ k

Membership degrees matrices are usually initialized with the techniques based on random number generating as the function imembrand does. In addition to these common techiques, a novel technique using the information from synthetically produced classes over a selected feature is provided in the package ‘inaparc’. The novel technique which is implemented in imembucr may contribute to the fast convergence of the clustering algorithms when compared to the random sampling based techniques. The package also serves the functions for building hard or crisp membership degrees which can be used for testing purposes.

Author(s)

Zeynel Cebeci, Cagatay Cebeci

References

Bezdek J.C. (1981). Pattern recognition with fuzzy objective function algorithms. Plenum, NY, 256 p. <ISBN:0306406713>

Borgelt, C., (2005). Prototype-based classification and clustering. Habilitationsschrift zur Erlangung der Venia legendi fuer Informatik, vorgelegt der Fakultaet fuer Informatik der Otto-von-Guericke-Universitaet Magdeburg, Magdeburg, 22 June 2005. url:http://www.borgelt.net/habil/pbcc.pdf

Rokah, L. & Maimon, O. (2005). Clustering methods. In Data Mining and Knowledge Discovery Handbook (ed. O. Maimon), Springer US. pp. 321-352. url:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.9326&rep=rep1&type=pdf

Tan, P. N., Steinbach, M., & Kumar, V. (2006). Cluster analysis: Basic concepts and algorithms. In Introduction to Data Mining. Pearson Addison Wesley. url:http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf

Khan, F. (2012). An initial seed selection algorithm for k-means clustering of georeferenced data to improve replicability of cluster assignments for mapping application. Applied Soft Computing, 12 (11) : 3698-3700. doi:https://doi.org/10.1016/j.asoc.2012.07.021

Celebi, M.E., Kingravi, H.A. & Vela, P.A. (2013). A comparative study of efficient initialization methods for the K-means clustering algorithm, Expert Systems with Applications, 40 (1): 200-210. arXiv:https://arxiv.org/pdf/1209.1960.pdf

See Also

aldaoud, ballhall, crsamp, firstk, forgy, hartiganwong, imembones, imembrand, imembucr, inofrep, inscsf, insdev, is.inaparc, kkz, kmpp, ksegments, ksteps, lastk, lhsmaximin, lhsrandom, maximin, mscseek, rsamp, rsegment, scseek, scseek2, spaeth, ssamp, topbottom, uniquek, ursamp


inaparc documentation built on Nov. 17, 2017, 7:53 a.m.