cec: Cross-Entropy Clustering

Description

Performs Cross-Entropy Clustering on a data matrix.

Usage

cec(x, centers, type = c("covariance", "fixedr", "spherical", "diagonal",
    "eigenvalues", "mean", "all"), iter.max = 25, nstart = 1, param,
    centers.init = c("kmeans++", "random"), card.min = "5%", keep.removed = F,
    interactive = F, threads = 1, split = F, split.depth = 8, split.tries = 5,
    split.limit = 100, split.initial.starts = 1, readline = T)

Arguments

x

Numeric matrix of data.

centers

Either a matrix of initial centers or the number of initial centers: a single number k, as in cec(data, 4, ...), or a vector for a variable number of centers, as in cec(data, 3:10, ...).

If centers is a vector, length(centers) clusterings will be performed for each start (nstart argument) and the total number of clusterings will be length(centers) * nstart.

If centers is a number or a vector, initial centers will be generated using a method depending on the centers.init argument.
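As a quick check of the rule above, the following sketch counts the clusterings a call would run. The values of centers and nstart are illustrative only.

```r
# Count how many clusterings a call would run, per the rule stated above.
centers <- 3:10   # eight candidate numbers of initial centers
nstart  <- 4      # illustrative number of starts

total_clusterings <- length(centers) * nstart
total_clusterings  # 8 * 4 = 32
```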

type

Type (or types) of clustering (density family). This can be either a single value or a vector of length equal to the number of centers. Possible values are: "covariance", "fixedr", "spherical", "diagonal", "eigenvalues", "mean", "all" (default).

Currently, if the centers argument is a vector, only a single type can be used.

iter.max

Maximum number of iterations at each clustering.

nstart

The number of clusterings to perform (with different initial centers). Only the best clustering (the one with the lowest cost) will be returned. A value greater than one is valid only if the centers argument is a number or a vector.

If the centers argument is a vector, length(centers) clusterings will be performed for each start and the total number of clusterings will be length(centers) * nstart.

If the split mode is on (split = T), changing this parameter is rarely desirable, because the whole procedure (initial clustering + split) will be performed nstart times.
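The "keep only the lowest-cost run" idea can be sketched in base R. This is an illustration only: stats::kmeans stands in for the clustering routine and its within-cluster sum of squares stands in for the cost, so it does not reproduce what cec does internally.

```r
# Multi-start selection sketch. stats::kmeans is a stand-in for the clustering
# routine; it is NOT how cec() is implemented.
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

nstart <- 4
runs <- lapply(seq_len(nstart), function(i) {
  kmeans(x, centers = 2, iter.max = 25, nstart = 1)
})

# Cost of each run (within-cluster sum of squares for k-means).
costs <- vapply(runs, function(r) r$tot.withinss, numeric(1))

# Keep only the run with the lowest cost.
best <- runs[[which.min(costs)]]
best$tot.withinss
```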

centers.init

Centers initialization method. Possible values are: "kmeans++" (default), "random".
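For readers unfamiliar with "kmeans++", the general seeding scheme can be sketched as follows: the first center is drawn uniformly, and each later center is drawn with probability proportional to the squared distance to the nearest center chosen so far. The helper name kmeanspp_init is hypothetical and this is not the package's own code.

```r
# k-means++ seeding sketch (hypothetical helper, not the package's code).
kmeanspp_init <- function(x, k) {
  n <- nrow(x)
  # squared Euclidean distance from every row of x to one center
  sqdist <- function(center) rowSums(sweep(x, 2, center)^2)

  chosen <- integer(k)
  chosen[1] <- sample.int(n, 1)          # first center: uniform draw
  d2 <- sqdist(x[chosen[1], ])           # distance to nearest chosen center
  for (j in seq_len(k)[-1]) {
    chosen[j] <- sample.int(n, 1, prob = d2)  # far points are favoured
    d2 <- pmin(d2, sqdist(x[chosen[j], ]))
  }
  x[chosen, , drop = FALSE]
}

set.seed(42)
x <- matrix(rnorm(200), ncol = 2)
init <- kmeanspp_init(x, 3)
dim(init)  # one row per initial center
```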

param

Parameter (or parameters) specific to a particular type of clustering. Not all types of clustering require a parameter. Types that require one: "covariance" (matrix parameter), "fixedr" (numeric parameter), "eigenvalues" (vector parameter). This can be a vector or a list (when one of the parameters is a matrix or a vector).
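The sketch below shows why a list is needed once a matrix or vector parameter is involved: c() would flatten everything into plain numbers. It only builds the objects that would be passed as type and param. The ordering convention (parameters listed in the order of the types that need one) is an assumption, not something this page states.

```r
# One clustering type per center; "spherical" takes no parameter.
types <- c("covariance", "fixedr", "spherical", "eigenvalues")

# ASSUMPTION: parameters are listed in the order of the types that need one.
params <- list(
  matrix(c(1, 0, 0, 1), nrow = 2),  # "covariance": a covariance matrix
  1.5,                              # "fixedr": a single number
  c(0.02, 0.002)                    # "eigenvalues": a vector
)

# A list is needed because c() would flatten the matrix and the vector.
flattened <- c(matrix(c(1, 0, 0, 1), nrow = 2), 1.5, c(0.02, 0.002))
length(params)     # 3 separate parameters
length(flattened)  # 7 plain numbers, structure lost
```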

card.min

Minimal cluster cardinality. If the cardinality of a cluster drops below card.min, the cluster is removed. This argument can be either an integer or a string ending with a percent sign (e.g. "5%").
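To make the two accepted forms concrete, here is a minimal sketch of how a percentage string could be turned into an absolute point count. The helper card_min_count is hypothetical, and rounding up is an assumption; the package may round differently.

```r
# Hypothetical helper: turn a card.min value into an absolute point count.
# ASSUMPTION: percentages are rounded up; the package may round differently.
card_min_count <- function(card.min, n_points) {
  if (is.character(card.min) && grepl("^[0-9.]+%$", card.min)) {
    pct <- as.numeric(sub("%$", "", card.min))
    return(ceiling(n_points * pct / 100))
  }
  as.integer(card.min)
}

card_min_count("5%", 3000)  # 5% of 3000 points = 150
card_min_count(20, 3000)    # an integer is used as given
```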

keep.removed

If this parameter is TRUE, removed clusters will be visible in the results as NA rows in the centers matrix (and as corresponding entries in the list of covariances).

interactive

Interactive mode. If TRUE, the result of clustering will be plotted after every iteration.

threads

Specifies the number of threads to use or "auto" to use default number of threads (usually the number of available processing units/cores) when performing multiple starts (nstart parameter).

The execution of a single start is always performed by a single thread, thus for nstart = 1 only one thread will be used regardless of the value of this parameter.

split

Enables split mode. This mode discovers new clusters after initial clustering, by trying to split single clusters into two to lower the cost function.

For each start (nstart), initial clustering will be performed and then split. The number of starts in the initial clustering before split is driven by the split.initial.starts parameter.

split.depth

Cluster subdivision depth used in split mode. Usually a value less than 10 is sufficient (when, after each subdivision, the new clusters have similar sizes). For some data, subdivisions may often produce a cluster (one of the two) that will not be split further; in that case a higher value of split.depth is required.

split.tries

The number of attempts that are made when trying to split a cluster in split mode.

split.limit

Maximum number of centers to be discovered in split mode.

split.initial.starts

The number of 'standard' starts performed before starting split.

readline

Used only in the interactive mode. If readline is TRUE, at each iteration, before plotting it will wait for the user to press <Return> instead of standard "before plotting" (par(ask = TRUE)) waiting.

Details

In the context of implementation, Cross-Entropy Clustering (CEC) aims to partition m points into k clusters so as to minimize the cost function (energy E of the clustering) by switching the points between clusters. The presented method is based on an adapted Hartigan approach, in which clusters whose cardinalities drop below a small, predefined level are removed.

The energy function E is given by:

E(Y1, F1; ...; Yk, Fk) = ∑(p(Yi) * (-ln(p(Yi)) + H(Yi | Fi)))

where Yi denotes the i-th cluster, p(Yi) is the ratio of the number of points in the i-th cluster to the total number of points, and H(Yi|Fi) is the value of the cross-entropy, which represents the internal cluster energy function of data Yi defined with respect to a certain Gaussian density family Fi. The family Fi encodes the type of clustering we consider.
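The energy formula can be evaluated directly for a given labelling. The sketch below is not the package's code. It assumes the unrestricted Gaussian family, for which H(Yi|Fi) is taken to be the entropy of a Gaussian with the cluster's maximum-likelihood covariance, (d/2) ln(2πe) + (1/2) ln det(Σ); the function name cec_energy is hypothetical.

```r
# Energy of a labelling, with H taken from the unrestricted Gaussian family
# (ASSUMPTION: H = d/2 * ln(2*pi*e) + 1/2 * ln(det(Sigma))).
cec_energy <- function(x, labels) {
  n <- nrow(x)
  d <- ncol(x)
  total <- 0
  for (k in unique(labels)) {
    yk <- x[labels == k, , drop = FALSE]
    p  <- nrow(yk) / n                             # p(Yi)
    sigma <- cov(yk) * (nrow(yk) - 1) / nrow(yk)   # maximum-likelihood covariance
    h <- d / 2 * log(2 * pi * exp(1)) + 0.5 * log(det(sigma))  # H(Yi | Fi)
    total <- total + p * (-log(p) + h)
  }
  total
}

set.seed(1)
x <- rbind(matrix(rnorm(1000), ncol = 2),
           matrix(rnorm(1000, mean = 10), ncol = 2))
true_labels   <- rep(1:2, each = 500)
random_labels <- sample(true_labels)

cec_energy(x, true_labels)    # labels match the two groups: lower energy
cec_energy(x, random_labels)  # shuffled labels: higher energy
```

A labelling that matches the two well-separated groups yields a lower energy than a shuffled one, which is the quantity the point-switching procedure tries to reduce.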

The value of the internal energy function H depends on the covariance matrix (computed using the maximum-likelihood method) and the mean (in the case of the mean model) of the points in the cluster. Seven implementations of H have been proposed, each expressed as a type (model) of the clustering: "all", "covariance", "fixedr", "spherical", "diagonal", "eigenvalues" and "mean".

The cec function allows clustering types to be mixed.

Value

Returns an object of class "cec" with available components: "data", "cluster", "probabilities", "centers", "cost.function", "nclusters", "iterations", "cost", "covariances", "covariances.model", "time".
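Components are read with the usual $ accessor. The sketch below uses a small mock list with the same component names rather than a real "cec" object, so it runs without the package. It assumes that cluster holds one cluster index per data point and that those indices refer to rows of centers.

```r
# Mock object with the same component names as a "cec" result (stand-in data).
Z <- list(
  cluster   = c(1, 1, 2, 2, 2, 3),
  centers   = matrix(c(0, 0, 3, 3, 3, -2), ncol = 2, byrow = TRUE),
  nclusters = 3
)

# Cluster sizes and their share of all points.
sizes  <- table(Z$cluster)
shares <- as.numeric(sizes) / length(Z$cluster)
shares

# Center of the cluster that the first point was assigned to.
Z$centers[Z$cluster[1], ]
```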

Author(s)

Konrad Kamieniecki, Jacek Tabor, Przemysław Spurek

References

Spurek, P. and Tabor, J. (2014) Cross-Entropy Clustering. Pattern Recognition 47(9), 3046–3059.

See Also

CEC-package.

Examples

#
# Cross-Entropy Clustering
#

## Example of clustering random data set of 3 Gaussians, 
## 10 random initial centers and 7% as minimal cluster size.

m1 = matrix(rnorm(2000, sd=1), ncol=2)
m2 = matrix(rnorm(2000, mean = 3, sd = 1.5), ncol = 2)
m3 = matrix(rnorm(2000, mean = 3, sd = 1), ncol = 2)
m3[,2] = m3[,2] - 5
m = rbind(m1, m2, m3)
par(ask = TRUE)
plot(m, cex = 0.5, pch = 19)
## Clustering result:
Z = cec(m, 10, iter.max = 100, card.min="7%")
plot(Z)
# Result:
Z

## Example of clustering mouse-like set using spherical Gaussian densities.
m = mouseset(n=7000, r.head=2, r.left.ear=1.1, r.right.ear=1.1, left.ear.dist=2.5,
right.ear.dist=2.5, dim=2)
plot(m, cex = 0.5, pch = 19)
## Clustering result:
Z = cec(m, 3, type="sp", iter.max = 100, nstart=4, card.min="5%")
plot(Z)
# Result:
Z

## Example of clustering data set "Tset" using "eigenvalues" clustering type.
data(Tset)
plot(Tset, cex = 0.5, pch = 19)
centers = init.centers(Tset, 2)
## Clustering result:
Z <- cec(Tset, 5, "eigenvalues", param=c(0.02,0.002), nstart=4)
plot(Z)
# Result:
Z

## Example of using CEC split method starting with a single cluster.
data(mixShapes)
plot(mixShapes, cex = 0.5, pch = 19)
## Clustering result:
Z <- cec(mixShapes, 1, split=TRUE)
plot(Z)
# Result:
Z

Example output

CEC clustering result: 

Probability vector:
[1] 0.3323333 0.3540000 0.3136667

Means of clusters:
           [,1]         [,2]
[1,] 3.09326320 -2.003513610
[2,] 0.02232949 -0.005592356
[3,] 3.08756948  3.186625149

Cost function:
[1] 4.110942

Number of clusters:
[1] 3

Number of iterations:
[1] 21

Computation time:
[1] 0.032

Available components:
 [1] "data"              "cluster"           "probabilities"    
 [4] "centers"           "cost.function"     "nclusters"        
 [7] "iterations"        "covariances"       "covariances.model"
[10] "time"

CEC clustering result: 

Probability vector:
[1] 0.1878571 0.1784286 0.6337143

Means of clusters:
             [,1]        [,2]
[1,] -1.821308170  1.81283673
[2,]  1.820294791  1.85661636
[3,]  0.006953714 -0.09758019

Cost function:
[1] 3.23323

Number of clusters:
[1] 3

Number of iterations:
[1] 15

Computation time:
[1] 0.098

Available components:
 [1] "data"              "cluster"           "probabilities"    
 [4] "centers"           "cost.function"     "nclusters"        
 [7] "iterations"        "covariances"       "covariances.model"
[10] "time"

CEC clustering result: 

Probability vector:
[1] 0.3646778 0.1422434 0.3536993 0.1393795

Means of clusters:
          [,1]      [,2]
[1,] 0.4794157 0.2081635
[2,] 0.7600415 0.9506202
[3,] 0.4807913 0.7344452
[4,] 0.2100561 0.9512146

Cost function:
[1] -0.8761754

Number of clusters:
[1] 4

Number of iterations:
[1] 18

Computation time:
[1] 0.302

Available components:
 [1] "data"              "cluster"           "probabilities"    
 [4] "centers"           "cost.function"     "nclusters"        
 [7] "iterations"        "covariances"       "covariances.model"
[10] "time"

CEC clustering result: 

Probability vector:
[1] 0.1435556 0.1427778 0.1404444 0.1453333 0.1401111 0.1450000 0.1427778

Means of clusters:
          [,1]      [,2]
[1,] 485.59620 168.18558
[2,] 368.08445 203.08078
[3,] 470.67809  30.09067
[4,]  79.96403 263.55175
[5,] 205.68965 399.95641
[6,] 160.00748 310.04231
[7,] 200.07333 100.05577

Cost function:
[1] 10.14958

Number of clusters:
[1] 7

Number of iterations:
[1] 3

Computation time:
[1] 0.298

Available components:
 [1] "data"              "cluster"           "probabilities"    
 [4] "centers"           "cost.function"     "nclusters"        
 [7] "iterations"        "covariances"       "covariances.model"
[10] "time"             

CEC documentation built on May 2, 2019, 1:59 p.m.