divclust: Monothetic divisive hierarchical clustering


Description

DIVCLUS-T is a divisive hierarchical clustering algorithm based on a monothetic bipartitional approach that allows the dendrogram of the hierarchy to be read as a decision tree. It is designed for numerical, categorical (ordered or not), or mixed data. Like the Ward agglomerative hierarchical clustering algorithm and the k-means partitioning algorithm, it is based on the minimization of the inertia criterion. However, it provides a simple and natural monothetic interpretation of the clusters: each cluster is described by a set of binary questions. The inertia criterion is computed on all the principal components of PCAmix (and hence on standardized data in the purely numerical case).

Usage

divclust(data, K = NULL)

Arguments

data

a data frame with numerical and/or categorical variables. If a variable is ordinal, the corresponding column must be of class factor created with the argument ordered=TRUE.
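For example, a minimal sketch of declaring an ordinal variable, using a hypothetical data frame df with a column Size:

df <- data.frame(Size = c("small", "large", "medium", "small"))
# declare the natural order of the categories and mark the factor as ordered
df$Size <- factor(df$Size, levels = c("small", "medium", "large"), ordered = TRUE)
levels(df$Size)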

K

the number of final clusters (leaves of the tree). If K is not specified, the complete dendrogram is built.

Details

The tree has K leaves, corresponding to a partition in K clusters, when K is specified in input. Otherwise, each final cluster contains a single observation and the tree is the complete dendrogram. The between-cluster inertia of the final partition (the leaves) is the sum of the heights of the clusters in the tree. The total inertia of a quantitative dataset is equal to p1, the number of quantitative variables. The total inertia of a qualitative dataset is m-p2, where m is the total number of categories and p2 is the number of qualitative variables. For a mixture of quantitative and qualitative data, the total inertia is p1+m-p2. The quality of a partition is the proportion of inertia explained by the partition, that is, the between-cluster inertia divided by the total inertia.

The height of a cluster in the dendrogram of divclust is the inertia variation, which is also the aggregation criterion of Ward used in agglomerative hierarchical clustering. As for Ward clustering, it can help in choosing the number of clusters. For ordered qualitative variables (factors created with the argument ordered=TRUE), the order of the categories is used to reduce the number of possible binary questions.
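These relations can be checked directly on a fitted object; a minimal sketch, assuming the divclust package and its protein dataset (used in the Examples below) are available:

library(divclust)
data(protein)                   # quantitative data, so the total inertia T is p1
res <- divclust(protein, K=5)   # 5-cluster tree
all.equal(res$T, ncol(protein))            # T equals p1, the number of variables
all.equal(res$B, sum(res$height) / res$T)  # B is the sum of the heights over T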

Value

tree

an internal tree

clusters

the list of observations in each final cluster (the leaves of the tree)

description

the monothetic description of each final cluster (the leaves of the tree)

which_cluster

a vector of integers indicating the final cluster of each observation

height

the height of the clusters in the dendrogram of the tree

B

the proportion of inertia explained by the final partition (between-cluster inertia/total inertia)

data_quanti

the quantitative data set

data_quali

the qualitative data set

mod_quali

the list of categories of qualitative variables

vec_quali

the number of categories of each qualitative variable

kmax

the number of distinct observations, i.e. the maximal number of leaves

T

the total inertia

See Also

plot.divclust, cutreediv
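For instance, a minimal sketch of how these companions fit together, assuming cutreediv(tree, K) cuts a complete divclust dendrogram into K clusters:

library(divclust)
data(protein)
tree <- divclust(protein)      # complete dendrogram
plot(tree)                     # dendrogram readable as a decision tree
part <- cutreediv(tree, K=5)   # assumed signature: cut the tree into 5 clusters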

Examples

data(protein) # purely quantitative data
tree <- divclust(protein) # full clustering
plot(tree)
plot(1:(tree$kmax-1), tree$height, xlab="number of clusters", ylab="height", main="Split levels")
c_5 <- divclust(protein, K=5) # stops the clustering at 5 clusters
plot(c_5,nqbin=4)
c_5$B*100 # percentage of explained inertia
c_5$clusters  # retrieve the list of observations in each cluster
c_5$description # and their monothetic description

data(dogs) # pure qualitative data
tree <- divclust(dogs) # full clustering
plot(tree)
plot(1:(tree$kmax-1), tree$height, xlab="number of clusters", ylab="height", main="Split levels")
c_4 <- divclust(dogs, K=4) # stops the clustering at 4 clusters
plot(c_4)
c_4$clusters # retrieve the list of observations in each cluster
c_4$description # and their monothetic description
c_4$which_cluster # vector indicating the final cluster of each observation
c_4$B*100 # percentage of explained inertia

dogs2 <- dogs # take the order of categories into account (to reduce the complexity)
levels(dogs$Size)
size2 <- factor(dogs$Size, levels=c("small","medium","large")) # changes the order of the levels
levels(size2)
dogs2$Size <- ordered(size2) # makes Size an ordered factor (ordered=TRUE)
tree <- divclust(dogs2) # full clustering with variable Size considered as ordered.
plot(tree) #the constraint on the order changes the clustering

data(wine) # mixed data
data <- wine[,1:29]
c_tot <- divclust(data) # full clustering
plot(c_tot)
c_4 <- divclust(data, K=4) # stops the clustering at 4 clusters
plot(c_4)
p2 <- length(c_4$vec_quali)
p1 <- ncol(data)-p2
sum(c_4$height)/(p1+sum(c_4$vec_quali)-p2)*100 # percentage of explained inertia
c_tot$tree$v # internal content of the root node
c_tot$tree$r$v # internal content of the right child of the root node
