pdfCluster-package: The pdfCluster package: summary information


Description

This package performs cluster analysis via kernel density estimation (Azzalini and Torelli, 2007; Menardi and Azzalini, 2014). Clusters are associated with the maximally connected components of the regions where the estimated density exceeds a threshold. As the threshold varies, these clusters can be arranged in a hierarchical structure in the form of a tree. When data dimensionality is low to moderate, the connected regions are detected by means of the Delaunay tessellation, following Azzalini and Torelli (2007). For higher-dimensional data, they are detected according to the procedure described in Menardi and Azzalini (2014). In both cases, once a number of high-density cluster cores have been identified, the lower-density data are allocated to them by a supervised classification-like approach. The number of clusters, which corresponds to the number of modes of the estimated density, is selected automatically by the procedure. Diagnostic methods for evaluating the quality of the clustering are also available (Menardi, 2011). In addition, the package provides a routine to estimate the probability density function by kernel methods for data of arbitrary dimension. The main features of the package are described and illustrated in Azzalini and Menardi (2014).
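
As a rough illustration of the level-set idea described above, the following base R sketch works on simulated one-dimensional data. It is not the package's implementation: the simulated sample, the bandwidth rule bw.nrd0 and the helper count_components are assumptions chosen for illustration, and in one dimension neighbouring points in sorted order stand in for the Delaunay edges.

```r
# Toy 1-D illustration in base R; NOT the pdfCluster implementation.
set.seed(1)
x <- sort(c(rnorm(50, mean = 0), rnorm(50, mean = 6)))

# Gaussian kernel density estimate evaluated at the data points themselves.
h <- bw.nrd0(x)  # assumed bandwidth rule, chosen only for this sketch
fhat <- sapply(x, function(xi) mean(dnorm((xi - x) / h)) / h)

# Hypothetical helper: count the connected components among the points
# whose estimated density exceeds a threshold. With sorted 1-D data, a
# component is a run of consecutive TRUE values.
count_components <- function(above) {
  runs <- rle(above)
  sum(runs$values)
}

# Number of components at a low, a middle and a high density threshold.
thresholds <- quantile(fhat, probs = c(0.1, 0.5, 0.9))
n_comp <- sapply(thresholds, function(t) count_components(fhat > t))
print(n_comp)
```

Tracking how the component count changes as the threshold moves is what gives rise to the tree of clusters; the package carries out the analogous steps in higher dimensions with its own graph-building routines.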

Details

The pdfCluster package uses classes and methods of the S4 system. It includes some foreign functions written in C: two of them compute the kernel density estimate of the data and are interfaced by the R function kepdf. Other C routines allow for a quicker detection of the connected components of the subgraphs associated with the level sets of the data; two of these are taken directly from the functions of the same name in the spdep package.
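
Because the fitted objects are S4 instances, their structure can be inspected with the standard tools of the methods package. The sketch below is only illustrative: describe_s4 is a hypothetical helper, not part of pdfCluster, and the package-specific part runs only if pdfCluster is installed.

```r
library(methods)

# Hypothetical helper (not part of pdfCluster): report the class and the
# slot names of any S4 object.
describe_s4 <- function(obj) {
  stopifnot(isS4(obj))
  list(class = as.character(class(obj)), slots = slotNames(obj))
}

# Applied to a kepdf fit, if the package is available.
if (requireNamespace("pdfCluster", quietly = TRUE)) {
  data(wine, package = "pdfCluster")
  fit <- pdfCluster::kepdf(wine[, c(2, 5, 8)])
  str(describe_s4(fit))
}
```

Individual slots can then be read with the @ operator or slot(), using the names that slotNames reports.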

Starting from version 1.0-0, new features have been introduced. These involve some changes in the package options.

See the examples below for how to set the arguments of the main function of the package in order to obtain the same results as with versions 0.1-x.

Author(s)

Adelchi Azzalini, Giovanna Menardi, Tiziana Rosolin

Maintainer: Giovanna Menardi <menardi at stat.unipd.it>

References

Azzalini, A., Menardi, G. (2014). Clustering via Nonparametric Density Estimation: The R Package pdfCluster. Journal of Statistical Software, 57(11), 1-26, URL http://www.jstatsoft.org/v57/i11/.

Azzalini, A., Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17, 71-80.

Menardi, G. (2011). Density based Silhouette diagnostics for clustering methods. Statistics and Computing, 21, 295-308.

Menardi, G., Azzalini, A. (2014). An advancement in clustering via nonparametric density estimation. Statistics and Computing, DOI: 10.1007/s11222-013-9400-x, to appear.

Examples

# load data
data(wine)
gr <- wine[, 1]

# select a subset of variables
x <- wine[, c(2, 5, 8)]

#density estimation
pdf <- kepdf(x)
summary(pdf)
plot(pdf)

#clustering
cl <- pdfCluster(x)
summary(cl)
plot(cl)

#comparison with original groups
table(groups(cl),gr)

#density based silhouette diagnostics
dsil <- dbs(cl)
plot(dsil)

##########
# higher dimensions

x <- wine[, -1]

# density estimation with adaptive bandwidth
pdf <- kepdf(x, bwtype = "adaptive")
summary(pdf)
# the density plot is not very clear for high-dimensional data,
# so select a few variables
plot(pdf, indcol = c(1, 4, 7))

#clustering
#when dimension is >= 6, default method to find connected components is "pairs"
#density is better estimated by using an adaptive bandwidth
cl <- pdfCluster(x, bwtype="adaptive")
summary(cl)
plot(cl)

########
# this example shows how to set the arguments in function pdfCluster
# in order to obtain the same results as the ones of versions 0.1-x.
x <- wine[, c(2, 5, 8)]

# previous versions of the package 
# do not run
# old code: 
# cl <- pdfCluster(x)

# the same result is now obtained as follows:
cl <- pdfCluster(x, se = FALSE, hcores = TRUE, graphtype = "delaunay", n.grid = 50)
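
The agreement between clusters and original groups in the cross-tabulation produced by table(groups(cl), gr) can be summarised as a single proportion. The sketch below re-enters the counts printed in the example output for the three-variable fit and matches each cluster to its most frequent group; this majority matching is an assumption made for illustration, not a diagnostic provided by the package.

```r
# Counts copied from the example output (rows: clusters, columns: gr).
tab <- matrix(c(58,  4,  0,
                 1, 62,  0,
                 0,  5, 48),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              gr = c("Barolo", "Grignolino", "Barbera")))

# Row totals reproduce the "Final groupings" counts: 62, 63, 53.
rowSums(tab)

# Match each cluster to its most frequent original group and compute the
# proportion of observations that fall in the matched cells.
matched <- apply(tab, 1, max)
agreement <- sum(matched) / sum(tab)
round(agreement, 3)  # 168 / 178, i.e. 0.944
```

With an object cl available, the same computation applies directly to table(groups(cl), gr) instead of the hand-entered matrix.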

Example output

pdfCluster 1.0-2

PLEASE NOTE:  New features have been introduced in version 1.0-0
These involve some changes in the package options
see "help("pdfCluster-package")" for the setting which
reproduce the functioning of the previous versions. 


An S4 object of class "kepdf"

The highest density data point has position 52 in the sample data 
 
Rows of  75 % top density data points: 1 3 4 5 6 8 10 11 12 13 16 18 20 21 22 23 24 25 27 28 29 30 32 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 49 50 52 54 55 56 57 58 59 62 65 66 67 68 70 71 75 77 78 81 82 83 84 85 86 87 89 90 91 92 93 94 95 96 100 101 102 103 104 105 106 107 108 109 111 112 113 114 115 117 118 120 124 126 130 131 132 133 134 135 136 137 139 141 142 143 144 145 146 147 148 149 150 152 154 155 156 157 160 161 162 163 164 165 166 167 168 169 170 171 172 174 175 176 177 
 
Rows of  50 % top density data points: 1 6 8 10 11 12 13 16 20 21 22 23 24 25 27 29 32 33 34 35 37 38 41 42 43 44 45 47 48 49 50 52 54 55 57 58 59 65 68 71 75 81 82 86 87 91 93 94 96 101 102 103 104 105 106 107 109 112 113 117 118 120 130 132 134 136 139 141 142 144 146 148 149 150 152 155 156 160 161 162 163 164 167 168 172 174 175 176 177 
 
Rows of  25 % top density data points: 1 10 11 13 16 20 21 23 27 29 33 34 35 41 42 44 48 52 54 55 57 65 68 75 81 93 94 96 103 105 106 107 109 120 132 134 141 142 146 148 149 156 168 176 177 
 

     PLEASE NOTE:  As of version 0.3-5, no degenerate (zero area) 
     regions are returned with the "Qt" option since the R 
     code removes them from the triangulation. 
     See help("delaunayn").


An S4 object of class "pdfCluster"

Call: pdfCluster(x = x)

Initial groupings: 
 label    1   2   3  NA 
 count   29  15  17 117 

Final groupings: 
 label   1  2  3 
 count  62 63 53 

Groups tree (here 'h' denotes 'height'):
--[dendrogram w/ 1 branches and 3 members at h = 1]
  `--[dendrogram w/ 2 branches and 3 members at h = 0.361]
     |--leaf "1 " 
     `--[dendrogram w/ 2 branches and 2 members at h = 0.333]
        |--leaf "2 " (h= 0.0556  )
        `--leaf "3 " (h= 0.0694  )
press <enter> to continue...
press <enter> to continue...
   gr
    Barolo Grignolino Barbera
  1     58          4       0
  2      1         62       0
  3      0          5      48
An S4 object of class "kepdf"

The highest density data point has position 57 in the sample data 
 
Rows of  75 % top density data points: 1 2 3 4 5 6 7 8 9 10 11 12 13 16 17 18 19 20 21 22 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 47 48 49 50 52 53 54 55 56 57 58 59 64 68 77 81 82 83 84 85 86 87 88 89 90 91 92 93 94 98 99 101 102 103 104 105 107 108 109 110 112 114 115 117 118 120 121 126 127 129 132 133 134 135 136 139 140 141 142 143 144 145 146 148 149 150 151 152 154 155 156 157 159 160 161 162 163 164 165 166 167 168 169 171 172 173 174 175 176 177 178 
 
Rows of  50 % top density data points: 1 6 7 8 9 10 11 12 13 16 17 18 19 21 23 24 25 27 28 29 30 32 33 35 36 37 38 39 41 43 45 48 49 50 52 53 54 55 56 57 58 59 64 68 82 83 86 87 88 89 90 91 92 93 98 99 102 103 104 105 107 108 112 115 117 126 132 133 134 139 140 141 143 148 149 150 156 157 161 162 163 164 165 166 168 171 173 174 175 
 
Rows of  25 % top density data points: 1 6 7 10 12 13 16 17 18 21 23 24 25 27 28 30 32 35 36 38 41 45 48 49 54 55 57 58 59 91 92 93 105 107 108 117 126 132 141 149 163 164 165 173 175 
 
An S4 object of class "pdfCluster"

Call: pdfCluster(x = x, bwtype = "adaptive")

Initial groupings: 
 label    1   2   3   4   5   6  NA 
 count    5   2  10   4   5   3 149 

Final groupings: 
 label   1  2  3  4  5  6 
 count  35 32 36 40 20 15 

Groups tree (here 'h' denotes 'height'):
--[dendrogram w/ 1 branches and 6 members at h = 1]
  `--[dendrogram w/ 2 branches and 6 members at h = 0.611]
     |--[dendrogram w/ 1 branches and 4 members at h = 0.319]
     |  `--[dendrogram w/ 3 branches and 4 members at h = 0.278]
     |     |--[dendrogram w/ 2 branches and 2 members at h = 0.0417]
     |     |  |--leaf "2 " (h= 0.0139  )
     |     |  `--leaf "1 " 
     |     |--leaf "4 " (h= 0.125  )
     |     `--leaf "3 " (h= 0.111  )
     `--[dendrogram w/ 2 branches and 2 members at h = 0.319]
        |--leaf "5 " (h= 0.153  )
        `--leaf "6 " (h= 0.181  )
press <enter> to continue...
press <enter> to continue...

pdfCluster documentation built on May 29, 2017, 9:08 p.m.