
This package performs cluster analysis via kernel density estimation (Azzalini and Torelli, 2007; Menardi and Azzalini, 2014). Clusters are associated with the maximally connected components of the regions where the estimated density exceeds a threshold. As the threshold varies, these clusters may be arranged in a hierarchical structure, represented as a tree. When data dimensionality is low to moderate, the connected regions are detected by means of the Delaunay tessellation, following Azzalini and Torelli (2007). For higher-dimensional data, they are detected according to the procedure described in Menardi and Azzalini (2014). In both cases, once a number of high-density cluster cores have been identified, lower-density data are allocated to them by a supervised classification-like approach. The number of clusters, which corresponds to the number of modes of the estimated density, is selected automatically by the procedure. Diagnostic methods for evaluating the quality of the clustering are also available (Menardi, 2011). Moreover, the package provides a routine to estimate the probability density function by kernel methods, given a set of data of arbitrary dimension. The main features of the package are described and illustrated in Azzalini and Menardi (2014).
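
The idea of counting connected high-density regions as the threshold varies can be illustrated in one dimension with base R alone. The following is only a minimal sketch under stated assumptions: the data, the bandwidth `bw = 0.5`, the thresholds, and the helper `count_components` are all hypothetical choices made for illustration, and the code does not reproduce the package's own procedures (Delaunay tessellation or the higher-dimensional method).

```r
# One-dimensional illustration only (base R, not the pdfCluster algorithm):
# count the connected regions where a kernel density estimate exceeds a threshold.

# Deterministic two-group sample built from normal quantiles (hypothetical data).
x <- c(qnorm(ppoints(100), mean = -3), qnorm(ppoints(100), mean = 3))

# Gaussian kernel density estimate on a regular grid; bw = 0.5 is an arbitrary choice.
dens <- density(x, bw = 0.5, n = 512)

# Hypothetical helper: number of maximal runs of consecutive grid points
# whose estimated density lies above `threshold`.
count_components <- function(y, threshold) {
  runs <- rle(y > threshold)
  sum(runs$values)
}

# Evaluate a low, an intermediate and a high threshold (assumed values).
thresholds <- c(0.005, 0.05, 0.3)
n_components <- sapply(thresholds, function(k) count_components(dens$y, k))
print(data.frame(threshold = thresholds, components = n_components))
```

With these assumed data, the lowest threshold yields a single connected region, the intermediate one separates the two modes, and the highest lies above the estimated density everywhere. Recording how regions split as the threshold rises is what gives the tree-like structure described above.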

The `pdfCluster` package makes use of classes and methods of the S4 system. It includes some foreign functions written in the C language: two of them compute the kernel density estimate of data and are interfaced by the R function `kepdf`. Other C routines included in the package allow for a quicker detection of the connected components of the subgraphs associated with the level sets of the data. Two of them are directly drawn from the routines of the same name in the `spdep` package.

Starting from version 1.0-0, new features have been introduced:

kernel density estimation may be performed by using either a fixed or an adaptive bandwidth; moreover, the option of selecting a Student's *t* kernel has been included, for computational convenience;

detection of connected components of the level sets is performed by means of the Delaunay triangulation when data dimensionality is up to 6, following Azzalini and Torelli (2007); for higher-dimensional data a new, less time-consuming procedure is now adopted (Menardi and Azzalini, 2014);

the order of classification of lower-density data now also depends on the standard error of the estimated density ratios; moreover, a cluster-specific bandwidth is the default option to classify low-density data.
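
The difference between a fixed and an adaptive bandwidth, mentioned in the first item above, can be sketched in one dimension with base R. This is a generic illustration of one common construction (a pilot estimate, with observation-specific bandwidths that grow where the pilot density is low); it is not the package's C implementation, and the data, the helper `kde`, and the sensitivity value `alpha = 0.5` are assumptions made for the example.

```r
# Generic sketch of a fixed versus an adaptive (sample-point) bandwidth in one
# dimension, in base R; the package's own implementation is in C and may differ.

# Hypothetical sample: a tight group plus a sparse, spread-out group.
x <- c(qnorm(ppoints(80)), 6 + 3 * qnorm(ppoints(20)))

# Hypothetical helper: Gaussian kernel estimate at the points `at`;
# `h` is either a single bandwidth or one bandwidth per observation.
kde <- function(at, x, h) {
  h <- rep_len(h, length(x))
  sapply(at, function(a) mean(dnorm((a - x) / h) / h))
}

h_fixed <- bw.nrd0(x)                      # a common rule-of-thumb bandwidth
pilot <- kde(x, x, h_fixed)                # pilot estimate at the observations
g <- exp(mean(log(pilot)))                 # geometric mean of the pilot values
alpha <- 0.5                               # sensitivity parameter (assumed value)
h_adapt <- h_fixed * (pilot / g)^(-alpha)  # wider kernels where data are sparse

# Evaluate both estimates on a grid that comfortably covers the data.
pad <- 5 * max(h_adapt)
grid <- seq(min(x) - pad, max(x) + pad, length.out = 2000)
f_fixed <- kde(grid, x, h_fixed)
f_adapt <- kde(grid, x, h_adapt)
```

In this sketch the observation with the lowest pilot density receives a bandwidth larger than `h_fixed`, and the one with the highest pilot density a smaller one, so the adaptive estimate smooths more in the sparse group and less in the dense one.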

See the examples below for how to set the arguments of the main function of the package, `pdfCluster`, in order to reproduce the results obtained with versions 0.1-x.

Adelchi Azzalini, Giovanna Menardi, Tiziana Rosolin

Maintainer: Giovanna Menardi <menardi at stat.unipd.it>

Azzalini, A., Menardi, G. (2014). Clustering via Nonparametric Density Estimation: The R Package pdfCluster.
*Journal of Statistical Software*, 57(11), 1-26,
URL http://www.jstatsoft.org/v57/i11/.

Azzalini, A., Torelli, N. (2007). Clustering via nonparametric density estimation.
*Statistics and Computing*, 17, 71-80.

Menardi, G. (2011). Density based Silhouette diagnostics for clustering methods.
*Statistics and Computing*, 21, 295-308.

Menardi, G., Azzalini, A. (2014). An advancement in clustering via nonparametric density estimation.
*Statistics and Computing*, DOI: 10.1007/s11222-013-9400-x, to appear.

```
library(pdfCluster)

# load data
data(wine)
gr <- wine[, 1]
# select a subset of variables
x <- wine[, c(2, 5, 8)]

# density estimation
pdf <- kepdf(x)
summary(pdf)
plot(pdf)

# clustering
cl <- pdfCluster(x)
summary(cl)
plot(cl)

# comparison with original groups
table(groups(cl), gr)

# density-based silhouette diagnostics
dsil <- dbs(cl)
plot(dsil)

##########
# higher dimensions
x <- wine[, -1]

# density estimation with adaptive bandwidth
pdf <- kepdf(x, bwtype = "adaptive")
summary(pdf)
# the density plot is not very clear for high-dimensional data:
# select a few variables
plot(pdf, indcol = c(1, 4, 7))

# clustering
# when dimension is > 6, the default method to find connected components is "pairs"
# density is better estimated by using an adaptive bandwidth
cl <- pdfCluster(x, bwtype = "adaptive")
summary(cl)
plot(cl)

########
# this example shows how to set the arguments in function pdfCluster
# in order to obtain the same results as the ones of versions 0.1-x.
x <- wine[, c(2, 5, 8)]
# previous versions of the package
# do not run
# old code:
# cl <- pdfCluster(x)
# the same result is now obtained as follows:
cl <- pdfCluster(x, se = FALSE, hcores = TRUE, graphtype = "delaunay", n.grid = 50)
```

```
pdfCluster 1.0-2
PLEASE NOTE: New features have been introduced in version 1.0-0
These involve some changes in the package options
see "help("pdfCluster-package")" for the setting which
reproduce the functioning of the previous versions.
An S4 object of class "kepdf"
The highest density data point has position 52 in the sample data
Rows of 75 % top density data points: 1 3 4 5 6 8 10 11 12 13 16 18 20 21 22 23 24 25 27 28 29 30 32 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 49 50 52 54 55 56 57 58 59 62 65 66 67 68 70 71 75 77 78 81 82 83 84 85 86 87 89 90 91 92 93 94 95 96 100 101 102 103 104 105 106 107 108 109 111 112 113 114 115 117 118 120 124 126 130 131 132 133 134 135 136 137 139 141 142 143 144 145 146 147 148 149 150 152 154 155 156 157 160 161 162 163 164 165 166 167 168 169 170 171 172 174 175 176 177
Rows of 50 % top density data points: 1 6 8 10 11 12 13 16 20 21 22 23 24 25 27 29 32 33 34 35 37 38 41 42 43 44 45 47 48 49 50 52 54 55 57 58 59 65 68 71 75 81 82 86 87 91 93 94 96 101 102 103 104 105 106 107 109 112 113 117 118 120 130 132 134 136 139 141 142 144 146 148 149 150 152 155 156 160 161 162 163 164 167 168 172 174 175 176 177
Rows of 25 % top density data points: 1 10 11 13 16 20 21 23 27 29 33 34 35 41 42 44 48 52 54 55 57 65 68 75 81 93 94 96 103 105 106 107 109 120 132 134 141 142 146 148 149 156 168 176 177
PLEASE NOTE: As of version 0.3-5, no degenerate (zero area)
regions are returned with the "Qt" option since the R
code removes them from the triangulation.
See help("delaunayn").
An S4 object of class "pdfCluster"
Call: pdfCluster(x = x)
Initial groupings:
 label  1  2  3  NA
 count 29 15 17 117
Final groupings:
 label  1  2  3
 count 62 63 53
Groups tree (here 'h' denotes 'height'):
 --[dendrogram w/ 1 branches and 3 members at h = 1]
  `--[dendrogram w/ 2 branches and 3 members at h = 0.361]
     |--leaf "1 "
     `--[dendrogram w/ 2 branches and 2 members at h = 0.333]
        |--leaf "2 " (h= 0.0556 )
        `--leaf "3 " (h= 0.0694 )
press <enter> to continue...
press <enter> to continue...
   gr
    Barolo Grignolino Barbera
  1     58          4       0
  2      1         62       0
  3      0          5      48
An S4 object of class "kepdf"
The highest density data point has position 57 in the sample data
Rows of 75 % top density data points: 1 2 3 4 5 6 7 8 9 10 11 12 13 16 17 18 19 20 21 22 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 47 48 49 50 52 53 54 55 56 57 58 59 64 68 77 81 82 83 84 85 86 87 88 89 90 91 92 93 94 98 99 101 102 103 104 105 107 108 109 110 112 114 115 117 118 120 121 126 127 129 132 133 134 135 136 139 140 141 142 143 144 145 146 148 149 150 151 152 154 155 156 157 159 160 161 162 163 164 165 166 167 168 169 171 172 173 174 175 176 177 178
Rows of 50 % top density data points: 1 6 7 8 9 10 11 12 13 16 17 18 19 21 23 24 25 27 28 29 30 32 33 35 36 37 38 39 41 43 45 48 49 50 52 53 54 55 56 57 58 59 64 68 82 83 86 87 88 89 90 91 92 93 98 99 102 103 104 105 107 108 112 115 117 126 132 133 134 139 140 141 143 148 149 150 156 157 161 162 163 164 165 166 168 171 173 174 175
Rows of 25 % top density data points: 1 6 7 10 12 13 16 17 18 21 23 24 25 27 28 30 32 35 36 38 41 45 48 49 54 55 57 58 59 91 92 93 105 107 108 117 126 132 141 149 163 164 165 173 175
An S4 object of class "pdfCluster"
Call: pdfCluster(x = x, bwtype = "adaptive")
Initial groupings:
 label 1 2  3 4 5 6  NA
 count 5 2 10 4 5 3 149
Final groupings:
 label  1  2  3  4  5  6
 count 35 32 36 40 20 15
Groups tree (here 'h' denotes 'height'):
 --[dendrogram w/ 1 branches and 6 members at h = 1]
  `--[dendrogram w/ 2 branches and 6 members at h = 0.611]
     |--[dendrogram w/ 1 branches and 4 members at h = 0.319]
     |  `--[dendrogram w/ 3 branches and 4 members at h = 0.278]
     |     |--[dendrogram w/ 2 branches and 2 members at h = 0.0417]
     |     |  |--leaf "2 " (h= 0.0139 )
     |     |  `--leaf "1 "
     |     |--leaf "4 " (h= 0.125 )
     |     `--leaf "3 " (h= 0.111 )
     `--[dendrogram w/ 2 branches and 2 members at h = 0.319]
        |--leaf "5 " (h= 0.153 )
        `--leaf "6 " (h= 0.181 )
press <enter> to continue...
press <enter> to continue...
```
