pdfCluster  R Documentation 
Cluster analysis is performed by the density-based procedures described in Azzalini and Torelli (2007) and Menardi and Azzalini (2014), and summarized in Azzalini and Menardi (2014).
## S4 method for signature 'numeric'
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, grid.pairs = 10,
    n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)

## S4 method for signature 'matrix'
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, grid.pairs = 10,
    n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)

## S4 method for signature 'data.frame'
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, grid.pairs = 10,
    n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)

## S4 method for signature 'pdfCluster'
pdfCluster(x, graphtype, hmult, Q, lambda = 0.1, grid.pairs,
    n.grid = min(round((5 + sqrt(NROW(x@x))) * 4), NROW(x@x)), ...)
x 
A vector, a matrix or a data frame of numeric data to be partitioned.
Since density-based clustering is designed for continuous data only,
if discrete data are provided, a warning message is displayed.
Alternatively, 
graphtype 
One of "unidimensional", "delaunay" or "pairs"; it defines the procedure used
to build the graph associated with the data. If missing, a "delaunay" graph is
built for data of dimension less than 7; otherwise a "pairs" graph is built.
See details below. This argument has not to be set when 
hmult 
A shrink factor to be multiplied by the smoothing parameter 
Q 
Optional arguments to be given when 
lambda 
Tolerance threshold to be used when 
grid.pairs 
When 
n.grid 
Defines the length of the grid on which the connected components of the density level sets are evaluated. The default value is the minimum between the number of data rows n and \lfloor (5 + \sqrt{n}) \cdot 4 + 0.5 \rfloor, an empirical rule of thumb under which the length of the grid grows with the square root of the number of data rows.
... 
Further arguments to be passed to 
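The default for n.grid can be reproduced from the expression shown in the usage section. The following is a minimal sketch in base R; the helper name default_n_grid is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of pdfCluster): the default grid length
# min(round((5 + sqrt(n)) * 4), n) from the usage section, as a function
# of the number of data rows n.
default_n_grid <- function(n) min(round((5 + sqrt(n)) * 4), n)

n_rows <- c(10, 100, 1000, 10000)
grid_lengths <- sapply(n_rows, default_n_grid)
print(data.frame(n = n_rows, n.grid = grid_lengths))
```

For very small samples the grid length is capped at n; for larger samples it grows with the square root of n.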
Clusters are associated to the connected components of the level sets of the density underlying the data. Density estimation is performed by kernel methods and the connected regions are approximated by the connected components of a graph built on data. Three alternative procedures to build the graph are adopted:
"unidimensional": when the data are univariate, an edge is set between two observations when all the data points included in the segment between the two candidate observations belong to the same level set.
"delaunay": an edge is set between two observations when they are contiguous in the Voronoi diagram; see Azzalini and Torelli (2007).
"pairs": an edge is set between two observations when the density function, evaluated along the segment joining them, does not exhibit any valley having a relative amplitude greater than a tolerance threshold 0 ≤ λ < 1. Being a tolerance threshold, sensible values of λ are, in practice, included in [0, 0.3]; see Menardi and Azzalini (2014).
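The idea of checking for a valley along the segment joining two observations can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the functions kde, valley_ratio and has_edge are hypothetical names, the Gaussian product kernel with a fixed bandwidth is a stand-in for the package's density estimate, and the valley measure (depth of the deepest dip relative to the highest density on the segment) is a simplified proxy rather than the definition used in Menardi and Azzalini (2014).

```r
# Gaussian product-kernel density estimate at the rows of `pts`,
# computed from data `x` (n x d matrix) with per-coordinate bandwidths `h`.
kde <- function(pts, x, h) {
  apply(pts, 1, function(p) {
    z <- sweep(sweep(x, 2, p), 2, h, "/")   # standardized differences
    mean(exp(-0.5 * rowSums(z^2))) / prod(h * sqrt(2 * pi))
  })
}

# Illustrative valley measure (an assumption, not the package's definition):
# depth of the deepest dip below the lower of the two surrounding peaks,
# relative to the highest density value found on the segment.
valley_ratio <- function(f) {
  left  <- cummax(f)
  right <- rev(cummax(rev(f)))
  max(pmin(left, right) - f) / max(f)
}

# An edge is kept when the valley measure does not exceed lambda.
has_edge <- function(a, b, x, h, lambda = 0.1, grid.pairs = 10) {
  s <- seq(0, 1, length.out = grid.pairs)
  pts <- outer(1 - s, a) + outer(s, b)      # points on the segment from a to b
  valley_ratio(kde(pts, x, h)) <= lambda
}

# Deterministic toy data: two well-separated 5 x 5 lattices of points.
lattice <- as.matrix(expand.grid(seq(-1, 1, by = 0.5), seq(-1, 1, by = 0.5)))
x <- rbind(lattice, lattice + 5)
h <- c(0.5, 0.5)

within_group  <- has_edge(c(0, 0), c(0.5, 0.5), x, h)
between_group <- has_edge(c(0, 0), c(5, 5), x, h)
print(c(within = within_group, between = between_group))
```

In this toy setting the two points inside the same lattice are joined, while the segment crossing the empty region between the lattices shows a deep valley and no edge is set.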
As the level set varies, the number of detected components gives rise to the
tree of clusters, where each leaf corresponds to a mode of the density
function. Observations around these modes form a number of cluster cores,
while the lower density observations are allocated according to a
classification procedure; see also pdfClassification.
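How the number of connected components changes with the density level can be sketched for univariate data in base R. This is an illustration under stated assumptions, not the package's implementation: the fixed-bandwidth Gaussian kernel estimate, the bandwidth h = 0.5, the toy data and the helper count_components are all hypothetical choices. For sorted univariate data, the components of a level set correspond to runs of consecutive observations whose estimated density is at or above the level.

```r
# Toy univariate sample: two dense groups and one isolated point between them.
x <- sort(c(seq(-1, 1, by = 0.25), 2.5, seq(4, 6, by = 0.25)))
h <- 0.5                                   # arbitrary bandwidth (assumption)

# Fixed-bandwidth Gaussian kernel density estimate at the observations.
f <- sapply(x, function(p) mean(dnorm(p, mean = x, sd = h)))

# Number of connected components at a given level: runs of consecutive
# sorted observations whose density is at or above the level.
count_components <- function(level, f_sorted) {
  runs <- rle(f_sorted >= level)
  sum(runs$values)
}

# Scan a grid of levels, from 0 up to the highest estimated density.
levels <- seq(0, max(f), length.out = 20)
mode_function <- sapply(levels, count_components, f_sorted = f)
print(mode_function)
```

At low levels all observations form one component; once the level rises above the density of the isolated point, the two dense groups separate into two components, one per mode.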
An S4 object of pdfCluster-class with slots:
call 
The matched call. 
x 
The matrix of data input. If a vector of data is provided as input, a one-column matrix is returned as output.
pdf 
An object of class
See 
nc 
An object of class

graph 
An object of class

cluster.cores 
A vector with the same length as 
tree 
Cluster tree with leaves corresponding to the connected
components associated to different sections of the density estimate.
The object is of class 
noc 
Number of clusters. 
stages 
List with elements corresponding to the data allocation to
groups at the different stages of the classification procedure.

clusters 
Set to 
signature(x="data.frame")
This method applies the pdfCluster procedure to objects of class data.frame.
signature(x="matrix")
This method applies the pdfCluster procedure to objects of class matrix.
signature(x="numeric")
This method applies the pdfCluster procedure to objects of class numeric.
signature(x="pdfCluster")
This method applies to objects of pdfCluster-class when the graph
has been built according to the "pairs" procedure. It saves time and
computation when the user wants to compare the results of cluster analysis for
different values of the lambda parameter. See examples below.
It may happen that the variability of the estimated density is so high that not
all jumps in the mode function can be detected by the selected grid scanning
the density function. In that case, no output is produced and a message is displayed.
As this may be associated with the occurrence of some spurious connected components,
which appear and disappear within the range between two subsequent values of the grid,
a natural solution is to increase the value of n.grid.
Alternatively, either lambda or hmult may be increased to reduce
the chance of detecting spurious connected components.
Using graphtype = "delaunay" when the dimensionality d of the data is
greater than 6 is highly time-consuming unless the number of rows n
is fairly small, since the number of operations needed to run the Delaunay triangulation
grows exponentially with d.
Use graphtype = "pairs" instead, whose computational complexity grows quadratically
with the number of observations.
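The quadratic growth of the "pairs" procedure can be made concrete by counting pairs of observations. The sketch below assumes, for illustration only, that every pair of observations is examined and that the density is evaluated at grid.pairs points along each segment; the helper pairs_cost is a hypothetical name and the counts are not measurements of the package's actual running time.

```r
# Hypothetical cost model (an assumption): one segment per pair of
# observations, with grid.pairs density evaluations along each segment.
pairs_cost <- function(n, grid.pairs = 10) {
  n_pairs <- choose(n, 2)
  c(pairs = n_pairs, evaluations = n_pairs * grid.pairs)
}

print(pairs_cost(100))
print(pairs_cost(1000))
```

Under this model a tenfold increase in the number of observations raises the number of pairs roughly a hundredfold, independently of the dimension d.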
Azzalini, A., Menardi, G. (2014). Clustering via nonparametric density estimation: the R package pdfCluster. Journal of Statistical Software, 57(11), 1-26, URL http://www.jstatsoft.org/v57/i11/.
Azzalini, A., Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17, 71-80.
Menardi, G., Azzalini, A. (2014). An advancement in clustering via nonparametric density estimation. Statistics and Computing. DOI: 10.1007/s11222-013-9400-x.
kepdf, pdfCluster-class, pdfClassification.
##########
# example 1
##########
# not run here for time reasons

# loading data
data(oliveoil)

# preparing data
olive1 <- 1 + oliveoil[, 3:10]
margin <- apply(data.matrix(olive1), 1, sum)
olive1 <- olive1 / margin
alr <- (-log(olive1[, -4] / olive1[, 4]))

# select the first 5 principal components
x <- princomp(alr, cor = TRUE)$scores[, 1:5]

# clustering
# not run here for time reasons
# cl <- pdfCluster(x, h = h.norm(x), hmult = 0.75)
# summary(cl)
# plot(cl)

# comparing groups with original macro-area membership
# groups <- groups(cl)
# table(oliveoil$macro.area, groups)

# cluster cores
# table(groups(cl, stage = 0))

##########
# example 2
##########
# not run here for time reasons

# loading data
# data(wine)
# x <- wine[, -1]
# gr <- wine[, 1]

# when data are high-dimensional, an adaptive kernel estimator is preferable
# building the Delaunay graph entails too high a computational effort
# use option "pairs" to build the graph
# it is the default option for dimension > 6

# cl <- pdfCluster(x, graphtype = "pairs", bwtype = "adaptive")
# summary(cl)
# plot(cl)

# comparison with original groups
# table(groups(cl), gr)

# a better classification is obtained with a larger value of lambda
# it is not necessary to run the whole procedure again:
# a pdfCluster method applies to pdfCluster-class objects

# cl1 <- pdfCluster(cl, lambda = 0.25)
# table(gr, groups(cl1))