distrsimilarity: Similarity of within-cluster distributions to normal and...

View source: R/cquality20.R

distrsimilarity    R Documentation

Similarity of within-cluster distributions to normal and uniform

Description

Two measures of dissimilarity between the within-cluster distributions of a dataset and a normal or uniform distribution. For the normal, it is the Kolmogorov distance between the distribution of the Mahalanobis distances to the cluster center and a chi-squared distribution. For the uniform, it is the Kolmogorov distance between the distribution of the distances to the kth nearest within-cluster neighbour and a Gamma distribution (based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
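The normal measure can be sketched in base R as follows. This is a rough illustration only, not the code used in fpc: the helper name kd_normal_sketch is hypothetical, and it assumes that the "appropriate" chi-squared distribution has as many degrees of freedom as x has columns and that the Mahalanobis distances are compared in squared form.

```r
# Hypothetical helper, not part of fpc: size-weighted Kolmogorov distance
# between squared within-cluster Mahalanobis distances and chi-squared(p).
kd_normal_sketch <- function(x, clustering) {
  x <- as.matrix(x)
  p <- ncol(x)
  labels <- sort(unique(clustering))
  kd <- sapply(labels, function(k) {
    xk <- x[clustering == k, , drop = FALSE]
    # stats::mahalanobis() returns squared distances
    d2 <- mahalanobis(xk, colMeans(xk), cov(xk))
    # Kolmogorov distance between the empirical CDF of d2 and pchisq(., df = p)
    # (assumption: p degrees of freedom)
    unname(ks.test(d2, "pchisq", df = p)$statistic)
  })
  sizes <- as.vector(table(factor(clustering, levels = labels)))
  # aggregate the clusterwise values, weighting by cluster size
  list(clusterwise = kd, aggregated = sum(sizes * kd) / sum(sizes))
}

# Made-up data: two well-separated Gaussian clusters of 100 points each
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))
cl <- rep(1:2, each = 100)
res <- kd_normal_sketch(x, cl)
```

Because the cluster mean and covariance are estimated from the same points, the chi-squared reference is only approximate in small clusters; values from this sketch need not match kdnorm exactly.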

Usage

distrsimilarity(x, clustering, noisecluster = FALSE,
                distribution = c("normal", "uniform"), nnk = 2,
                largeisgood = FALSE, messages = FALSE)

Arguments

x

the data matrix; a numerical object which can be coerced to a matrix.

clustering

integer vector of class numbers; length must equal nrow(x), numbers must go from 1 to the number of clusters.

noisecluster

logical. If TRUE, the cluster with the largest number is ignored for the computations.

distribution

vector of "normal", "uniform" or both. Indicates which of the two dissimilarities is/are computed.

nnk

integer. The k for the kth nearest within-cluster neighbor whose distance is used for the dissimilarity to the uniform.

largeisgood

logical. If TRUE, dissimilarities are transformed to 1-d (this means that larger values indicate a better fit).

messages

logical. If TRUE, warnings are given if within-cluster covariance matrices are not invertible (in which case all within-cluster Mahalanobis distances are set to zero).
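As a rough illustration of what noisecluster and largeisgood mean, the following base R lines mimic the two operations on made-up values. The labels and dissimilarities are invented for the example, and the lines only mirror the argument descriptions above; they are not taken from the fpc source.

```r
# Made-up labels; cluster 3 plays the role of the noise cluster
clustering <- c(1, 1, 2, 2, 3, 3, 3)
# noisecluster = TRUE: points in the cluster with the largest number are ignored
keep <- clustering != max(clustering)
clustering[keep]

# Made-up cluster-wise Kolmogorov distances (always between 0 and 1)
d <- c(0.10, 0.25)
# largeisgood = TRUE: report 1 - d, so that larger values indicate a better fit
fit <- 1 - d
```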

Value

List with the following components:

kdnorm

Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).

kdunif

Kolmogorov distance between distribution of distances to nnkth nearest within-cluster neighbor and appropriate Gamma-distribution, see Byers and Raftery (1998), aggregated over clusters.

kdnormc

vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution.

kdunifc

vector of cluster-wise Kolmogorov distances between distribution of distances to nnkth nearest within-cluster neighbor and appropriate Gamma-distribution.

xmahal

vector of Mahalanobis distances to the respective cluster center.

xdknn

vector of distances to the nnkth nearest within-cluster neighbor.
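For orientation, quantities analogous to xdknn, kdunifc and kdunif can be approximated by hand in base R. The sketch below is an assumption-laden illustration, not the fpc implementation: the helper name kd_uniform_sketch is hypothetical, and it relies on the result behind Byers and Raftery (1998) that, for a homogeneous Poisson process in p dimensions, the pth power of the distance to the kth nearest neighbour follows a Gamma distribution with shape k. The rate is estimated here by maximum likelihood within each cluster; how fpc parametrizes the Gamma distribution is not confirmed by this page.

```r
# Hypothetical helper, not part of fpc.
kd_uniform_sketch <- function(x, clustering, nnk = 2) {
  x <- as.matrix(x)
  p <- ncol(x)
  labels <- sort(unique(clustering))
  kd <- sapply(labels, function(k) {
    dm <- as.matrix(dist(x[clustering == k, , drop = FALSE]))
    # Sorted row: position 1 is the zero self-distance, so position nnk + 1
    # is the distance to the nnk-th nearest within-cluster neighbour.
    dk <- apply(dm, 1, function(r) sort(r)[nnk + 1])
    y <- dk^p
    # Under a homogeneous Poisson process, y ~ Gamma(shape = nnk, rate);
    # nnk / mean(y) is the maximum likelihood estimate of the rate.
    unname(ks.test(y, "pgamma", shape = nnk, rate = nnk / mean(y))$statistic)
  })
  sizes <- as.vector(table(factor(clustering, levels = labels)))
  # aggregate the clusterwise values, weighting by cluster size
  list(clusterwise = kd, aggregated = sum(sizes * kd) / sum(sizes))
}

# Made-up data: 200 points uniform on the unit square, treated as one cluster
set.seed(2)
u <- matrix(runif(400), ncol = 2)
res_u <- kd_uniform_sketch(u, rep(1, 200), nnk = 2)
```

Each cluster needs more than nnk points for the sorted distance at position nnk + 1 to exist. Points near the boundary of a cluster have systematically larger neighbour distances, so even truly uniform data will not give a distance of zero.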

Note

It is very hard to capture similarity to a multivariate normal or uniform distribution in a single value, and both measures used here have their shortcomings. In particular, the dissimilarity to the uniform can still indicate a good fit if the cluster has holes or if the distribution is uniform on several disconnected sets.

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282

See Also

cqcluster.stats, cluster.stats for more cluster validity statistics.

Examples

  library(fpc)  # provides rFace and distrsimilarity
  set.seed(20000)
  options(digits=3)
  # simulated two-dimensional "face" data
  face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
  km3 <- kmeans(face,3)
  distrsimilarity(face,km3$cluster)

fpc documentation built on Sept. 24, 2024, 9:07 a.m.