In lukadw11/Clusty: Evaluate clustering using distance heat maps

Clusty

The clusty package evaluates distance-based clustering with non-overlapping cluster membership in the R programming language. Specifically, it was designed for assessing k-means clustering with distance matrix heat maps, since the k-means objective function is based on distance. Effective clustering is thus where instances within a cluster are highly similar (small distance) and instances in different clusters are strongly differentiated (large distance). In the heat map of cluster distances produced by the summaryheat function, a highly ranked diagonal square corresponds to strong intra-cluster homogeneity, while a highly ranked square in the upper or lower triangle corresponds to strong inter-cluster heterogeneity. The distance metrics used in this heat map can be extracted using bigextract. bigheat uses condensed instance vectors to visualize the clustering (i.e. intra-cluster homogeneity and inter-cluster heterogeneity) at the instance level. This provides a granular look at how well differentiated instances are within and between clusters, and permits row reduction of large datasets into condensed instance vectors.
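To make the idea concrete, here is a minimal base R sketch of an instance-level distance heat map ordered by cluster. It is not clusty's own code: the toy data and variable names are assumptions, and only `kmeans`, `dist`, and `image` from base R are used.

```r
# Toy data: three well-separated 2-D groups (illustrative only).
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

km <- kmeans(x, centers = 3, nstart = 10)

# Sort instances by cluster label so each cluster forms a contiguous block.
ord <- order(km$cluster)
d <- as.matrix(dist(x[ord, ]))  # Euclidean distances between all instances

# With heat.colors, small distances are drawn in red and large distances
# in pale yellow, so homogeneous clusters appear as red diagonal blocks.
image(d, col = heat.colors(50), axes = FALSE,
      main = "Instance-level distance heat map (sketch)")
```

Squares along the diagonal hold intra-cluster distances, and the remaining squares hold inter-cluster distances.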

bigheat: Say you completed k-means clustering on a data set with comparable numeric features. K-means groups the instance vectors into k groups (you choose k) that minimize the within-cluster sum of squares (WCSS). The choice of k may be guided by a WCSS elbow plot prior to clustering. However, such a choice of k provides no information about the structure of the resulting clustering. How alike are instances within each cluster? How different are instances between clusters? Roughly what proportion of a cluster is "very similar", i.e. how consistent is the similarity within clusters? bigheat produces a distance matrix heat map at the instance level to help answer these questions heuristically. The function can handle large data sets by condensing instance vectors into aggregated summary instances (e.g. 10,000 customers with merge=10 produces 1,000 summarized vectors). This permits high-dimensional visualization. The output is a sexy visual that can be used as a comprehensive heuristic evaluation tool or your new desktop background.
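The WCSS elbow plot mentioned above can be sketched with base R alone. This is an illustration on assumed toy data, not part of clusty; `tot.withinss` is the total within-cluster sum of squares returned by `kmeans`.

```r
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8.
ks <- 1:8
wcss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Look for the k after which additional clusters give only small reductions.
plot(ks, wcss, type = "b", xlab = "k", ylab = "Total WCSS",
     main = "WCSS elbow plot (sketch)")
```

The plot suggests a value of k but says nothing about how consistent each resulting cluster is, which is the gap the heat maps are meant to fill.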

bigheat_samples
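The row condensing described for bigheat can be sketched as follows. This README does not say how bigheat aggregates rows, so averaging consecutive rows within each cluster is an assumption, and `condense` and `merge_size` are hypothetical names (the latter mirrors the merge=10 example).

```r
set.seed(1)
x <- matrix(rnorm(1000 * 4), ncol = 4)  # 1,000 toy instances, 4 features
km <- kmeans(x, centers = 3, nstart = 5)

merge_size <- 10  # hypothetical name, mirroring the merge=10 example

# Average consecutive groups of `size` rows (assumed aggregation rule).
condense <- function(rows, size) {
  grp <- ceiling(seq_len(nrow(rows)) / size)
  rowsum(rows, grp) / as.vector(table(grp))
}

# Condense each cluster separately so summary rows never mix clusters.
parts <- lapply(split(seq_len(nrow(x)), km$cluster), function(idx) {
  condense(x[idx, , drop = FALSE], merge_size)
})
condensed <- do.call(rbind, parts)
condensed_cluster <- rep(names(parts), vapply(parts, nrow, integer(1)))

dim(condensed)  # roughly nrow(x) / merge_size rows
```

A distance matrix on `condensed` is about 100 times smaller than one on `x`, which is what makes an instance-level heat map feasible for large data sets.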

summaryheat: summaryheat uses the underlying structure of the large distance matrix behind bigheat to visually summarize the level of differentiation between clusters and similarity within clusters. Which clusters have instances that are the most similar (i.e. the smallest WCSS)? Which clusters are most differentiated (i.e. in comparing instances in clusters X and Y, how different are they overall)? summaryheat ranks the diagonal of the distance heat map from most to least homogeneous; rank 1 thus corresponds to the cluster with the most similar instance vectors. The upper left and lower right triangular matrices, in contrast, are ranked by the level of heterogeneity or differentiation; rank 1 thus corresponds to the most strongly differentiated pair of clusters. This is useful because inter-cluster differentiation can't be directly inferred from intra-cluster similarity.

summaryheat_sample
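The ranking logic described for summaryheat can be approximated in base R. This sketch assumes mean pairwise distance as the summary statistic for each square. The package may use a different metric, and all variable names here are illustrative.

```r
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 1), ncol = 2),
  matrix(rnorm(40, mean = 12, sd = 2), ncol = 2)
)
cl <- kmeans(x, centers = 3, nstart = 10)$cluster
d <- as.matrix(dist(x))
k <- max(cl)

# Mean pairwise distance for every cluster pair (one value per square).
block_mean <- matrix(NA_real_, k, k)
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    block <- d[cl == i, cl == j, drop = FALSE]
    # Within a cluster, use each pair once and skip zero self-distances.
    block_mean[i, j] <- if (i == j) mean(block[upper.tri(block)]) else mean(block)
  }
}

# Diagonal: rank 1 = smallest mean distance (most homogeneous cluster).
intra_rank <- rank(diag(block_mean))

# Off-diagonal: rank 1 = largest mean distance (most differentiated pair).
pair_idx <- which(upper.tri(block_mean), arr.ind = TRUE)
inter_rank <- rank(-block_mean[upper.tri(block_mean)])
```

Row r of `pair_idx` names the two clusters whose rank is `inter_rank[r]`.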

bigextract: Each square within the distance heat map from summaryheat corresponds to a group of instance vectors. The metrics used to construct and compare these squares in summaryheat can be extracted using bigextract (e.g. mean distance between instance vectors, the number of instances being compared, etc.). While the previous two functions create heat maps, bigextract produces a data frame where each row corresponds to a square of the heat matrix from summaryheat and each column corresponds to a distance metric. The first observation corresponds to the bottom left cluster square. The next observation corresponds to the second cluster comparison (i.e. C2-C1 or C1-C2).

summaryheatp
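A data frame with one row per heat map square, in the spirit of what bigextract is described as returning, could be assembled like this. The column names and the choice of metrics are assumptions for illustration. They are not bigextract's actual output format.

```r
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
)
cl <- kmeans(x, centers = 2, nstart = 10)$cluster
d <- as.matrix(dist(x))

# One row per cluster pair (i <= j); column names here are illustrative.
ids <- sort(unique(cl))
combos <- subset(expand.grid(c1 = ids, c2 = ids), c1 <= c2)

metrics <- do.call(rbind, lapply(seq_len(nrow(combos)), function(r) {
  i <- combos$c1[r]
  j <- combos$c2[r]
  block <- d[cl == i, cl == j, drop = FALSE]
  # Within a cluster, use each pair once and skip zero self-distances.
  vals <- if (i == j) block[upper.tri(block)] else as.vector(block)
  data.frame(
    pair = paste0("C", i, "-C", j),
    n_row_instances = sum(cl == i),
    n_col_instances = sum(cl == j),
    mean_dist = mean(vals),
    sd_dist = sd(vals)
  )
}))

metrics
```

Having the numbers in a data frame makes it possible to compare squares directly rather than judging them by color alone.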

install.packages("devtools")
devtools::install_github("lukadw11/clusty")

This package is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, version 3, as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details. A copy of the GNU General Public License, version 3, is available at http://www.r-project.org/Licenses/GPL-3

lukadw11/Clusty documentation built on May 21, 2019, 8:57 a.m.
