cvi | R Documentation |
Compute different cluster validity indices (CVIs) of a given cluster partition, using the clustering distance measure and centroid function if applicable.
cvi(a, b = NULL, type = "valid", ..., log.base = 10)

## S4 method for signature 'matrix'
cvi(a, b = NULL, type = "valid", ..., log.base = 10)

## S4 method for signature 'PartitionalTSClusters'
cvi(a, b = NULL, type = "valid", ..., log.base = 10)

## S4 method for signature 'HierarchicalTSClusters'
cvi(a, b = NULL, type = "valid", ..., log.base = 10)

## S4 method for signature 'FuzzyTSClusters'
cvi(a, b = NULL, type = "valid", ..., log.base = 10)
a |
An object returned by |
b |
If needed, a vector that can be coerced to integers which indicate the cluster memberships. The ground truth (if known) should be provided here. |
type |
Character vector indicating which indices are to be computed. See supported values below. |
... |
Arguments to pass to and from other methods. |
log.base |
Base of the logarithm to be used in the calculation of VI (see details). |
Clustering is commonly considered to be an unsupervised procedure, so evaluating its performance can be rather subjective. However, a great amount of effort has been invested in trying to standardize cluster evaluation metrics by using cluster validity indices (CVIs).
In general, CVIs can be tailored to either crisp or fuzzy partitions. CVIs can be classified as internal, external or relative depending on how they are computed. Focusing on the first two, the crucial difference is that internal CVIs only consider the partitioned data and try to define a measure of cluster purity, whereas external CVIs compare the obtained partition to the correct one. Thus, external CVIs can only be used if the ground truth is known.
Note that even though a fuzzy partition can be changed into a crisp one, making it compatible with many of the existing crisp CVIs, there are also fuzzy CVIs tailored specifically to fuzzy clustering, and these may be more suitable in those situations. Fuzzy partitions usually have no ground truth associated with them, but there are exceptions depending on the task's goal.
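As a minimal sketch of how a fuzzy partition can be changed into a crisp one, the base R snippet below assigns each series to the cluster with the highest membership. The membership matrix u is hypothetical and made up for illustration, and this is only one possible hardening rule, not necessarily what any particular package does internally.

```r
# Hypothetical fuzzy membership matrix (an assumption for illustration):
# one row per series, one column per cluster, each row sums to 1.
u <- matrix(
  c(0.80, 0.15, 0.05,
    0.10, 0.70, 0.20,
    0.30, 0.30, 0.40,
    0.05, 0.05, 0.90),
  ncol = 3, byrow = TRUE
)

# Harden the partition: each series goes to its highest-membership cluster.
# which.max() returns the first maximum, so ties go to the lowest cluster id.
crisp <- apply(u, 1L, which.max)
print(crisp)
```

The resulting integer vector has the same form as the crisp cluster memberships described for the arguments above.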
Each index defines its own range of values and whether it is to be minimized or maximized. In many cases, these CVIs can be used to evaluate the result of a clustering algorithm regardless of how the clustering works internally, or how the partition came to be.
Which CVI will work best cannot be determined a priori, so they should be tested for each specific application. Usually, many CVIs are utilized and compared to each other, maybe using a majority vote to decide on a final result. Furthermore, it should be noted that many CVIs perform additional distance calculations when being computed, which can be computationally expensive when using DTW or GAK.
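The majority vote mentioned above could be sketched as follows in base R. The CVI values are invented for illustration (they are not the output of any real clustering), and the row and column names are hypothetical; the only point is that each index votes for its best configuration according to whether it is minimized or maximized.

```r
# Hypothetical CVI values (made up for illustration only).
# Rows: indices, columns: candidate numbers of clusters.
cvis <- rbind(
  Sil = c(0.41, 0.55, 0.48, 0.39),  # to be maximized
  CH  = c(120, 180, 175, 150),      # to be maximized
  DB  = c(0.95, 0.80, 0.70, 0.88),  # to be minimized
  COP = c(0.50, 0.32, 0.35, 0.40)   # to be minimized
)
colnames(cvis) <- paste0("k=", 2:5)

# Direction of each index.
maximize <- c(Sil = TRUE, CH = TRUE, DB = FALSE, COP = FALSE)

# Each index votes for the column where it attains its best value.
votes <- vapply(rownames(cvis), function(i) {
  if (maximize[[i]]) which.max(cvis[i, ]) else which.min(cvis[i, ])
}, integer(1L))

# Tally the votes and pick the most frequent choice.
tally <- table(colnames(cvis)[votes])
winner <- names(tally)[which.max(tally)]
print(winner)
```

Ties in the tally are resolved here by taking the first maximum; a real application may prefer a different tie-breaking rule.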
The chosen CVIs.
Crisp partitions (the first 4 are calculated via flexclust::comPart())
"RI": Rand Index (to be maximized).
"ARI": Adjusted Rand Index (to be maximized).
"J": Jaccard Index (to be maximized).
"FM": Fowlkes-Mallows (to be maximized).
"VI": Variation of Information (Meila (2003); to be minimized).
Fuzzy partitions (based on Lei et al. (2017))
"RI": Soft Rand Index (to be maximized).
"ARI": Soft Adjusted Rand Index (to be maximized).
"VI": Soft Variation of Information (to be minimized).
"NMIM": Soft Normalized Mutual Information based on Max entropy (to be maximized).
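To make the role of log.base in the crisp VI more concrete, here is a small base R sketch of the Variation of Information as defined by Meila (2003), VI(A, B) = H(A) + H(B) - 2 I(A; B), which equals 2 H(A, B) - H(A) - H(B). The function vi_sketch is a hypothetical helper written for this illustration; it is not the code used by cvi(), and details such as normalization may differ there.

```r
# Illustrative helper (hypothetical, not part of any package).
# VI(A, B) = 2 * H(A, B) - H(A) - H(B), with entropies in the given log base.
vi_sketch <- function(a, b, log.base = 10) {
  stopifnot(length(a) == length(b))
  joint <- table(a, b) / length(a)   # joint distribution of the two labelings
  entropy <- function(p) {
    p <- p[p > 0]                    # treat 0 * log(0) as 0
    -sum(p * log(p, base = log.base))
  }
  2 * entropy(joint) - entropy(rowSums(joint)) - entropy(colSums(joint))
}

a <- c(1, 1, 2, 2, 3, 3)
b <- c(1, 1, 1, 2, 2, 2)

vi_sketch(a, a)                 # identical partitions give (numerically) zero
vi_sketch(a, b)                 # base 10
vi_sketch(a, b, log.base = 2)   # same quantity measured in bits
```

Changing the base only rescales the value by a constant factor, so it does not affect which of two partitions is closer to the ground truth.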
The indices marked with an exclamation mark (!) calculate (or re-use if already available) the whole distance matrix between the series in the data. If you were trying to avoid this in the first place, then these CVIs might not be suitable for your application.
The indices marked with a question mark (?) depend on the extracted centroids, so bear that in mind if a hierarchical procedure was used and/or the centroid function has associated randomness (such as shape_extraction() with series of different length).
The indices marked with a tilde (~) require the calculation of a global centroid. Since DBA() and shape_extraction() (for series of different length) have some randomness associated, these indices might not be appropriate for those centroids.
Crisp partitions
"Sil" (!): Silhouette index (Rousseeuw (1987); to be maximized).
"D" (!): Dunn index (Arbelaitz et al. (2013); to be maximized).
"COP" (!): COP index (Arbelaitz et al. (2013); to be minimized).
"DB" (?): Davies-Bouldin index (Arbelaitz et al. (2013); to be minimized).
"DBstar" (?): Modified Davies-Bouldin index (DB*) (Kim and Ramakrishna (2005); to be minimized).
"CH" (~): Calinski-Harabasz index (Arbelaitz et al. (2013); to be maximized).
"SF" (~): Score Function (Saitta et al. (2007); to be maximized; see notes).
Fuzzy partitions (using the nomenclature from Wang and Zhang (2007))
"MPC": to be maximized.
"K" (~): to be minimized.
"T": to be minimized.
"SC" (~): to be maximized.
"PBMF" (~): to be maximized (see notes).
"valid": Returns all valid indices depending on the type of a and whether b was provided or not.
"internal": Returns all internal CVIs. Only supported for TSClusters objects.
"external": Returns all external CVIs. Requires b to be provided.
In the original definition of many internal and fuzzy CVIs, the Euclidean distance and a mean centroid were used. The implementations here change this, making use of whatever distance/centroid was chosen during clustering. However, some of the CVIs assume that the distances are symmetric, since cross-distance matrices are calculated and only the upper/lower triangles are considered. A warning will be given if the matrices are not symmetric and the CVI assumes so.
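The symmetry assumption can be illustrated with a small base R sketch. The matrix d below is a hypothetical cross-distance matrix with one asymmetric pair, invented for this example; the snippet only shows why reading a single triangle discards information in that case, and it does not reproduce the internal checks performed by cvi().

```r
# Hypothetical cross-distance matrix (an assumption for illustration).
# Entry [2, 3] differs from entry [3, 2], so the matrix is not symmetric.
d <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 4, 0), nrow = 3, byrow = TRUE)

# Base R check for symmetry.
print(isSymmetric(d))

# Pairwise distances as seen from each triangle, in the same pair order.
upper <- d[upper.tri(d)]
lower <- t(d)[upper.tri(d)]
print(rbind(upper = upper, lower = lower))
```

When the two rows differ, an index computed from only one triangle depends on which triangle is read, which is the situation the warning refers to.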
Because of the above, calculating CVIs for clusterings made with TADPole() is a special case. Since TADPole uses 3 distances during its execution (DTW, LB_Keogh and Euclidean), it is not obvious which one should be used for the calculation of CVIs. Nevertheless, dtw_basic() is used by default.
The formula for the SF index in Saitta et al. (2007) does not correspond to the one in Arbelaitz et al. (2013). The one specified in the former is used here.
The formulas for the Silhouette index are not entirely correct in Arbelaitz et al. (2013); refer to Rousseeuw (1987) for the correct ones.
The formulas for the PBMF index are not entirely unambiguous in the literature; the ones given in Lin (2013) are used here.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Perez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243-256.
Kim, M., & Ramakrishna, R. S. (2005). New indices for cluster validity assessment. Pattern Recognition Letters, 26(15), 2353-2363.
Lei, Y., Bezdek, J. C., Chan, J., Vinh, N. X., Romano, S., & Bailey, J. (2017). Extending information-theoretic validity indices for fuzzy clustering. IEEE Transactions on Fuzzy Systems, 25(4), 1013-1018.
Lin, H. Y. (2013). Effective Feature Selection for Multi-class Classification Models. In Proceedings of the World Congress on Engineering (Vol. 3).
Meila, M. (2003). Comparing clusterings by the variation of information. In Learning theory and kernel machines (pp. 173-187). Springer Berlin Heidelberg.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20, 53-65.
Saitta, S., Raphael, B., & Smith, I. F. (2007). A bounded index for cluster validity. In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 174-187). Springer Berlin Heidelberg.
Wang, W., & Zhang, Y. (2007). On fuzzy cluster validity indices. Fuzzy sets and systems, 158(19), 2095-2117.
library(dtwclust)
cvi(CharTrajLabels, sample(CharTrajLabels), type = c("ARI", "VI"))