clValid_flow: Cluster validation measure analysis workflow

View source: R/clValid.R

clValid_flow    R Documentation

Description

Interactive console workflow that calculates and evaluates the cluster validation measures configured beforehand with init_clValid.

Usage

clValid_flow(matrix, par)

Arguments

matrix

Earth Mover's Distance Matrix for processed patient time series data (also see functions: emd_matrix, patient_list)

par

Object of type list storing clustering methods and cluster range of interest; initialized via function: init_clValid

Details

The call guides the user through an interactive workflow: it generates, stores and lists cluster evaluation measures, visualizes the corresponding plots, and lets the user decide which technique is preferred. Once the user has made a choice, the flow continues to the function clust_matrix and generates the respective clustering output. The internal cluster validation methods use only the dataset and the clustering partition as input and evaluate the quality of the clustering from intrinsic information in the data.

The call calculates Connectivity, Silhouette width and Dunn index. Connectivity describes the degree to which observations are placed in the same cluster as their nearest neighbors in a particular clustering partition and should be minimized. Silhouette width is the average of the silhouette values of all observations and should be maximized. The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance and should likewise be maximized.
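To make two of these definitions concrete, the following base R sketch computes the Dunn index and the average Silhouette width from a distance matrix and a vector of cluster labels. It is an illustration only, not the code that clValid_flow runs internally; the toy data and the helper names dunn_index and avg_silhouette are assumptions made for this example.

```r
# Toy distance matrix for six one-dimensional observations (made-up values).
d <- as.matrix(dist(c(1, 2, 3, 10, 11, 12)))
labels <- c(1, 1, 1, 2, 2, 2)  # hypothetical partition into two clusters

# Dunn index: smallest distance between observations in different clusters
# divided by the largest distance between observations in the same cluster.
# (Hypothetical helper; assumes at least one cluster has two or more members.)
dunn_index <- function(d, labels) {
  same <- outer(labels, labels, "==")  # TRUE where both observations share a cluster
  off_diag <- row(d) != col(d)         # exclude each observation's distance to itself
  min(d[!same]) / max(d[same & off_diag])
}

# Average Silhouette width: for each observation, compare the mean distance to
# its own cluster (a) with the mean distance to the nearest other cluster (b).
# (Hypothetical helper; singleton clusters get a silhouette value of 0.)
avg_silhouette <- function(d, labels) {
  n <- length(labels)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    own[i] <- FALSE
    if (!any(own)) next
    a <- mean(d[i, own])
    others <- setdiff(unique(labels), labels[i])
    b <- min(sapply(others, function(k) mean(d[i, labels == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

dunn_value <- dunn_index(d, labels)      # 7 / 2 = 3.5 for the toy data
sil_value <- avg_silhouette(d, labels)
```

Both values are large here because the two toy clusters are compact and far apart; a partition that splits one of the groups would lower both measures.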

Furthermore, cluster stability measures are available, namely Average proportion of non-overlap (APN), Average distance (AD), Average distance between means (ADM) and Figure of merit (FOM). APN is the average proportion of observations that are not placed in the same cluster when clustering complete and leaky data. AD is the average distance between observations placed in the same cluster for both complete and leaky data. ADM is the average distance between the cluster centers obtained from complete and leaky data. FOM measures the average intra-cluster variance in leaky data. All four measures should be minimized.

Finally, Rank Aggregation may be performed. It aims to provide a generic and flexible framework for objectively combining several ordered lists in a suitable and efficient way. The ranks are aggregated with a cross-entropy approach that uses Spearman’s footrule distance. In the end, a recommendation for the best-fitting clustering model is given.
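Two of the building blocks above can be sketched in a few lines of base R: the proportion of non-overlap between a partition of the complete data and a partition of the leaky data (the quantity that APN averages), and Spearman’s footrule distance between two ordered lists. This is a minimal illustration under stated assumptions, not the implementation used by the workflow; the helper names non_overlap and footrule, the label vectors and the candidate names are all invented for the example.

```r
# Hypothetical cluster labels for the same six observations,
# once from the complete data and once from leaky data.
full  <- c(1, 1, 1, 2, 2, 2)
leaky <- c(1, 1, 2, 2, 2, 2)

# Proportion of non-overlap: for each observation, the share of its
# complete-data cluster that is no longer in its leaky-data cluster,
# averaged over all observations. 0 means identical partitions.
non_overlap <- function(full, leaky) {
  mean(sapply(seq_along(full), function(i) {
    in_full  <- full == full[i]
    in_leaky <- leaky == leaky[i]
    1 - sum(in_full & in_leaky) / sum(in_full)
  }))
}

# Two hypothetical ordered lists of candidate clusterings, best first.
rank_by_dunn       <- c("hierarchical-3", "kmeans-3", "pam-4", "kmeans-2")
rank_by_silhouette <- c("kmeans-3", "hierarchical-3", "kmeans-2", "pam-4")

# Spearman's footrule: sum of absolute differences between the positions
# each candidate holds in the two lists (assumes the same items, no ties).
footrule <- function(a, b) {
  stopifnot(setequal(a, b))
  sum(abs(seq_along(a) - match(a, b)))
}

apn_toy <- non_overlap(full, leaky)                        # 2/9 for the toy labels
footrule_toy <- footrule(rank_by_dunn, rank_by_silhouette) # 4
```

A rank aggregation procedure searches for the single ordering whose total (optionally weighted) footrule distance to all input lists is smallest; the sketch only shows the distance itself, not the cross-entropy search.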

Value

Object of type list storing the chosen clustering method and number of clusters (can then be passed to the function clust_matrix)

References

Guy Brock, Vasyl Pihur, Susmita Datta, and Somnath Datta. clValid: An R package for cluster validation. Journal of Statistical Software, 25:1–22, 2008.

Julia Handl, Joshua Knowles, and Douglas B Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212, 2005.

Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

Joseph C Dunn. Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1):95–104, 1974.

Vasyl Pihur, Susmita Datta, and Somnath Datta. Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics, 23(13):1607–1615, 2007.

Examples

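# Load the demo patient data (named `patients` to avoid masking base::list)
patients <- patient_list(
  "https://raw.githubusercontent.com/MrMaximumMax/FBCanalysis/master/demo/phys/data.csv",
  GitHub = TRUE)
# Sampling frequency is supposed to be daily
# Earth Mover's Distance matrix for the parameter "PEF"
distmat <- emd_matrix(patients, "PEF", maxIter = 5000)
# Set up clustering methods and cluster range, then start the interactive workflow
parameters <- init_clValid()
output <- clValid_flow(distmat, parameters)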


MrMaximumMax/FBCanalysis documentation built on June 23, 2022, 8:21 p.m.