Consensus Clustering is a revised tool implementing the class discovery and clustering validation methodology of Monti et al. (2003). The method finds a consensus assignment across multiple runs of a clustering algorithm, which lets one empirically assess and validate the stability of the discovered clusters. It was designed to identify robust clusters in genomic data, but it is applicable to any unsupervised learning task.
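The resampling idea described above can be illustrated with a minimal sketch. It is written in plain Python rather than R so that it stays self-contained, and it is not this package's API: the function names `kmeans_labels` and `consensus_matrix`, the 80% subsampling fraction and the tiny k-means routine are all assumptions made for illustration. For each pair of samples, the consensus index is the fraction of resamples containing both samples in which they were assigned to the same cluster.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_iter=20):
    """Tiny Lloyd's k-means on a list of equal-length tuples (illustrative only)."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=50, fraction=0.8, seed=0):
    """Pairwise consensus indices from repeated subsampling (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times a pair shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times a pair was drawn together
    size = max(k, int(round(fraction * n)))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    return [
        [1.0 if i == j
         else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
consensus = consensus_matrix(points, k=2)
```

Values near 1 or 0 indicate pairs that are consistently clustered together or apart; intermediate values indicate unstable assignments.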
This package was inspired by an existing package that addresses the same methodology, ConsensusClusterPlus (Wilkerson and Hayes, 2010), and improves on its implementation in the following aspects:
Implementation of parallelization: Our package lets the user take advantage of multiple cores or of computational clusters to run the bootstrap iterations faster.
Improved use of data structures: In order to have better memory efficiency, we replaced all symmetric consensus matrices between pairs of samples with consensus vectors which store the same data in smaller structures.
User-friendly source code: Our code follows good-practice style, with descriptive variable names and a clear separation of tasks. These characteristics, missing in the previous ConsensusClusterPlus package, contribute to the maintainability, understandability, reusability, debuggability and extensibility of the code.
Functions for analysis of the results that can be called later, independently of the main function: All the diagnostic plots for assessing the optimal value of K, as well as the calculation of consensus statistics, can be obtained in the main execution of the consensusClustering function, but they can also be disabled and run individually later by calling the respective functions with the consensusClustering results as input. This lets the user choose whether to spend time and computational resources on these tasks.
More flexible options for plots: Heatmaps for big data sets can run into computational problems when plotting deep dendrograms, or visualization issues when annotating sample and feature names. We made these features configurable by the user.
Implementation of PAC scores: Our package adds one extra measure to assess the optimal value of K, the Proportion of Ambiguous Clustering (PAC score; Senbabaoglu et al., 2014).
Intra- and inter-cluster consensus summary: Our package returns single intra- and inter-cluster consensus coefficients for each value of K evaluated, allowing easy comparison.
Analysis is performed for any desired values of K: In our package the user can provide a vector with the desired values of K to evaluate (for example, K = 4, K = 2:5, K = c(5, 10, 15)), while in Wilkerson's package the analysis had to be performed for all values between 2 and K, where K is defined by the user.
Plots implemented with the ggplot2 and ComplexHeatmap packages, resulting in plots with a polished appearance.
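Two of the points in the list above, the consensus vectors and the PAC score, can be made concrete with a short sketch. It is plain Python for illustration only and does not reproduce this package's internals: the names `pair_index`, `to_consensus_vector` and `pac_score` are hypothetical, the row-wise upper-triangle layout is an assumption, and the thresholds 0.1 and 0.9 are the ones commonly quoted for PAC rather than defaults confirmed by this page. Because a consensus matrix is symmetric, only the n(n-1)/2 values above the diagonal need to be stored; PAC is then the fraction of those pairwise values that fall between the two thresholds.

```python
def pair_index(i, j, n):
    """Position of the unordered pair (i, j) in a row-wise upper-triangle vector."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def to_consensus_vector(matrix):
    """Keep only the entries above the diagonal of a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pac_score(consensus_vector, lower=0.1, upper=0.9):
    """Fraction of pairs with lower < consensus <= upper, i.e. CDF(upper) - CDF(lower)."""
    ambiguous = sum(1 for c in consensus_vector if lower < c <= upper)
    return ambiguous / len(consensus_vector)


# A small symmetric consensus matrix for four samples (made-up values).
matrix = [
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
vector = to_consensus_vector(matrix)  # 6 values instead of 16
pac = pac_score(vector)               # 2 of the 6 pairs are ambiguous
```

A lower PAC means fewer ambiguously clustered pairs, so under this criterion the preferred K is the one with the lowest score.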
This first version of our package only handles k-means as the clustering algorithm. Wilkerson's ConsensusClusterPlus package provides a wide range of other options.
Jessica Soto and Marcos Prunello
Monti, S. et al. (2003) Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52, 91-118.
Wilkerson, M. and Hayes, D. (2010) ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics, 26, 1572-1573.
Senbabaoglu, Y. et al. (2014) Critical limitations of consensus clustering in class discovery. Scientific Reports, 4, Article number 6207.