consensusClustering: The consensusClustering function
In mpru/ConsensusClustering: An R Package for Consensus Clustering


Runs Consensus Clustering for class discovery and clustering validation.

consensusClustering(dataMatrix, K = 2:3, nIters = 30, propSamples = 0.8,
  clusterAlgorithm = "Kmeans", verbose = TRUE, seed = NULL,
  saveResults = FALSE, pathOutput = "", finalLinkage = "average",
  PACLowerLim = 0.1, PACUpperLim = 0.9, plotHeatmaps = c("both",
  "consensus", "data", "no"), plotSave = c("no", "pdf", "bmp", "png", "ps"),
  showDendrogram = TRUE, showSamplesNames = TRUE,
  showFeaturesNames = TRUE, plotCDF = TRUE, plotTracking = TRUE,
  consensusStats = TRUE, consensusStatsPlots = TRUE)

`dataMatrix`	matrix or data frame with data to cluster, samples/items in the columns and features in the rows.
`K`	vector of integers giving the numbers of clusters to evaluate. It can be of length 1 and does not need to consist of consecutive integers. For example, any of `K = 4`, `K = 2:5` or `K = c(5, 10, 15)` would work.
`nIters`	number of iterations (bootstrap samples).
`propSamples`	proportion of items to sample in each bootstrap sample.
`clusterAlgorithm`	algorithm used to perform the clustering. For the moment, only K-means is available.
`verbose`	logical, print progress messages to screen. During the bootstrap iterations, a report to monitor the progress is created in `pathOutput`.
`seed`	numerical value used to set the random seed for reproducible results. The `doRNG` package is used to guarantee reproducible results even when running in parallel.
`saveResults`	logical indicating if the output should be saved as an .rds file in the directory `pathOutput`.
`pathOutput`	directory for the output files and the iterations progress report. Defaults to the current working directory.
`finalLinkage`	hierarchical linkage method used to produce a final classification from the consensus indexes generated by the bootstrap samples.
`PACLowerLim`	lower limit for the interval of ambiguous clustering used for calculating PAC score, belongs to the interval (0, 1).
`PACUpperLim`	upper limit for the interval of ambiguous clustering used for calculating PAC score, belongs to the interval (0, 1).
`plotHeatmaps`	character string indicating which heatmaps should be produced: "consensus" (only heatmap of the consensus indexes), "data" (only heatmap of input data set), "both" (default), or "no" (no plot is produced).
`plotSave`	character string indicating the file format in which the plots are saved in the directory `pathOutput`. The default is "no": the plots are not saved but printed to the screen. Other options are "pdf", "bmp", "png" and "ps".
`showDendrogram`	logical indicating if dendrograms should be plotted in the heatmaps (defaults to TRUE).
`showSamplesNames`	logical indicating if sample names should be displayed in the plots (defaults to TRUE).
`showFeaturesNames`	logical indicating if feature names should be displayed in the plots (defaults to TRUE).
`plotCDF`	logical indicating if the plots of the Cumulative Distribution Function (CDF) of the consensus indexes and of the relative change in the area under the CDF should be produced. The second plot is not produced if `length(K) == 1`, since there is no comparison to be made. If `plotCDF == TRUE`, a vector with the area under the CDF curve for each K is returned.
`plotTracking`	logical indicating if the plot with the tracking of samples through different values of K should be produced. No tracking plot is produced if `length(K) == 1`, since there is no tracking to be done.
`consensusStats`	logical indicating if consensus statistics should be computed.
`consensusStatsPlots`	logical indicating if plots of consensus statistics should be produced (only considered if `consensusStats == TRUE`).
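
As an illustration of how the arguments above fit together, the following is a minimal sketch of a call on simulated data. The simulated matrix, the chosen argument values and the assumption that the package installs under the name `ConsensusClustering` are all illustrative and not taken from the package documentation. The call itself is guarded so that the snippet still runs when the package is not installed.

```r
# Simulated data: 20 features (rows) x 30 samples (columns), two shifted groups.
set.seed(1)
nFeatures <- 20
groupA <- matrix(rnorm(nFeatures * 15, mean = 0), nrow = nFeatures)
groupB <- matrix(rnorm(nFeatures * 15, mean = 3), nrow = nFeatures)
dataMatrix <- cbind(groupA, groupB)
rownames(dataMatrix) <- paste0("feature", seq_len(nFeatures))
colnames(dataMatrix) <- paste0("sample", seq_len(ncol(dataMatrix)))

# Assumption: the package is installed under the name "ConsensusClustering".
if (requireNamespace("ConsensusClustering", quietly = TRUE)) {
  results <- ConsensusClustering::consensusClustering(
    dataMatrix,
    K = 2:4,               # evaluate 2, 3 and 4 clusters
    nIters = 50,           # number of bootstrap samples
    propSamples = 0.8,     # proportion of items drawn in each sample
    seed = 123,            # reproducible results
    verbose = FALSE,
    plotHeatmaps = "no",   # skip all plots in this sketch
    plotCDF = FALSE,
    plotTracking = FALSE,
    consensusStats = FALSE,
    consensusStatsPlots = FALSE
  )
}
```

Note that samples go in the columns and features in the rows, which is the transpose of the layout expected by many base R clustering functions such as `kmeans()`.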

Consensus Clustering is a revised tool implementing the methodology for class discovery and clustering validation described by Monti et al. (2003). The method finds a consensus assignment across multiple runs of a clustering algorithm, which makes it possible to assess and validate empirically the stability of the discovered clusters. It was designed to identify robust clusters in genomic data, but it is applicable to any unsupervised learning task.

This function is parallelized under the unifying paradigm, so it automatically detects clusters or cores registered by the user beforehand, and runs sequentially if no parallel capabilities are available. When a seed is provided, reproducible results are guaranteed even when running in parallel, through the use of the doRNG package.
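
A possible way to register workers before calling the function is sketched below. It assumes that the function picks up a backend registered for the `foreach` package (on which `doRNG` builds) and uses `doParallel` as one such backend. Neither the choice of `doParallel` nor the number of workers comes from the package documentation.

```r
nWorkers <- 2  # illustrative value; adjust to the machine

# Assumption: a foreach backend such as doParallel is what the function detects.
if (requireNamespace("doParallel", quietly = TRUE)) {
  cl <- parallel::makeCluster(nWorkers)
  doParallel::registerDoParallel(cl)

  # ... call consensusClustering(dataMatrix, K = 2:4, seed = 123) here ...

  # Release the workers and fall back to sequential execution afterwards.
  parallel::stopCluster(cl)
  foreach::registerDoSEQ()
}
```

If no backend is registered, the same call is expected to run sequentially, as described above.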

A list with the results of the consensus clustering. The first elements of the list correspond to each value of K evaluated, each one containing:

consensusTree: final hierarchical tree based on the matrix of consensus indexes after running all the iterations of the clustering.
consensusClass: vector with the final cluster assignment for each sample.
consensusVector: vector of consensus indexes for each pair of samples. The consensus index is the proportion of times that a pair of samples was clustered together in the same group, out of the total number of times they were in the same bootstrap sample.
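
The definition of the consensus index can be made concrete with a small sketch in base R. This is not the package's implementation: it assumes subsampling without replacement, uses `stats::kmeans()` for the clustering, and takes `1 - consensus` as the dissimilarity for the final hierarchical tree. All variable names are illustrative.

```r
set.seed(42)
# Toy data: 10 features (rows) x 16 samples (columns), two separated groups.
dataMatrix <- cbind(matrix(rnorm(10 * 8, mean = 0), nrow = 10),
                    matrix(rnorm(10 * 8, mean = 4), nrow = 10))
nSamples <- ncol(dataMatrix)
nIters <- 30
propSamples <- 0.8
k <- 2

together <- matrix(0, nSamples, nSamples)  # times a pair shared a cluster
sampled  <- matrix(0, nSamples, nSamples)  # times a pair shared a subsample

for (i in seq_len(nIters)) {
  idx <- sample(nSamples, size = floor(propSamples * nSamples))
  # kmeans() clusters rows, so transpose to put the samples in rows.
  cl <- kmeans(t(dataMatrix[, idx]), centers = k)$cluster
  sampled[idx, idx]  <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Consensus index: co-clustering count over co-sampling count (0 if never co-sampled).
consensusMatrix <- ifelse(sampled > 0, together / sampled, 0)
consensusVector <- consensusMatrix[lower.tri(consensusMatrix)]

# Final classification from the consensus indexes with average linkage.
consensusTree  <- hclust(as.dist(1 - consensusMatrix), method = "average")
consensusClass <- cutree(consensusTree, k = k)
```

Values of the consensus index close to 0 or 1 mean that a pair of samples was almost never or almost always clustered together, which is what a stable partition looks like.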

The next elements of the returned list are:

PACscores: vector with the Proportion of Ambiguous Clustering (PAC) score for each value of K evaluated. The PAC score is the fraction of sample pairs with consensus index values falling in the intermediate sub-interval (PACLowerLim, PACUpperLim), by default (0.1, 0.9). A lower PAC score indicates a more robust clustering.
colorTracking: list with the color assignment of clusters and samples for each K. If length(K) > 1, colors are assigned by tracking samples across the different values of K, letting the user follow the history of clusters relative to earlier clusters.
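
The PAC definition above translates directly into a one-line computation. The helper `pacScore()` below is hypothetical and only illustrates the formula. It treats the interval as open at both ends, following the notation in the text, while the package may handle the endpoints differently.

```r
# Hypothetical helper: fraction of consensus indexes strictly inside (lower, upper).
pacScore <- function(consensusVector, lower = 0.1, upper = 0.9) {
  mean(consensusVector > lower & consensusVector < upper)
}

# Made-up consensus indexes for 10 sample pairs.
consensusVector <- c(0, 0, 0.05, 0.5, 0.8, 1, 1, 1, 0.95, 0.2)

# Three of the ten values (0.5, 0.8 and 0.2) are ambiguous, so the score is 0.3.
pac <- pacScore(consensusVector)
```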

The returned list may also include the following elements if the corresponding arguments were set to TRUE:

areaUnderCDF: if plotCDF == TRUE, vector with area under the CDF curve for each K, see PlotCDF.
consensusStats: if consensusStats == TRUE, list with consensus statistics, see ConsensusStatsAndPlots.
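
For orientation, the area under the empirical CDF of the consensus indexes and its relative change across K can be sketched as follows. This follows the formulas in Monti et al. (2003) as an assumption. The package's `PlotCDF` may compute these quantities differently, and the helper name and toy vectors are illustrative.

```r
# Hypothetical helper: area under the empirical CDF, summing
# (x_i - x_{i-1}) * CDF(x_i) over the sorted consensus indexes.
areaUnderCDFSketch <- function(consensusVector) {
  x <- sort(consensusVector)
  cdf <- ecdf(consensusVector)
  sum(diff(x) * cdf(x[-1]))
}

# Made-up consensus vectors for K = 2, 3, 4.
consensusByK <- list(
  "2" = c(0, 0, 0.5, 1, 1),
  "3" = c(0, 0, 0, 0.9, 1),
  "4" = c(0, 0, 0, 0.95, 1)
)
areas <- sapply(consensusByK, areaUnderCDFSketch)

# Relative change in area between consecutive values of K
# (assumes K is sorted and consecutive; the first K has no predecessor).
relativeChange <- diff(areas) / head(areas, -1)
```

A relative change close to zero suggests that adding another cluster no longer changes the consensus distribution appreciably.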