PlotCDF: The PlotCDF function

Description

Plots the Cumulative Distribution Function (CDF) of the consensus indexes to help assess the optimal number of clusters K, and displays the PAC scores in the legend. When results were produced for more than one value of K, it also plots the relative change in the area under the CDF curve, so that the values of K can be compared.

Usage

PlotCDF(results, plotSave = c("no", "pdf", "bmp", "png", "ps"),
  pathOutput = "", PACLowerLim = 0.1, PACUpperLim = 0.9)

Arguments

results

output from the consensusClustering function.

plotSave

character string indicating the file format in which the plots are saved in the directory pathOutput. The default, "no", prints the plots to the screen without saving them. Other options are: "pdf", "bmp", "png", "ps".

pathOutput

directory for saving plots if plotSave is not "no"; defaults to the current working directory.

PACLowerLim

lower limit for the interval of ambiguous clustering used for calculating PAC score.

PACUpperLim

upper limit for the interval of ambiguous clustering used for calculating PAC score.

Details

The CDF plot shows the cumulative distribution functions of the consensus indexes for all pairs of samples, with one curve for each k (indicated by colors). The empirical CDF plot has the cumulative distribution function (CDF) values on the y-axis and the consensus index values on the x-axis. In the CDF curve, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle portion represents those with occasional co-assignments in different clustering runs. The CDF curve shows a flat middle segment for the true K, suggesting that very few sample pairs are ambiguous when K is correctly inferred.

The PAC score can be used to quantify this characteristic. The Proportion of Ambiguous Clustering (PAC) is the fraction of sample pairs whose consensus index values fall within a given sub-interval (x1, x2) of [0, 1] (usually, x1 = 0.1 and x2 = 0.9). The CDF value at a point c is the fraction of sample pairs with a consensus index less than or equal to c. The PAC is then calculated as CDF(x2) - CDF(x1); the optimal K should present a low PAC score.
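As an illustration of the PAC definition above, here is a minimal base R sketch. The toy consensus matrix and the variable names (consensus, pair_index, cdf, pac) are assumptions made for this example; it is not necessarily how PlotCDF computes the score internally.

```r
# Hypothetical consensus matrix for 4 samples (symmetric, values in [0, 1]).
consensus <- matrix(c(1.0, 0.8, 0.5, 0.0,
                      0.8, 1.0, 0.2, 0.0,
                      0.5, 0.2, 1.0, 1.0,
                      0.0, 0.0, 1.0, 1.0), nrow = 4, byrow = TRUE)

# Consensus index of each unordered sample pair (upper triangle, no diagonal).
pair_index <- consensus[upper.tri(consensus)]

# Empirical CDF: cdf(c) is the fraction of pairs with consensus index <= c.
cdf <- ecdf(pair_index)

# PAC = CDF(x2) - CDF(x1), using the usual limits x1 = 0.1 and x2 = 0.9.
pac <- cdf(0.9) - cdf(0.1)
print(pac)
```

In this toy matrix, three of the six pairs (indexes 0.2, 0.5 and 0.8) fall between the two limits, so the sketch gives a PAC of 0.5.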

The difference between two CDFs can be partially summarized by comparing the areas under the two curves. The plot of the relative change in the area under the CDF curve allows a user to assess the relative increase in consensus and to find the value of K beyond which there is no appreciable increase.
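One way to obtain an area under an empirical CDF on [0, 1] is to integrate the step function exactly between its jump points. The sketch below does this in base R for a hypothetical vector of consensus indexes; the names pair_index, knots and area are illustrative, and the package may use a different numerical rule.

```r
# Hypothetical consensus indexes for six sample pairs.
pair_index <- c(0.0, 0.0, 0.2, 0.5, 0.8, 1.0)
cdf <- ecdf(pair_index)

# Jump points of the step function, with the interval ends 0 and 1 included.
knots <- sort(unique(c(0, pair_index, 1)))

# The empirical CDF is constant on [knots[i], knots[i + 1]), so the area is
# the sum of interval widths times the CDF value at each left end.
area <- sum(diff(knots) * cdf(head(knots, -1)))
print(area)
```

For values restricted to [0, 1], this exact area equals 1 - mean(pair_index), which gives a quick sanity check.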

The relative change in the area under the CDF curve for a given value k is defined as (area(k') - area(k))/area(k), where k' is the value consecutive to k in the input vector K. For the first value in the vector K, the relative change is just the area under the curve, since there is no previous value against which to compute a relative change. The vector K provided by the user is always sorted internally, so even if it does not contain consecutive integers, the relative change in the area under the CDF curve from one value of k to the next can still be meaningful.
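The following base R sketch illustrates this calculation on made-up numbers. It assumes one reading of the definition: each value of k is compared with the previous value in the sorted vector K, and the first entry is the area itself. The vectors K and area and the other names are hypothetical, and the package's internal code may differ.

```r
# Hypothetical K values (unsorted, non-consecutive) and the areas under
# their CDF curves, in the same order as K.
K <- c(4, 2, 6)
area <- c(0.60, 0.50, 0.63)

# Sort by K first, as the text says is done internally.
ord <- order(K)
K_sorted <- K[ord]
area_sorted <- area[ord]

# First entry: the area itself (no previous value to compare with).
# Later entries: (area(current) - area(previous)) / area(previous).
rel_change <- c(area_sorted[1], diff(area_sorted) / head(area_sorted, -1))
names(rel_change) <- K_sorted
print(rel_change)
```

With these numbers the entries are 0.5 for k = 2, 0.2 for k = 4 and 0.05 for k = 6, so the gain in consensus levels off after k = 4.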

Value

A vector with the area under the CDF curve for each value of K (note: this is the area itself, not the relative change in the area).

Examples

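library(ConsensusClustering)
# Toy data: a 10 x 6 matrix of random normal values
mat <- matrix(rnorm(10*6), 10, 6)
result <- consensusClustering(mat, K=2:3, nIters = 5, plotCDF = FALSE)
# Plot the CDF curves for K = 2 and K = 3 on the screen
PlotCDF(result)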

mpru/ConsensusClustering documentation built on May 9, 2019, 5:54 a.m.