cc_test: Consensus Clustering Test
In dswatson/M3C: Rigorously test cluster stability

This function implements the Monte Carlo consensus clustering algorithm.

cc_test(dat, max_k = 3, ref_method = "pc_norm", B = 100, reps = 100,
  distance = "euclidean", cluster_alg = "hclust",
  hclust_method = "average", p_item = 0.8, p_feature = 1,
  wts_item = NULL, wts_feature = NULL, pac_window = c(0.1, 0.9),
  p_adj = NULL, seed = NULL, parallel = TRUE)

`dat`	Probe-by-sample omic data matrix. Data should be filtered and normalized prior to analysis.
`max_k`	Integer specifying the maximum cluster number to evaluate. Default is `max_k = 3`, but a more reasonable rule of thumb is the square root of the sample size.
`ref_method`	How should null data be generated? Options include `"pc_norm"`, `"pc_unif"`, `"cholesky"`, `"range"`, and `"permute"`. See Details.
`B`	Number of reference datasets to generate.
`reps`	Number of subsamples to draw for consensus clustering.
`distance`	Distance metric for clustering. Supports all methods available in `dist` and `vegdist`, as well as those implemented in the `bioDist` package.
`cluster_alg`	Clustering algorithm to implement. Currently supports hierarchical (`"hclust"`), k-means (`"kmeans"`), and k-medoids (`"pam"`).
`hclust_method`	Method to use if `cluster_alg = "hclust"`. See `hclust`. Also applied when clustering the consensus matrix output.
`p_item`	Proportion of items to include in each subsample.
`p_feature`	Proportion of features to include in each subsample.
`wts_item`	Optional vector of item weights.
`wts_feature`	Optional vector of feature weights.
`pac_window`	Lower and upper bounds for the consensus index sub-interval over which to calculate the PAC. Both bounds must lie on (0, 1). See Details.
`p_adj`	Optional method for p-value adjustment. Supports all options available in `p.adjust`.
`seed`	Optional seed for reproducibility.
`parallel`	If a parallel backend is loaded and available, should the function use it? Highly advisable if hardware permits.

cc_test provides a hypothesis testing framework for consensus clustering. It takes an input matrix dat and generates B null datasets with similar properties but no sample-wise cluster structure. The consensus clustering algorithm is then run on each simulated matrix, and the resulting PAC scores are stored for reference. The function then consensus clusters the actual input data and tests the PAC score for each cluster number k against its empirically estimated null distribution.
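The final testing step can be illustrated with a minimal sketch in base R. The numbers below are made-up stand-ins for PAC scores, and the exact formulas used inside cc_test are not shown in this page, so the one-sided empirical p-value and the z-score here are assumptions about one reasonable way to compare an observed PAC against B null PACs.

```r
# Hypothetical inputs for a single cluster number k
set.seed(1)
B <- 100
pac_null <- runif(B, min = 0.3, max = 0.6)  # stand-in for PACs from B null datasets
pac_obs <- 0.1                               # stand-in for the PAC of the real data

# Lower PAC means more stable clusters, so the test is one-sided:
# how often is a null PAC at least as small as the observed one?
# Adding 1 to numerator and denominator avoids a p-value of exactly 0.
p_value <- (1 + sum(pac_null <= pac_obs)) / (B + 1)

# Standardized distance of the observed PAC from its null expectation
pac_exp <- mean(pac_null)
z <- (pac_obs - pac_exp) / sd(pac_null)
```

With one p-value per k, a vector of them could then be passed to `p.adjust`, which is what the `p_adj` argument refers to.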

cc_test currently supports five methods for generating null datasets from a given input matrix:

"pc_norm" simulates the principal components by taking random draws from a normal distribution with variance equal to the true eigenvalues. Data are subsequently back-transformed to their original dimensions by cross-multiplication with the true eigenvector matrix.
"pc_unif" simulates the principal components by taking random draws from a uniform distribution with ranges equal to those of the true principal components. Data are subsequently back-transformed to their original dimensions by cross-multiplication with the true eigenvector matrix.
"cholesky" simulates random Gaussian noise around the nearest positive-definite approximation to dat's feature-wise covariance matrix.
"range" selects random values uniformly from each feature's observed range.
"permute" shuffles each feature's observed values.
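As a rough illustration of the "pc_norm" idea, the sketch below simulates one null dataset from a toy matrix using base R's `prcomp`. It is not the package's internal code: the toy data, the variable names, and the choice to add the feature means back after the back-transformation are all assumptions made for the example.

```r
set.seed(42)
# Toy probe-by-sample matrix: 50 features, 10 samples (hypothetical data)
dat <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

# PCA with samples as rows, so the rotation holds feature-wise eigenvectors
pca <- prcomp(t(dat), center = TRUE, scale. = FALSE)
n <- ncol(dat)

# Draw each simulated component from a normal distribution whose
# variance is the corresponding eigenvalue (sdev^2)
sim_pcs <- sapply(pca$sdev, function(s) rnorm(n, mean = 0, sd = s))

# Back-transform to feature space with the true eigenvectors,
# then restore the feature means (an assumption of this sketch)
null_dat <- t(sim_pcs %*% t(pca$rotation) + rep(pca$center, each = n))

dim(null_dat)  # same shape as dat: features by samples
```

Because the simulated components are drawn independently for each sample, any sample-wise grouping in the original scores is lost, while the eigenvectors carry the feature-wise covariance structure into the null data.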

The first two options use the data's true eigenvectors to preserve feature-wise covariance while scrambling sample-wise covariance. "pc_norm" tends to generate the most realistic null data, while Monte Carlo replicates generated via "pc_unif" converge more quickly to a true k of 1. Both methods are fast and stable when features outnumber samples. When samples outnumber features, ref_method defaults to "cholesky", which takes longer to compute, but is better suited for such cases. "range" and "permute" are included for convenience, but are not recommended since they do not preserve feature-wise covariance, which may bias results.
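The caveat about "range" and "permute" can be seen in a small base R sketch. The two-feature toy matrix is hypothetical, and the code only mimics the descriptions above rather than reproducing the package's implementation.

```r
set.seed(7)
# Toy data: 2 strongly correlated features (rows) across 200 samples
x <- rnorm(200)
dat <- rbind(f1 = x, f2 = x + rnorm(200, sd = 0.1))

# "permute": shuffle each feature's values independently across samples
null_permute <- t(apply(dat, 1, sample))

# "range": draw uniformly between each feature's observed min and max
null_range <- t(apply(dat, 1, function(f) runif(length(f), min(f), max(f))))

cor(dat[1, ], dat[2, ])                    # close to 1 in the original data
cor(null_permute[1, ], null_permute[2, ])  # near 0 after shuffling
cor(null_range[1, ], null_range[2, ])      # near 0 after uniform draws
```

Both null matrices keep each feature's marginal range, but the correlation between features is gone, which is the loss of feature-wise covariance described above.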

PAC stands for proportion of ambiguous clustering. To calculate the PAC for a given cluster number k, we first compute the consensus matrix via consensus clustering, then generate the empirical CDF of the entries in that matrix's lower triangle, evaluate the CDF at the upper and lower bounds of the PAC window, and subtract the latter value from the former. Since the consensus matrix for a perfectly stable clustering would consist of just 1's and 0's, the ideal CDF curve is flat in the middle, giving a PAC of 0. The goal is therefore to minimize the PAC. See Senbabaoglu et al. (2014) for more details.
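The PAC calculation described above can be sketched in a few lines of base R. The helper name `pac_score` and the two toy consensus matrices are hypothetical and only serve to show the arithmetic.

```r
# PAC for one consensus matrix; `pac_score` is a hypothetical helper name
pac_score <- function(cons_mat, pac_window = c(0.1, 0.9)) {
  # Consensus indices for each unique pair of items
  idx <- cons_mat[lower.tri(cons_mat)]
  # Empirical CDF of the consensus indices
  cdf <- ecdf(idx)
  # CDF at the upper bound minus CDF at the lower bound
  cdf(pac_window[2]) - cdf(pac_window[1])
}

# Perfectly stable two-cluster solution: every entry is 0 or 1
stable <- rbind(c(1, 1, 0, 0),
                c(1, 1, 0, 0),
                c(0, 0, 1, 1),
                c(0, 0, 1, 1))
pac_score(stable)     # 0: no pair falls inside the window

# Ambiguous solution: between-cluster pairs co-cluster half the time
ambiguous <- stable
ambiguous[stable == 0] <- 0.5
pac_score(ambiguous)  # 4 of the 6 pairs fall inside the window
```

Widening or narrowing `pac_window` changes which consensus indices count as ambiguous, so PAC scores are only comparable when computed over the same window.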

A list with max_k elements. If null = TRUE, then the first item is a results data frame with columns for cluster number k, observed PAC score, expected PAC score, z-stability, standard error, and p-value. Adjusted p-values are also returned if p_adj is non-NULL.

If null = FALSE, then the first item is a results data frame with columns for cluster number k and PAC score.

Elements two through max_k are lists corresponding to the unique values of k, each containing the following three elements: the consensus matrix, tree, and cluster assignments for that k, as determined by consensus clustering.

Monti, S., Tamayo, P., Mesirov, J., & Golub, T. (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52: 91-118.

Senbabaoglu, Y., Michailidis, G., & Li, J.Z. (2014). Critical limitations of consensus clustering in class discovery. Scientific Reports, 4: 6207.

ConsensusClusterPlus