cc_test: Consensus Clustering Test


View source: R/cc_test.R

Description

This function implements the Monte Carlo consensus clustering algorithm.

Usage

cc_test(dat, max_k = 3, ref_method = "pc_norm", B = 100, reps = 100,
  distance = "euclidean", cluster_alg = "hclust",
  hclust_method = "average", p_item = 0.8, p_feature = 1,
  wts_item = NULL, wts_feature = NULL, pac_window = c(0.1, 0.9),
  p_adj = NULL, seed = NULL, parallel = TRUE)

Arguments

dat

Probe by sample omic data matrix. Data should be filtered and normalized prior to analysis.

max_k

Integer specifying the maximum cluster number to evaluate. Default is max_k = 3, but a more reasonable rule of thumb is the square root of the sample size.

ref_method

How should null data be generated? Options include "pc_norm", "pc_unif", "cholesky", "range", and "permute". See Details.

B

Number of reference datasets to generate.

reps

Number of subsamples to draw for consensus clustering.

distance

Distance metric for clustering. Supports all methods available in dist and vegdist, as well as those implemented in the bioDist package.

cluster_alg

Clustering algorithm to implement. Currently supports hierarchical ("hclust"), k-means ("kmeans"), and k-medoids ("pam").

hclust_method

Method to use if cluster_alg = "hclust". See hclust. Will also be applied for clustering on consensus matrix output.

p_item

Proportion of items to include in each subsample.

p_feature

Proportion of features to include in each subsample.

wts_item

Optional vector of item weights.

wts_feature

Optional vector of feature weights.

pac_window

Lower and upper bounds for the consensus index sub-interval over which to calculate the PAC. Must be on (0, 1). See Details.

p_adj

Optional method for p-value adjustment. Supports all options available in p.adjust.

seed

Optional seed for reproducibility.

parallel

If a parallel backend is loaded and available, should the function use it? Highly advisable if hardware permits.

Details

cc_test provides a hypothesis testing framework for consensus clustering. It takes an input matrix dat, and generates B null datasets with similar properties but no sample-wise cluster structure. The consensus cluster algorithm is then run on each simulated matrix, with PAC scores stored for reference. The function then consensus clusters the actual input data, and PAC scores for each cluster number k are tested against their empirically estimated null distribution.
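The testing step can be illustrated with a minimal sketch. It assumes we already hold one observed PAC score and B null PAC scores for a single k. The variable names, the simulated null values, and the exact p-value and z-score formulas are illustrative assumptions, not the package's internal code.

```r
# Hypothetical inputs for one cluster number k (illustrative numbers,
# not output from cc_test).
set.seed(1)
null_pac <- rbeta(100, shape1 = 8, shape2 = 4)  # PAC scores from B = 100 null datasets
obs_pac <- 0.35                                  # PAC score from the real data

# Lower PAC means more stable clustering, so the test is one-sided:
# how often does a null dataset look at least as stable as the real data?
# Adding 1 to numerator and denominator keeps the estimate above zero.
p_value <- (sum(null_pac <= obs_pac) + 1) / (length(null_pac) + 1)

# Standardized gap between the expected (null) PAC and the observed PAC.
z_score <- (mean(null_pac) - obs_pac) / sd(null_pac)
```

With several values of k, the resulting p-values could then be passed to p.adjust, which is what the p_adj argument appears to control.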

cc_test currently supports five methods for generating null datasets from a given input matrix: "pc_norm", "pc_unif", "cholesky", "range", and "permute".

The first two options use the data's true eigenvectors to preserve feature-wise covariance while scrambling sample-wise covariance. "pc_norm" tends to generate the most realistic null data, while Monte Carlo replicates generated via "pc_unif" converge more quickly to a true k of 1. Both methods are fast and stable when features outnumber samples. When samples outnumber features, ref_method defaults to "cholesky", which takes longer to compute, but is better suited for such cases. "range" and "permute" are included for convenience, but are not recommended since they do not preserve feature-wise covariance, which may bias results.
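One way to picture the eigenvector-based approach is the sketch below, written in the spirit of "pc_norm". It assumes the null data are built by drawing new principal component scores from normal distributions and rotating them back through the real eigenvectors. The toy matrix and every variable name are hypothetical, and the package's actual implementation may differ.

```r
set.seed(1)
# Toy input: 200 features (rows) by 12 samples (columns).
dat <- matrix(rnorm(200 * 12), nrow = 200, ncol = 12)

# PCA with samples as observations; pca$rotation holds the
# feature-wise eigenvectors of the real data.
pca <- prcomp(t(dat), center = TRUE, scale. = FALSE)

# Draw fresh, independent scores for every sample, matching each
# component's standard deviation to the real data.
n <- ncol(dat)
sim_scores <- sapply(pca$sdev, function(s) rnorm(n, mean = 0, sd = s))

# Rotate back to feature space and restore the feature means.
# rep(..., each = n) lines up one mean per feature column.
null_dat <- t(sim_scores %*% t(pca$rotation) + rep(pca$center, each = n))
```

Because the scores are drawn independently for each sample, any sample-wise grouping is removed, while the shared eigenvectors carry over the relationships between features.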

PAC stands for proportion of ambiguous clustering. To calculate the PAC for a given cluster number k, we first compute the consensus matrix via consensus clustering; then generate the empirical CDF curve for the lower triangle of that matrix; find CDF-values for the upper and lower bounds of the PAC window; and subtract the latter value from the former. Since the consensus matrix for a perfectly stable cluster would consist of just 1's and 0's, the ideal CDF curve is flat in the middle. The goal is therefore to minimize the PAC. See Senbabaoglu et al. (2014) for more details.
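The PAC calculation described above can be sketched in a few lines of base R. The helper name pac_score and the toy consensus matrices are hypothetical, and the sketch takes the consensus matrix as given rather than computing it.

```r
# Compute the PAC from a consensus matrix and a PAC window.
pac_score <- function(consensus, pac_window = c(0.1, 0.9)) {
  # Consensus indices for each unique pair of samples (lower triangle only).
  idx <- consensus[lower.tri(consensus)]
  # Empirical CDF of the pairwise consensus indices.
  cdf <- ecdf(idx)
  # CDF value at the upper bound minus CDF value at the lower bound.
  cdf(pac_window[2]) - cdf(pac_window[1])
}

# Perfectly stable toy case: two clusters of three samples each,
# so every consensus index is exactly 0 or 1.
labels <- c(1, 1, 1, 2, 2, 2)
stable <- outer(labels, labels, "==") * 1
pac_score(stable)     # 0

# Fully ambiguous toy case: every pair co-clusters half of the time.
ambiguous <- matrix(0.5, nrow = 6, ncol = 6)
diag(ambiguous) <- 1
pac_score(ambiguous)  # 1
```

The two toy cases mark the extremes: a PAC of 0 when no consensus index falls inside the window, and a PAC of 1 when all of them do.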

Value

A list with max_k elements. If null = TRUE, then the first item is a results data frame with columns for cluster number k, observed PAC score, expected PAC score, z-stability, standard error, and p-value. Adjusted p-values are also returned if p_adj is non-NULL.

If null = FALSE, then the first item is a results data frame with columns for cluster number k and PAC score.

Elements two through max_k are lists corresponding to the unique values of k, each containing the following three elements: the consensus matrix, tree, and cluster assignments for that k, as determined by consensus clustering.

References

Monti, S., Tamayo, P., Mesirov, J., & Golub, T. (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52: 91-118.

Senbabaoglu, Y., Michailidis, G., & Li, J.Z. (2014). Critical limitations of consensus clustering in class discovery. Scientific Reports, 4: 6207.

See Also

ConsensusClusterPlus

Examples

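# Simulate a 1000-probe by 12-sample matrix of pure noise (no true clusters)
mat <- matrix(rnorm(1000 * 12), nrow = 1000, ncol = 12)
# Run the consensus clustering test with default settings
res <- cc_test(mat)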

dswatson/M3C documentation built on May 21, 2019, 7:58 a.m.