This function implements the Monte Carlo consensus clustering algorithm.
dat |
Probe by sample omic data matrix. Data should be filtered and normalized prior to analysis. |
max_k |
Integer specifying the maximum cluster number to evaluate.
Default is |
ref_method |
How should null data be generated? Options include |
B |
Number of reference datasets to generate. |
reps |
Number of subsamples to draw for consensus clustering. |
distance |
Distance metric for clustering. Supports all methods
available in |
cluster_alg |
Clustering algorithm to implement. Currently supports
hierarchical ( |
hclust_method |
Method to use if |
p_item |
Proportion of items to include in each subsample. |
p_feature |
Proportion of features to include in each subsample. |
wts_item |
Optional vector of item weights. |
wts_feature |
Optional vector of feature weights. |
pac_window |
Lower and upper bounds for the consensus index sub-interval over which to calculate the PAC. Both bounds must lie in the interval (0, 1). See Details. |
p_adj |
Optional method for p-value adjustment. Supports all
options available in |
seed |
Optional seed for reproducibility. |
parallel |
If a parallel backend is loaded and available, should the function use it? Highly advisable if hardware permits. |
cc_test provides a hypothesis testing framework for consensus clustering. It takes an input matrix dat and generates B null datasets with similar properties but no sample-wise cluster structure. The consensus clustering algorithm is then run on each simulated matrix, with PAC scores stored for reference. The function then consensus clusters the actual input data, and PAC scores for each cluster number k are tested against their empirically estimated null distribution.
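The comparison between an observed PAC score and its null distribution can be sketched as follows. This is a minimal illustration in Python, not the package's R implementation: the function name compare_to_null, the choice of the null standard deviation as the standard error, the sign of the z-score, and the add-one empirical p-value are all assumptions made for this example. Lower PAC indicates more stable clustering, so null scores at or below the observed score count as at least as extreme.

```python
import statistics

def compare_to_null(observed_pac, null_pacs):
    """Compare one observed PAC score with PAC scores from B null datasets.

    Hypothetical helper: the exact statistics used by cc_test may differ.
    """
    b = len(null_pacs)
    expected = statistics.fmean(null_pacs)   # expected PAC under the null
    std_err = statistics.stdev(null_pacs)    # spread of the null PAC scores
    # Positive z means the observed data cluster more stably than the null.
    z = (expected - observed_pac) / std_err if std_err > 0 else float("nan")
    # Empirical one-sided p-value with an add-one correction (assumption).
    n_extreme = sum(1 for p in null_pacs if p <= observed_pac)
    p_value = (n_extreme + 1) / (b + 1)
    return {"observed_pac": observed_pac, "expected_pac": expected,
            "z": z, "std_err": std_err, "p_value": p_value}

# Toy numbers for a single k: five null PAC scores and one observed score.
result = compare_to_null(0.10, [0.30, 0.32, 0.28, 0.31, 0.29])
print(result)
```

With only five null datasets the smallest attainable p-value here is 1/6, which shows why B needs to be reasonably large for the test to have any resolution.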
cc_test currently supports five methods for generating null datasets from a given input matrix:
"pc_norm" simulates the principal components by taking random draws from a normal distribution with variance equal to the true eigenvalues. Data are subsequently back-transformed to their original dimensions by cross-multiplication with the true eigenvector matrix.
"pc_unif" simulates the principal components by taking random draws from a uniform distribution with ranges equal to those of the true principal components. Data are subsequently back-transformed to their original dimensions by cross-multiplication with the true eigenvector matrix.
"cholesky" simulates random Gaussian noise around the nearest positive-definite approximation to dat's feature-wise covariance matrix.
"range" selects random values uniformly from each feature's observed range.
"permute" shuffles each feature's observed values.
The first two options use the data's true eigenvectors to preserve feature-wise covariance while scrambling sample-wise covariance. "pc_norm" tends to generate the most realistic null data, while Monte Carlo replicates generated via "pc_unif" converge more quickly to a true k of 1. Both methods are fast and stable when features outnumber samples. When samples outnumber features, ref_method defaults to "cholesky", which takes longer to compute, but is better suited for such cases. "range" and "permute" are included for convenience, but are not recommended since they do not preserve feature-wise covariance, which may bias results.
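The two simplest reference methods, "range" and "permute", can be sketched in a few lines. The following is an illustrative Python sketch rather than the package's R code; the helper names null_permute and null_range are hypothetical, and the matrix is assumed to be stored as a list of rows with one row per feature (probe) and one column per sample. Because each feature is handled independently, neither function retains the covariance between features, which is the limitation noted above.

```python
import random

def null_permute(dat, rng):
    """Shuffle each feature's (row's) observed values independently."""
    return [rng.sample(row, len(row)) for row in dat]

def null_range(dat, rng):
    """Draw uniform values from each feature's (row's) observed range."""
    return [[rng.uniform(min(row), max(row)) for _ in row] for row in dat]

rng = random.Random(1)  # seeded generator for reproducibility

# Toy probe-by-sample matrix: 2 features, 4 samples.
dat = [[1.0, 2.0, 3.0, 4.0],
       [10.0, 20.0, 30.0, 40.0]]

permuted = null_permute(dat, rng)  # same values per feature, new sample order
ranged = null_range(dat, rng)      # new values within each feature's range
```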
PAC stands for proportion of ambiguous clustering. To calculate the PAC for a given cluster number k, we first compute the consensus matrix via consensus clustering; then generate the empirical CDF curve for the lower triangle of that matrix; find CDF-values for the upper and lower bounds of the PAC window; and subtract the latter value from the former. Since the consensus matrix for a perfectly stable cluster would consist of just 1's and 0's, the ideal CDF curve is flat in the middle. The goal is therefore to minimize the PAC. See Senbabaoglu et al. (2014) for more details.
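The PAC calculation described above can be written out directly. This is a minimal Python sketch under stated assumptions, not the package's own code: the function name pac_score is hypothetical, the consensus matrix is assumed to be a square list of lists, and the window (0.1, 0.9) is used only as an illustrative value for pac_window.

```python
def pac_score(consensus, window=(0.1, 0.9)):
    """Proportion of ambiguous clustering for one consensus matrix."""
    lower, upper = window
    n = len(consensus)
    # Lower triangle, excluding the diagonal.
    values = [consensus[i][j] for i in range(n) for j in range(i)]

    def ecdf(x):
        # Empirical CDF: fraction of consensus values less than or equal to x.
        return sum(1 for v in values if v <= x) / len(values)

    return ecdf(upper) - ecdf(lower)

# Perfectly stable clustering: every pair always or never co-clusters.
stable = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]

# Maximally ambiguous clustering: every pair co-clusters half the time.
ambiguous = [[1.0, 0.5, 0.5, 0.5],
             [0.5, 1.0, 0.5, 0.5],
             [0.5, 0.5, 1.0, 0.5],
             [0.5, 0.5, 0.5, 1.0]]

print(pac_score(stable))     # 0.0
print(pac_score(ambiguous))  # 1.0
```

The stable matrix has a CDF that is flat across the window, so its PAC is 0, while the ambiguous matrix places all of its mass inside the window and reaches the maximum PAC of 1.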
A list with max_k elements. If null = TRUE, then the first item is a results data frame with columns for cluster number k, observed PAC score, expected PAC score, z-stability, standard error, and p-value. Adjusted p-values are also returned if p_adj is non-NULL.
If null = FALSE, then the first item is a results data frame with columns for cluster number k and PAC score.
Elements two through max_k are lists corresponding to the unique values of k, each containing the following three elements: the consensus matrix, tree, and cluster assignments for that k, as determined by consensus clustering.
Monti, S., Tamayo, P., Mesirov, J., & Golub, T. (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52: 91-118.
Senbabaoglu, Y., Michailidis, G. & Li, J.Z. (2014). Critical limitations of consensus clustering in class discovery. Scientific Reports, 4:6207.