| bayesCC | R Documentation |
Bayesian Consensus Clustering
bayesCC(
X,
K = 2,
a = 1,
b = 1,
IndivAlpha = FALSE,
mu0 = list(),
a0 = list(),
b0 = list(),
Concentration = 1,
maxiter = 1000,
...
)
| X | a list of data matrices, each with D_i rows and N columns (one column per subject) |
| K | integer, maximum number of clusters for K-means |
| a | numeric, hyperparameter for Alpha ~ Beta(a, b) |
| b | numeric, hyperparameter for Alpha ~ Beta(a, b) |
| IndivAlpha | logical, whether to estimate a separate adherence Alpha for each data source |
| mu0 | list of initial mean parameters for the Normal-Gamma |
| a0 | list of initial shape parameters for the Normal-Gamma |
| b0 | list of initial rate parameters for the Normal-Gamma |
| Concentration | initial concentration parameter for the Dirichlet process |
| maxiter | integer, number of MCMC sampler iterations to run |
Reference: Lock EF and Dunson DB, "Bayesian Consensus Clustering", Bioinformatics, 29(20), 2013.
The output of bayesCC(...) has several pieces:
Alpha: the average adherence (by data source, if IndivAlpha = TRUE).
AlphaBounds: the 95 percent credible interval for Alpha.
Cbest: the "hard" overall clustering, as a binary matrix.
Lbest: a list of the separate clusterings, one per data source.
AlphaVec: a vector of Alpha values over MCMC draws, for assessing mixing.
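Since Cbest is described as a binary matrix, it is often handy to collapse it into a single vector of cluster labels. The sketch below uses a small hand-made indicator matrix rather than real bayesCC output; the name Cbest_toy and the orientation (rows are subjects, columns are clusters) are assumptions for illustration, so transpose first if the actual matrix is laid out the other way.

```r
# Hypothetical 5-subject, 3-cluster indicator matrix.
# Assumption: rows = subjects, columns = clusters.
Cbest_toy <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(0, 1, 0),
  c(0, 0, 1),
  c(1, 0, 0)
)

# A "hard" clustering assigns each subject to exactly one cluster.
stopifnot(all(rowSums(Cbest_toy) == 1))

# max.col() returns, for each row, the column index of the largest entry.
labels <- max.col(Cbest_toy, ties.method = "first")

# Cluster sizes.
cluster_sizes <- table(labels)
```

The same conversion can be applied to each element of Lbest with lapply(), provided those matrices share the assumed layout.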
Data matrices in X should have the same number of columns (one per subject), but may have different numbers of rows. If a subject is missing for a data source, a nice improvement would be to marginalize over the remaining columns, perhaps after determining their overall cluster membership(s). If a row is missing for a data source, k-NN imputation should suffice.
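A quick check of the input before a long MCMC run can catch shape problems early. The following is a minimal sketch in base R, not part of the package: check_sources() and the toy list X_toy are hypothetical names, and the check covers only the column-count requirement stated above.

```r
set.seed(1)
N <- 6  # number of subjects (columns) shared by every data source

# Toy input: two data sources with different numbers of rows.
X_toy <- list(
  expr = matrix(rnorm(4 * N), nrow = 4, ncol = N),
  meth = matrix(rnorm(3 * N), nrow = 3, ncol = N)
)

# Hypothetical helper: verify that X is a list of matrices with equal ncol.
check_sources <- function(X) {
  stopifnot(is.list(X), length(X) > 0,
            all(vapply(X, is.matrix, logical(1))))
  n_cols <- vapply(X, ncol, integer(1))
  if (length(unique(n_cols)) != 1L) {
    stop("data sources have differing numbers of columns: ",
         paste(n_cols, collapse = ", "))
  }
  invisible(n_cols[[1]])
}

n_subjects <- check_sources(X_toy)

# A source with the wrong number of columns should be rejected.
X_bad <- c(X_toy, list(cnv = matrix(0, nrow = 2, ncol = N + 1)))
bad_result <- tryCatch(check_sources(X_bad), error = function(e) "error")
```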
It would be nice to parallelize the runs over all candidate values for K. Similarly, PAM or NMF can be more robust than K-means in some situations. Expect the next point release of the package to support either or both.
Note that the first (maxiter / 2) iterations are used as burn-in for MCMC.
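To make the burn-in convention concrete, the sketch below discards the first half of a chain and summarizes the rest with a mean and a 95 percent quantile interval. The draws are simulated stand-ins, not bayesCC output. Whether AlphaVec itself still contains the burn-in draws, and whether Alpha and AlphaBounds are computed exactly this way, is not stated here and should be treated as an assumption.

```r
set.seed(42)
maxiter <- 1000

# Simulated stand-in for a chain of sampled adherence values in (0, 1).
draws <- rbeta(maxiter, shape1 = 8, shape2 = 2)

# Discard the first maxiter / 2 iterations as burn-in.
burn_in <- floor(maxiter / 2)
kept <- draws[(burn_in + 1):maxiter]

# Posterior summaries from the retained draws.
alpha_hat <- mean(kept)
alpha_bounds <- quantile(kept, probs = c(0.025, 0.975))

# A trace plot of the retained draws is a simple way to eyeball mixing:
# plot(kept, type = "l", xlab = "iteration (post burn-in)", ylab = "alpha")
```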
Implementation details, more extensive than those in the Bioinformatics paper, are given in the PDF at http://www.tc.umn.edu/~elock/software/BCC.pdf
FIXME (maybe): Might be nice to use PAM and/or NMF clustering instead of K-means.
FIXME (maybe): See if it's possible to do matrix completion aided by cluster assignments, for the case when entire columns are NA or mostly-NA (*cough* TARGET *cough*)
a list with elements (Alpha, AlphaBounds, Cbest, Lbest, AlphaVec)
alphaStar
## Not run:
library(parallel)  # provides mclapply()
# try a few candidate values of K
Ks <- 2:5
names(Ks) <- paste0("K", Ks)
# can take a while...
data(BRCAData)
runK <- function(k) bayesCC(BRCAData, K = k, IndivAlpha = TRUE, maxiter = 10000)
Results <- mclapply(Ks, runK)
# ?alphaStar
alphaStarDist <- data.frame(lapply(Results, alphaStar))
boxplot(alphaStarDist, main="Mean-adjusted adherence by K (optimal: K=3)")
## End(Not run)