bayesCC: Bayesian Consensus Clustering


bayesCC

R Documentation

Bayesian Consensus Clustering

Description

Bayesian Consensus Clustering

Usage

bayesCC(
  X,
  K = 2,
  a = 1,
  b = 1,
  IndivAlpha = FALSE,
  mu0 = list(),
  a0 = list(),
  b0 = list(),
  Concentration = 1,
  maxiter = 1000,
  ...
)

Arguments

`X`	a list of data matrices, one per data source; matrix i has D_i rows and N columns (one column per subject).
`K`	integer, maximum number of clusters for K-means
`a`	numeric, hyperparameter for Alpha ~ Beta(a, b)
`b`	numeric, hyperparameter for Alpha ~ Beta(a, b)
`IndivAlpha`	logical, whether to estimate a separate adherence (Alpha) for each data source
`mu0`	list of initial mean parameters for the Normal-Gamma
`a0`	list of initial shape parameters for the Normal-Gamma
`b0`	list of initial rate parameters for the Normal-Gamma
`Concentration`	initial concentration parameter for Dirichlet process
`maxiter`	how many iterations of the MCMC sampler should be run?
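
As a rough illustration of what the hyperparameters `a` and `b` control, the sketch below evaluates the Beta(a, b) density for the default values and for one alternative. It uses only base R and is not taken from the package source; this page states only that Alpha ~ Beta(a, b), so any further constraint the sampler may place on Alpha is not reflected here.

```r
# Illustrative only: shape of a Beta(a, b) prior for two choices of (a, b).
grid <- seq(0.05, 0.95, by = 0.05)

# Defaults a = 1, b = 1: the density is constant, i.e. a flat prior.
flat <- dbeta(grid, shape1 = 1, shape2 = 1)

# a > b moves prior mass toward high adherence.
high <- dbeta(grid, shape1 = 5, shape2 = 1)

# The mean of a Beta(a, b) distribution is a / (a + b).
prior_mean <- function(a, b) a / (a + b)
prior_mean(1, 1)
prior_mean(5, 1)
```

Plotting `flat` and `high` against `grid` shows how raising `a` relative to `b` favors source-specific clusterings that agree with the overall clustering.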

Details

Reference: Lock EF and Dunson DB, "Bayesian Consensus Clustering", Bioinformatics, 29(20), 2013.

The output of bayesCC(...) has several pieces:

Alpha. the average adherence (one value per data source if IndivAlpha = TRUE).
AlphaBounds. the 95 percent credible interval for Alpha.
Cbest. the "hard" overall clustering, as a binary matrix.
Lbest. a list of the separate clusterings by data source.
AlphaVec. a vector of alpha values over MCMC draws to assess mixing.
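
Because Cbest is a binary indicator matrix, it is often convenient to turn it into a vector of cluster labels. The sketch below does this for a small hand-made matrix. It assumes one row per subject and one column per cluster, which this page does not state; transpose the matrix first if the orientation differs. `Cbest_toy` and `cluster_id` are illustrative names, not package objects.

```r
# Hypothetical stand-in for Cbest: 5 subjects (rows) x 3 clusters (columns).
# The row/column orientation is an assumption.
Cbest_toy <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(0, 1, 0),
  c(0, 0, 1),
  c(1, 0, 0)
)

# A hard clustering assigns each subject to exactly one cluster.
stopifnot(all(rowSums(Cbest_toy) == 1))

# Column index of the 1 in each row gives the integer cluster label.
cluster_id <- max.col(Cbest_toy, ties.method = "first")
cluster_id

# Cluster sizes.
table(cluster_id)
```

The same conversion could be applied to each element of Lbest, under the same assumption about orientation.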

Data matrices in X should all have the same number of columns (one per subject), but may have different numbers of rows. If a subject is missing from a data source, a possible future improvement would be to marginalize over the remaining columns, perhaps after determining their overall cluster membership(s). If a row has missing values in a data source, k-NN imputation should suffice.
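
The layout described above can be checked before starting a long MCMC run. The following is a minimal sketch using simulated data; `X_toy` and `check_sources()` are hypothetical names introduced for illustration and are not part of bayesCC.

```r
# Toy stand-in for X: two data sources measured on the same N = 6 subjects.
set.seed(1)
N <- 6
X_toy <- list(
  expression  = matrix(rnorm(4 * N), nrow = 4, ncol = N),  # D_1 = 4 rows
  methylation = matrix(rnorm(3 * N), nrow = 3, ncol = N)   # D_2 = 3 rows
)

# Hypothetical helper (not part of bayesCC): validate the layout of X.
check_sources <- function(X) {
  stopifnot(is.list(X), length(X) >= 1)
  stopifnot(all(vapply(X, is.matrix, logical(1))))
  n_cols <- vapply(X, ncol, integer(1))
  if (length(unique(n_cols)) != 1) {
    stop("all data sources must have the same number of columns (subjects)")
  }
  # Flag subjects (columns) that are entirely NA within a source.
  all_na <- lapply(X, function(m) which(colSums(!is.na(m)) == 0))
  list(N = unname(n_cols[1]),
       D = vapply(X, nrow, integer(1)),
       all_na_cols = all_na)
}

info <- check_sources(X_toy)
info$N  # shared number of subjects
info$D  # rows per data source
```

Columns reported in `all_na_cols` correspond to the "subject missing for a data source" case, which would need to be resolved by other means before fitting.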

It would be nice to parallelize the runs over all candidate values for K. Similarly, PAM or NMF can be more robust than K-means in some situations. Expect the next point release of the package to support either or both.

Note that the first (maxiter / 2) iterations are used as burn-in for MCMC.
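
To make the burn-in rule concrete, the sketch below discards the first half of a simulated chain and summarizes the rest. The chain is generated with `rnorm()` and is not real bayesCC output. Whether the returned AlphaVec already excludes burn-in draws, and exactly how AlphaBounds is computed, is not stated on this page, so the quantile-based interval here is an assumption.

```r
# Simulated stand-in for a chain of Alpha draws (not real bayesCC output).
set.seed(42)
maxiter <- 1000
draws <- c(
  seq(0.5, 0.8, length.out = maxiter / 2),    # drifting early phase
  rnorm(maxiter / 2, mean = 0.8, sd = 0.02)   # roughly stationary phase
)

# Discard the first maxiter / 2 draws as burn-in.
burn_in <- floor(maxiter / 2)
kept <- draws[(burn_in + 1):maxiter]

# Summaries from the retained draws: posterior mean and a 95% interval.
alpha_mean <- mean(kept)
alpha_ci <- quantile(kept, probs = c(0.025, 0.975))
alpha_mean
alpha_ci
```

A trace plot such as `plot(draws, type = "l")` makes the early drift visible and is one way to judge whether half of the iterations is enough burn-in for a given run.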

Implementation details are given in the PDF at http://www.tc.umn.edu/~elock/software/BCC.pdf, which is more extensive than the Bioinformatics paper.

FIXME (maybe): Might be nice to use PAM and/or NMF clustering instead of K-means.

FIXME (maybe): See if it's possible to do matrix completion aided by cluster assignments, for the case when entire columns are NA or mostly-NA (*cough* TARGET *cough*)

Value

a list with elements (Alpha, AlphaBounds, Cbest, Lbest, AlphaVec)

Examples


## Not run:  

  library(parallel)  # provides mclapply()
  library(bayesCC)

  # candidate numbers of clusters to try
  Ks <- 2:5
  names(Ks) <- paste0("K", Ks)

  # can take a while...
  # mclapply() forks workers on Unix-alikes; use lapply() on Windows
  data(BRCAData)
  runK <- function(k) bayesCC(BRCAData, K = k, IndivAlpha = TRUE, maxiter = 10000)
  Results <- mclapply(Ks, runK)

  # see ?alphaStar
  alphaStarDist <- data.frame(lapply(Results, alphaStar))
  boxplot(alphaStarDist, main = "Mean-adjusted adherence by K (optimal: K=3)")


## End(Not run)

ttriche/bayesCC documentation built on May 13, 2023, 11:48 a.m.