bayesCC: Bayesian Consensus Clustering

bayesCC R Documentation

Bayesian Consensus Clustering

Description

Bayesian Consensus Clustering

Usage

bayesCC(
  X,
  K = 2,
  a = 1,
  b = 1,
  IndivAlpha = FALSE,
  mu0 = list(),
  a0 = list(),
  b0 = list(),
  Concentration = 1,
  maxiter = 1000,
  ...
)

Arguments

X

a list of data matrices, one per data source; matrix i has D_i rows and N columns (one column per subject).

K

integer, maximum number of clusters for K-means

a

numeric, hyperparameter for Alpha ~ Beta(a, b)

b

numeric, hyperparameter for Alpha ~ Beta(a, b)

IndivAlpha

logical, whether to fit individual random effects, i.e. a separate Alpha (adherence) for each data source

mu0

list of initial mean parameters for the Normal-Gamma

a0

list of initial shape parameters for the Normal-Gamma

b0

list of initial rate parameters for the Normal-Gamma

Concentration

initial concentration parameter for Dirichlet process

maxiter

integer, number of MCMC sampler iterations to run

Details

Reference: Lock EF and Dunson DB, "Bayesian Consensus Clustering", Bioinformatics, 29(20), 2013.

The output of bayesCC(...) has several pieces:

  • Alpha. the average adherence (one value per data source if IndivAlpha = TRUE).

  • AlphaBounds. the 95 percent credible interval for Alpha.

  • Cbest. the "hard" overall clustering, as a binary matrix.

  • Lbest. a list of the separate clusterings by data source.

  • AlphaVec. a vector of alpha values over MCMC draws to assess mixing.
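
Since Cbest is described as a binary matrix, a common follow-up step is turning it into one cluster label per subject. The sketch below does this on a toy matrix; Cbest_toy is made up for illustration, and the orientation (subjects in rows, clusters in columns) is an assumption that should be checked against the actual output.

```r
# Toy stand-in for Cbest: 5 subjects (rows) x 3 clusters (columns),
# exactly one 1 per row. The orientation is an assumption; transpose
# first if the real matrix has clusters in rows.
Cbest_toy <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(0, 1, 0),
  c(0, 0, 1),
  c(1, 0, 0)
)

# The column holding the 1 in each row is that subject's cluster label.
labels <- max.col(Cbest_toy, ties.method = "first")

# Number of subjects assigned to each cluster.
sizes <- table(labels)
```

The same call could be applied to each element of Lbest if those are stored in the same binary layout, which is not stated here.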

Data matrices in X must all have the same number of columns (one per subject), but may have different numbers of rows. Subjects that are missing for a data source are not handled specially; a nice improvement would be to marginalize over the remaining columns, perhaps after determining their overall cluster membership(s). If a row is missing for a data source, k-NN imputation should suffice.
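
The column requirement above can be checked before starting a long run. The following is a minimal sketch in base R; check_sources and X_toy are hypothetical names used only for illustration and are not part of the package.

```r
# Hypothetical helper (not part of bayesCC): verify that every data source
# is a matrix and that all sources have the same number of columns.
check_sources <- function(X) {
  stopifnot(is.list(X), length(X) > 0)
  if (!all(vapply(X, is.matrix, logical(1)))) {
    stop("every element of X must be a matrix")
  }
  n_subjects <- vapply(X, ncol, integer(1))
  if (length(unique(n_subjects)) != 1L) {
    stop("all data sources must have the same number of columns (subjects)")
  }
  src <- if (is.null(names(X))) as.character(seq_along(X)) else names(X)
  # One summary row per data source, including a count of missing values.
  data.frame(
    source    = src,
    rows      = vapply(X, nrow, integer(1)),
    cols      = n_subjects,
    n_missing = vapply(X, function(m) sum(is.na(m)), integer(1)),
    row.names = NULL
  )
}

# Simulated toy input: two sources, 6 subjects each, different row counts.
set.seed(42)
X_toy <- list(
  sourceA = matrix(rnorm(20 * 6), nrow = 20, ncol = 6),
  sourceB = matrix(rnorm(8 * 6),  nrow = 8,  ncol = 6)
)
src_summary <- check_sources(X_toy)
```

A nonzero n_missing for a source flags where imputation (for example k-NN, as suggested above) would be needed before clustering.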

It would be nice to parallelize the runs over all candidate values for K. Similarly, PAM or NMF can be more robust than K-means in some situations. Expect the next point release of the package to support either or both.

Note that the first (maxiter / 2) iterations are used as burn-in for MCMC.
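
To make the burn-in arithmetic concrete, here is a small sketch on a simulated chain. The chain is a stand-in drawn independently from a Beta distribution, not real sampler output, and the summaries shown are one conventional way to obtain a posterior mean and a 95 percent interval; whether bayesCC computes Alpha and AlphaBounds exactly this way, or whether the returned AlphaVec already excludes burn-in, is not stated here.

```r
set.seed(1)
maxiter <- 1000

# Simulated stand-in for a chain of alpha draws (assumption: values in [0, 1]).
chain <- rbeta(maxiter, 8, 2)

# Discard the first maxiter / 2 draws as burn-in and keep the rest.
burn_in <- floor(maxiter / 2)
kept <- chain[(burn_in + 1):maxiter]

# Posterior summaries from the retained draws.
alpha_mean   <- mean(kept)
alpha_bounds <- quantile(kept, probs = c(0.025, 0.975))

# A trace plot of the retained draws is a quick visual check of mixing:
# plot(kept, type = "l", xlab = "retained iteration", ylab = "alpha")
```

With maxiter = 1000 this keeps 500 draws, so a larger maxiter directly increases the number of draws behind the reported summaries.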

Implementation details are given in the PDF at http://www.tc.umn.edu/~elock/software/BCC.pdf, which is more extensive than the Bioinformatics paper.

FIXME (maybe): Might be nice to use PAM and/or NMF clustering instead of K-means.

FIXME (maybe): See if it's possible to do matrix completion aided by cluster assignments, for the case when entire columns are NA or mostly-NA (*cough* TARGET *cough*)

Value

a list with elements (Alpha, AlphaBounds, Cbest, Lbest, AlphaVec)

See Also

alphaStar

Examples


## Not run:  

  library(parallel)  # provides mclapply

  # try a few values of K
  Ks <- 2:5
  names(Ks) <- paste0("K", Ks)

  # can take a while...
  data(BRCAData)
  runK <- function(k) bayesCC(BRCAData, K=k, IndivAlpha=TRUE, maxiter=10000)
  Results <- mclapply(Ks, runK)

  # ?alphaStar
  alphaStarDist <- data.frame(lapply(Results, alphaStar))
  boxplot(alphaStarDist, main="Mean-adjusted adherence by K (optimal: K=3)")


## End(Not run)


ttriche/bayesCC documentation built on May 13, 2023, 11:48 a.m.