
Fully Bayesian inference for estimating the number of clusters and related parameters in heterogeneous binary data.

This package can be used to cluster multivariate binary data (NAs are allowed). The main function of the package is `coupledMetropolis`.

The input is an *n \times d* binary array, where *n* and *d* denote the number of observations and the dimension of the data, respectively. The underlying model is a mixture of independent multivariate Bernoulli distributions with an unknown number of components:

*x_i\sim∑_{k=1}^{K}π_k∏_{j=1}^{d}f(x_{ij};θ_{kj}),*

with *x_i = (x_{i1},…,x_{id})*; *d>1*, independent for *i = 1,…,n*. The term *f(x_{ij};θ_{kj})* denotes the probability mass function of the Bernoulli distribution with parameter *θ_{kj}\in(0,1)*. The number of clusters *K* is a random variable with support *\{1,…,K_{\mbox{max}}\}*, where *K_{\mbox{max}}* is an upper bound for the number of clusters. The model uses the following prior assumptions:

*K\sim \mbox{discrete}\{1,…,K_{\mbox{max}}\}*

*(π_1,…,π_K)|K \sim \mbox{Dirichlet}(γ,…,γ)*

*θ_{kj}|K \sim \mbox{Beta}(α,β);\quad \mbox{independent for}\quad k = 1,…,K; j =1,…,d.*

The discrete prior distribution on the number of clusters can be either a truncated Poisson(1) or a Uniform distribution. The model uses data augmentation by also considering the (latent) allocation variable *z_i*, which a priori assigns observation *i* to cluster *k = 1,…,K* with probability *P(z_i = k|K, π_1,…,π_K) = π_k*, independently for *i=1,…,n*.
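The generative model and priors above can be sketched in base R. This is a minimal illustration and not code from the package: the function names `simulate_binary_mixture` and `mixture_loglik`, their arguments, and the default hyperparameter values are all assumptions made for this example. The log-likelihood treats missing entries as contributing a factor of one, which is also an assumption of the sketch.

```r
# Illustrative sketch only; none of these names belong to the package.
set.seed(1)

simulate_binary_mixture <- function(n, d, Kmax, alpha = 1, beta = 1, gamma = 1,
                                    prior = c("uniform", "poisson")) {
  prior <- match.arg(prior)
  # Prior on K: Uniform or Poisson(1), truncated to {1, ..., Kmax}
  w <- if (prior == "uniform") rep(1, Kmax) else dpois(1:Kmax, lambda = 1)
  K <- sample.int(Kmax, size = 1, prob = w / sum(w))
  # (pi_1, ..., pi_K) | K ~ Dirichlet(gamma, ..., gamma), via normalised gammas
  g <- rgamma(K, shape = gamma, rate = 1)
  p <- g / sum(g)
  # theta_kj | K ~ Beta(alpha, beta), independently over k and j
  theta <- matrix(rbeta(K * d, alpha, beta), nrow = K, ncol = d)
  # Latent allocations with P(z_i = k) = pi_k
  z <- sample.int(K, size = n, replace = TRUE, prob = p)
  # x_ij | z_i = k ~ Bernoulli(theta_kj)
  x <- matrix(rbinom(n * d, size = 1, prob = theta[z, , drop = FALSE]),
              nrow = n, ncol = d)
  list(K = K, p = p, theta = theta, z = z, x = x)
}

mixture_loglik <- function(x, p, theta) {
  obs <- !is.na(x)               # TRUE where x_ij is observed
  x0 <- ifelse(obs, x, 0)        # missing entries replaced by 0
  # log_comp[i, k] = sum_j log f(x_ij; theta_kj) over observed j
  log_comp <- x0 %*% t(log(theta)) + (obs - x0) %*% t(log1p(-theta))
  a <- sweep(log_comp, 2, log(p), "+")   # add log mixing weights
  m <- apply(a, 1, max)                  # log-sum-exp over components
  sum(m + log(rowSums(exp(a - m))))
}

sim <- simulate_binary_mixture(n = 100, d = 6, Kmax = 5)
loglik <- mixture_loglik(sim$x, sim$p, sim$theta)
```

Permuting the component labels of `p` and `theta` together leaves `loglik` unchanged, which is the invariance behind the label switching problem.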

In order to infer the parameters of the model, a Markov chain Monte Carlo (MCMC) approach is adopted. Given *K*, the component-specific parameters *π_k* and *θ_{kj}* are integrated out and a collapsed allocation sampler which also updates the number of clusters (Nobile and Fearnside, 2007) is implemented. In the case that the observed data contains missing values, the algorithm simulates their values from the corresponding full conditional distribution. In order to improve the mixing of the simulated chain, a Metropolis-coupled MCMC sampler (Altekar et al., 2004) is incorporated. In particular, various heated chains are run in parallel and swaps are proposed between pairs of chains. The number of chains should be equal to the number of available cores. Each chain runs in parallel using the packages `foreach` and `doParallel`.
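To make the two ingredients of this paragraph concrete, the following base R sketch evaluates the collapsed log-density of the allocations and data, with *π_k* and *θ_{kj}* integrated out under the Dirichlet and Beta priors, and the standard log acceptance ratio for swapping the states of two heated chains. It is not the package's implementation: the names `log_collapsed` and `swap_log_ratio`, the heat value 0.8, and the toy data are assumptions. It also assumes complete data (no NAs) and a Uniform prior on *K*, so the prior on *K* contributes only a constant.

```r
# Illustrative sketch only; names and values below are not taken from the package.

# log p(z, x | K) after integrating out the mixing weights and Bernoulli parameters
log_collapsed <- function(z, x, K, alpha = 1, beta = 1, gamma = 1) {
  n <- nrow(x)
  Z <- outer(z, seq_len(K), "==") * 1   # n x K allocation indicators
  nk <- colSums(Z)                      # cluster sizes n_k (may be zero)
  s <- t(Z) %*% x                       # s[k, j]: ones in column j within cluster k
  # Dirichlet-multinomial term for the allocations
  log_alloc <- lgamma(K * gamma) - lgamma(n + K * gamma) +
    sum(lgamma(nk + gamma) - lgamma(gamma))
  # Beta-Bernoulli term for the data, one factor per (k, j)
  log_data <- sum(lbeta(alpha + s, beta + nk - s) - lbeta(alpha, beta))
  log_alloc + log_data
}

# Chain i targets the posterior raised to the power heat_i. Swapping the states
# of chains i and j is accepted with probability min(1, exp(swap_log_ratio)).
swap_log_ratio <- function(logpost_i, logpost_j, heat_i, heat_j) {
  (heat_i - heat_j) * (logpost_j - logpost_i)
}

set.seed(2)
n <- 40
d <- 5
x <- matrix(rbinom(n * d, size = 1, prob = 0.5), nrow = n, ncol = d)  # toy data
z_cold <- sample.int(2, n, replace = TRUE)  # state of the chain with heat 1
z_hot  <- sample.int(3, n, replace = TRUE)  # state of a heated chain
lp_cold <- log_collapsed(z_cold, x, K = 2)
lp_hot  <- log_collapsed(z_hot, x, K = 3)
acc <- min(1, exp(swap_log_ratio(lp_cold, lp_hot, heat_i = 1, heat_j = 0.8)))
```

Because the heated chain targets a flattened version of the posterior, it moves more freely between modes, and accepted swaps pass those moves to the chain with heat 1, whose draws are the ones used for inference.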

After inferring the most probable number of clusters, the simulated parameters which correspond to this specific value of *K* are post-processed in order to undo the label switching problem. For this purpose the `label.switching` package (Papastamoulis, 2016; see also Papastamoulis and Iliopoulos 2010, 2013 and Papastamoulis, 2014) is used.

Panagiotis Papastamoulis

Maintainer: Panagiotis Papastamoulis

Altekar G., Dwarkadas S., Huelsenbeck J.P. and Ronquist F. (2004). Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3): 407-415.

Nobile A. and Fearnside A. (2007). Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17(2): 147-162.

Papastamoulis P. and Iliopoulos G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19: 313-331.

Papastamoulis P. and Iliopoulos G. (2013). On the convergence rate of Random Permutation Sampler and ECR algorithm in missing data models. Methodology and Computing in Applied Probability, 15(2): 293-304.

Papastamoulis P. (2014). Handling the label switching problem in latent class models via the ECR algorithm. Communications in Statistics, Simulation and Computation, 43(4): 913-927.

Papastamoulis P. (2016). label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69(1): 1-24.
