Dealing with label switching: relabelling in Bayesian mixture models by pivotal units"
In pivmet: Pivotal Methods for Bayesian Relabelling and k-Means Clustering

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this vignette we explore the fit of Gaussian mixture models and the relabelling pivotal method proposed in @egidi2018relabelling through the pivmet package. First of all, we load the package:

library(pivmet)
library(bayesmix)
library(bayesplot)

The pivmet \code{R} package provides a simple framework to (i) fit univariate and multivariate mixture models according to a Bayesian flavour and detect the pivotal units, via the piv_MCMC function; (ii) perform the relabelling step via the piv_rel function.

There are two main functions used for this task.

The function piv_MCMC:

performs MCMC sampling for Gaussian mixture models using the underlying rjags or rstan packages (chosen by the users through the optional function argument software, by default set to rjags). Precisely the package uses: the JAGSrun function of the bayesmix package for univariate mixture models; the run.jags function of the runjags package for bivariate mixture models; the stan function of the rstan package for both univariate and bivariate mixture models.
Post-processes the chains and randomly switches their values [@fruhwirth2001markov].
Builds a co-association matrix for the $N$ units. After $H$ MCMC iterations, the function implements a clustering procedure with $k$ groups, the user may choose among agglomerative or divisive hierarchical clustering through the optional argument clustering. Using the latent formulation for mixture models, we denote with $[z_i]h$ the group allocation of the $i$-th unit at the $h$-th iteration. In such a way, a co-association matrix $C$ for the $N$ statistical units is built, where the single cell $c{ip}$ is the fraction of times the unit $i$ and the unit $p$ belong to the same group along the $H$ MCMC iterations:

$$c_{ip} = \frac{n_{ip}}{H}=\frac{1}{H} \sum_{h=1}^{H}|[z_i]h=[z_p]_h|,$$ where $|\cdot|$ denotes the event indicator and $n{ip}$ is the number of times the units $i, \ p$ belong to the same group along the sampling.

Extracts the pivots, one for each group, which are (pairwise) separated units with (posterior) probability one (that is, the posterior probability of any two of them being in the same group is approximately zero). We denote them by $i_1,\ldots,i_k$. The user may choose among four procedures for extracting the pivotal units with the optional argument piv.criterion. For group $j$ containing $J_j$ units, one can choose:
$i^{*}$ that maximizes $\sum_{p \in J_j}c_{ip}$ ("maxsumint", default method);
$i^{*}$ that maximizes $\sum_{p \in J_j}c_{ip}- \sum_{p \not\in J_j}c_{ip}$ ("maxsumdiff");
$i^{*}$ that minimizes $\sum_{p \not\in J_j}c_{ip}$ ("minsumnoint").

These three methods are applied by the internal function piv_sel. Alternatively, when $k <5$, the user can set piv.criterion="MUS" [@egidi2018mus] which performs a sequential search of identity submatrices within the matrix $C$ via the internal function MUS.

The function piv_rel:

performs the relabelling algorithm using the $k$ pivotal units as groups identifiers. The pivotal units previously detected play a central role, yielding to the following relabelling for the $h$-th iteration:

\begin{align} [\mu_{j}]h=&[\mu{z_{i_{j}}}]h \ [z{i}]h =j & \mbox{ for } i : [z_i]_h=[z{i_{j}}]_h,\ \end{align} where $\boldsymbol{\mu}=(\mu_1,\mu_2,\ldots,\mu_k)$ is the vector of the means parameters and $\boldsymbol{z}=(z_1,z_2,\ldots,z_N)$ an i.i.d. vector of latent variables taking values in ${1,2,\ldots,k }$ and denoting the group membership of each statistical unit.

piv_rel takes as input the MCMC output from piv_MCMC and returns the relabelled chains and the corresponding posterior estimates.

Example: bivariate Gaussian data

Suppose now that $\boldsymbol{y}_i \in \mathbb{R}^2$ and assume that:

$$\boldsymbol{y}i \sim \sum{j=1}^{k}\eta_{j}\mathcal{N}{2}(\boldsymbol{\mu}{j}, \boldsymbol{\Sigma})$$ where $\boldsymbol{\mu}j$ is the mean vector for group $j$, $\boldsymbol{\Sigma}$ is a positive-definite covariance matrix and the mixture weight $\eta_j= P(z_i=j)$ as for the one-dimensional case. We may generate Gaussian mixture data through the function piv_sim, specifying the sample size $N$, the desired number of groups $k$ and the $k \times 2$ matrix for the $k$ mean vectors. The argument W handles the weights for a nested mixture, in which the $j$-th component is in turn modelled as a two-component mixture, with covariance matrices $\boldsymbol{\Sigma}{p1}, \boldsymbol{\Sigma}_{p2}$, respectively.

set.seed(500)
N  <- 200
k  <- 3
D <- 2
nMC <- 2000
M1 <- c(-10,8)
M2 <- c(10,.1)
M3 <- c(30,8)
# matrix of input means
Mu <- rbind(M1,M2,M3)
# covariance matrices for the two subgroups
Sigma.p1 <- diag(D)
Sigma.p2 <- (10^2)*diag(D)
# subgroups' weights
W   <- c(0.2,0.8)
# simulate data
sim <- piv_sim(N = N, k = k, Mu = Mu,
 Sigma.p1 = Sigma.p1, Sigma.p2 = Sigma.p2, W = W)

The function piv_MCMC requires only three mandatory arguments: the data object y, the number of components k and the number of MCMC iterations, nMC. By default, it performs Gibbs sampling using the runjags package. If software="rjags", for bivariate data the priors are specified as:

\begin{align} \boldsymbol{\mu}_j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, S_2)\ \Sigma^{-1} \sim & \mbox{Wishart}(S_3, 3)\ \eta \sim & \mbox{Dirichlet}(\boldsymbol{\alpha}), \end{align}

where $\boldsymbol{\alpha}$ is a $k$-dimensional vector and $S_2$ and $S_3$ are positive definite matrices. By default, $\boldsymbol{\mu}_0=\boldsymbol{0}$, $\boldsymbol{\alpha}=(1,\ldots,1)$ and $S_2$ and $S_3$ are diagonal matrices, with diagonal elements equal to 1e+05. Different values can be specified for the hyperparameters $\boldsymbol{\mu}_0, S_2, S_3$ and $\boldsymbol{\alpha}$: priors =list(mu_0 = c(1,1), S2 = ..., S3 = ..., alpha = ...)}, with the constraint for $S2$ and $S3$ to be positive definite, and $\boldsymbol{\alpha}$ a vector of dimension $k$ with nonnegative elements.

If software="rstan", the function performs Hamiltonian Monte Carlo (HMC) sampling. In this case the priors are specified as:

\begin{align} \boldsymbol{\mu}j \sim & \mathcal{N}_2(\boldsymbol{\mu}_0, LD^*L^{T})\ L \sim & \mbox{LKJ}(\eta)\ D{1,2} \sim & \mbox{HalfCauchy}(0, \sigma_d). \end{align}

The covariance matrix is expressed in terms of the LDL decomposition as $LD^L^{T}$, a variant of the classical Cholesky decomposition, where $L$ is a $2 \times 2$ lower unit triangular matrix and $D^$ is a $2 \times 2$ diagonal matrix. The Cholesky correlation factor $L$ is assigned a LKJ prior with $\epsilon$ degrees of freedom, which, combined with priors on the standard deviations of each component, induces a prior on the covariance matrix; as $\epsilon \rightarrow \infty$ the magnitude of correlations between components decreases, whereas $\epsilon=1$ leads to a uniform prior distribution for $L$. By default, the hyperparameters are $\boldsymbol{\mu}_0=\boldsymbol{0}$, $\sigma_d=2.5, \epsilon=1$. Different values can be chosen with the argument: priors=list(mu_0=c(1,2), sigma_d = 4, epsilon = 2).

We fit the model using rjags with 2000 MCMC iterations and default priors:

res <- piv_MCMC(y = sim$y, k= k, nMC =nMC, 
                piv.criterion = "maxsumdiff")

Once we obtain posterior estimates, label switching is likely to occurr. For such a reason, we need to relabel our chains as explained above. In order to relabel the chains, the function piv_rel can be used, which only needs the mcmc = res argument. Relabelled outputs can be displayed through the function piv_plot, with different options for the argument type:

chains: plot the relabelled chains;
hist: plot the point estimates against the histogram of the data.

The optional argument par takes four possible alternative choices: mean, sd, weight and all for the means, standard deviations, weights or all the three mentioned parameters, respectively. By default, par="all".

rel <- piv_rel(mcmc=res)
piv_plot(y = sim$y, mcmc = res, rel_est = rel, par = "mean", type = "chains")
piv_plot(y = sim$y, mcmc = res, rel_est = rel, type = "hist")

Example: fishery data

The Fishery dataset has been previously used by @titterington1985statistical and @papastamoulis2016label and consists of 256 snapper length measurements. It is contained in the bayesmix package [@grun2011bayesmix]. We may display the histogram of the data, along with an estimated kernel density.

data(fish)
y <- fish[,1]
hist(y, breaks=40, prob = TRUE, cex.lab=1.6,
             main ="Fishery data", cex.main =1.7,
             col="navajowhite1", border="navajowhite1")
 lines(density(y), lty=1, lwd=3, col="blue")

We assume a mixture model with $k=5$ groups:

\begin{equation} y_i \sim \sum_{j=1}^k \eta_j \mathcal{N}(\mu_j, \sigma^2_j), \ \ i=1,\ldots,n, \label{eq:fishery} \end{equation} where $\mu_j, \sigma_j$ are the mean and the standard deviation of group $j$, respectively. Moreover, assume that $z_1,\ldots,z_n$ is an unobserved latent sequence of i.i.d. random variables following the multinomial distribution with weights $\boldsymbol{\eta}=(\eta_{1},\dots,\eta_{k})$, such that:

$$P(z_i=j)=\eta_j,$$ where $\eta_j$ is the mixture weight assigned to the group $j$.

We fit our model by simulating $H=15000$ samples from the posterior distribution of $(\boldsymbol{z}, \boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\eta})$. In the univariate case, if the argument software="rjags" is selected (the default option), Gibbs sampling is performed by the package bayesmix, and the priors are:

\begin{eqnarray} \mu_j \sim & \mathcal{N}(\mu_0, 1/B_0)\ \sigma_j \sim & \mbox{invGamma}(\nu_0/2, \nu_0S_0/2)\ \eta \sim & \mbox{Dirichlet}(\boldsymbol{\alpha})\ S_0 \sim & \mbox{Gamma}(g_0/2, g_0G_0/2), \end{eqnarray}

with default values: $B_0=0.1$, $\nu_0 =20$, $g_0 = 10^{-16}$, $G_0 = 10^{-16}$, $\boldsymbol{\alpha}=(1,1,\ldots,1)$. The users may specify their own hyperparameters with the priors arguments, declaring a names list such as: priors = list(mu_0=2, alpha = rep(2, k), ...).

If software="rstan" is selected, the priors are:

\begin{eqnarray} \mu_j & \sim \mathcal{N}(\mu_0, 1/B0inv)\ \sigma_j & \sim \mbox{Lognormal}(\mu_{\sigma}, \tau_{\sigma})\ \eta_j & \sim \mbox{Uniform}(0,1), \end{eqnarray}

where the vector of the weights $\boldsymbol{\eta}=(\eta_1,\ldots,\eta_k)$ is a $k$-simplex. Default hyperparameters values are: $\mu_0=0, B0inv=0.1, \mu_{\sigma}=0, \tau_{\sigma}=2$. Here also, the users may choose their own hyperparameters values in the following way: priors = list(mu_sigma = 0, tau_sigma = 1,...).

We fit the model using the rjags method, and we set the burnin period to 7500.

k <- 5
nMC <- 15000
res <- piv_MCMC(y = y, k = k, nMC = nMC, 
                burn = 0.5*nMC, software = "rjags")

First of all, we may access the true number of iterations by tiping:

res$true.iter

We may have a glimpse if label switching ocurred or not by looking at the traceplot for the mean parameters, $\mu_j$. To do this, we apply the function piv_rel to relabel the chains and obtain useful inferences; the only argument for this function is the MCMC result just obtained with piv_MCMC. The function piv_plot displays some graphical tools, both traceplots (argument type="chains") and histograms along with the final relabelled means (argument type="hist"). For both plot ttpes, the function returns a printed message explaining how to interpret the results.

rel <- piv_rel(mcmc=res)
piv_plot(y=y, res, rel, par = "mean", type="chains")
piv_plot(y=y, res, rel, type="hist")

The first plot displays the traceplots for the parameters $\boldsymbol{\mu}$. From the left plot showing the raw outputs as given by the Gibbs sampling, we note that label switching clearly occurred. Our algorithm seems able to reorder the mean $\mu_j$ and the weights $\eta_j$, for $j=1,\ldots,k$. Of course, a MCMC sampler which does not switch the labels would ideal, but nearly impossible to program. However, we could assess how two diferent sampler perform, by repeating the analysis above by selecting software="rstan" in the piv_MCMC function.

# stan code not evaluated here
res2 <- piv_MCMC(y = y, k = k, nMC = 3000, 
                 software = "rstan")
rel2 <- piv_rel(res2)
piv_plot(y=y, res2, rel2, par = "mean", type="chains")

With the rstan option, we can use the bayesplot functions on the \code{stanfit} argument:

# stan code not evaluated here
posterior <- as.array(res2$stanfit)
mcmc_intervals(posterior, regex_pars = c("mu"))

Regardless of the software that we chose, we may extract the JAGS/Stan model by typing:

cat(res$model)

In order to estimate the number of clusters in the data, we can now fit sparse finite mixtures as proposed by @fruhwirth2019here by assuming:

$$\boldsymbol{\eta} \sim \text{Dirichlet}(e_0),$$

where the smaller $e_0$ and the the smaller the number of clusters a-posteriori. The function allows for a Gamma prior on $e_0$ with hyperparameters $a_e, b_e$ that may be chosen by the users.

res3 <- piv_MCMC(y = y, k = k, nMC = nMC, sparsity = TRUE,
                 priors = list(alpha = rep(0.001, k))) # sparse on eta
barplot(table(res3$nclusters), xlab= expression(K["+"]),
        col = "blue", border = "red", main = expression(paste("p(",K["+"], "|y)")),
        cex.main=3, yaxt ="n", cex.axis=2.4, cex.names=2.4,
        cex.lab=2)