Using CliPS to Identify a Dynamic Mixture of Finite Mixtures of Multivariate Gaussian Distributions -- Case Study: Diabetes Data"
In telescope: Bayesian Mixtures with an Unknown Number of Components

knitr::opts_chunk$set(echo = TRUE)

Introduction

This vignette allows to reproduce the results for the case study discussed in @Malsiner+Fruehwirth+Gruen:2026 for the diabetes data set from package mclust. In particular, we illustrate the CliPS approach proposed in @Malsiner+Fruehwirth+Gruen:2026 for a Bayesian multivariate Gaussian mixture with a prior on the number of components $K$. First, we fit the dynamic mixture of multivariate Gaussian mixtures to the data using the telescoping sampler. Then, based on the posterior draws, we identify the mixture by clustering the component means in the point process representation. The numbering of the following steps is aligned with the numbering of the CLiPS procedure in @Malsiner+Fruehwirth+Gruen:2026.

We start by loading the package, the data and plotting the data.

library("telescope")
data("diabetes", package = "mclust")
y <- diabetes[, c("glucose", "insulin", "sspg")]
z <- diabetes[, "class"]

pairs(y, col = c("darkred", "blue", "darkgreen")[z], pch = 19)

Step 1: Define a mixture model

We use the prior specification and the telescoping sampler for MCMC sampling as proposed in @Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021. In particular, the prior on the number of components $K$ is selected as a beta-negative-binomial distribution with parameters $(1,4,3)$. The mixture weights a-priori have a Dirichlet distribution with parameter $\alpha/K$ where $\alpha = 0.5$.

priorOnK <- priorOnK_spec("BNB_143")
priorOnWeights <- priorOnAlpha_spec("alpha_const", alpha = 0.5)

We specify the prior on the component parameters as in @Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021.

r <- ncol(y)
R <- apply(y, 2, function(x) diff(range(x)))
b0 <- apply(y, 2, median)
B_0 <- rep(1, r)  
B0 <- diag((R^2) * B_0)
c0 <- 2.5 + (r-1)/2
g0 <- 0.5 + (r-1)/2
G0 <- 100 * g0/c0 * diag((1/R^2), nrow = r)
C0 <- g0 * chol2inv(chol(G0))

For the sampling, we set the number of burn-in iterations burnin to be discarded and the number of recorded iterations M after thinning with thin.

M <- 1000
burnin <- 1000
thin <- 1

$k$-means clustering into 3 groups is used to get an initial partition which is used to determine initial values for the component specific means and covariances. The component weights are initialized with equal weights.

set.seed(4711)
z0 <- kmeans(y, centers = 3, nstart = 100)$cluster
mu <- sapply(split(y, z0), colMeans)
Sigma <- array(sapply(split(y, z0), var), c(r, r, ncol(mu)))
eta <- rep(1/ncol(mu), ncol(mu))

Step 2: MCMC sampling from the posterior

Using this prior specification as well as initialization and MCMC settings, we draw samples from the posterior using the telescoping sampler. For computational ease, we set a maximum number of components to be considered using Kmax.

Kmax <- 50
samples <-
    sampleMultNormMixture(
        y, z0, mu, Sigma, eta,
        c0, g0, G0, C0, b0, B0,
        M, burnin, thin, 
        Kmax = Kmax, G = "MixDynamic",
        priorOnK, priorOnWeights = priorOnWeights)

The sampling function returns a named list where the sampled parameters and latent variables are contained. The list includes the component means Mu, the weights Eta, the allocations S, the number of observations Nk assigned to components, the number of components K, the number of filled components Kplus, and the parameter values corresponding to the mode of the nonnormalized posterior nonnormpost_mode_list. These values are extracted for further post-processing.

Mu <- samples$Mu
Eta <- samples$Eta
S <- samples$S
Nk <- samples$Nk
K <- samples$K
Kplus <- samples$Kplus
nonnormpost_mode_list <- samples$nonnormpost_mode

Diagnostic plots of the run show the sampled $K$ and $K_+$ and the sampled weights $\eta_k$, see Figure 5 of @Malsiner+Fruehwirth+Gruen:2026.

par(mfrow = c(1, 2))
with(samples,
     matplot(burnin + 1:M, cbind(K, Kplus), type = "l", lty = 1,
             col = c("grey", "black"),
             xlab = "iteration", 
             ylab = expression(`K/` ~K["+"]),
             ylim = c(0, Kmax)))
matplot(burnin + 1:M, samples$Eta, type = "l", lty = 1, col = 1,
        xlab = "iteration", ylim = 0:1,
        ylab = expression(eta["k"]))

The following plot shows the pairwise scatter plot of all sampled component means (within suitable ranges), see Figure 6 in @Malsiner+Fruehwirth+Gruen:2026.

Mu_ <- do.call("rbind", lapply(1:Kmax, function(k) samples$Mu[,,k])) |>
    as.data.frame() |> na.omit()
colnames(Mu_) <- colnames(y)
Mu_ <- subset(Mu_,
              (glucose >= 50 & glucose <= 320) &
              (sspg >= 0 & sspg <= 500) &
              (insulin >= 300 & insulin <= 1300))
pairs(Mu_, col = rgb(0, 0, 0, 0.2), pch = 19)

Step 2b: Post-processing the MCMC draws

We determine the posterior of the number of filled components.

(p_Kplus <- tabulate(Kplus, nbins = max(Kplus))/M)

The number of clusters $\hat{K}+$ is estimated by taking the mode of the posterior of $K+$.

Kplus_hat <- which.max(p_Kplus)
Kplus_hat

The number of draws $M_0$ where $K_+ = \hat{K}_+$ is determined.

M0 <- sum(Kplus == Kplus_hat)
M0

We determine the indices of those iterations which have exactly Kplus_hat filled components. For each parameter, we extract those draws with exactly $\hat{K}_+$ filled components and eliminate the draws of empty components.

index <- Kplus == Kplus_hat

Nk[is.na(Nk)] <- 0
Nk_Kplus <- (Nk[index, ] > 0)

Mu_inter <- Mu[index, , , drop = FALSE]
Mu_Kplus <- array(0, dim = c(M0, r, Kplus_hat)) 
for (j in 1:r) {
  Mu_Kplus[, j, ] <- Mu_inter[, j, ][Nk_Kplus]
}

Eta_inter <- Eta[index, ]
Eta_Kplus <- matrix(Eta_inter[Nk_Kplus], ncol = Kplus_hat)

w <- which(index)
S_Kplus <- matrix(0, M0, nrow(y))
for (i in seq_along(w)) {
  m <- w[i]
  perm_S <- rep(0, Kmax)
  perm_S[Nk[m, ] != 0] <- 1:Kplus_hat
  S_Kplus[i, ] <- perm_S[S[m, ]]
}

Steps 3-4-5: Clustering of the draws

We call the function identifyMixture() of the package telescope to cluster the draws. The argument Func contains the array of the functional values $\phi(\theta_k)$ with dimension ($M_0 \times K_+ \times d$), where $d= dim(\phi(\theta_k))$. This array contains the draws for clustering. Func_init contains the centers of the clusters used for initializing $k$-means. The draws in Mu_Kplus, Eta_Kplus, S_Kplus are reordered according to the classification sequence obtained with the $k$-means algorithm.

Func_init <- t(nonnormpost_mode_list[[Kplus_hat]]$mu)
identified_Kplus <- identifyMixture(
    Func = Mu_Kplus, Mu_Kplus, Eta_Kplus, S_Kplus, Func_init)

identifyMixture() returns a named list where S, Mu, and Eta contain the relabed draws after having discarded draws which are not permutations, non_perm_rate gives the non-permutation rate, and class contains the labels of the functionals obtained with the $k$-means algorithm.

identified_Kplus$non_perm_rate

The non-permutation rate is $r identified_Kplus$non_perm_rate$.

We visualize the relabeled draws of the component means.

Mu_ <- do.call("rbind", lapply(1:Kplus_hat, function(k) identified_Kplus$Mu[,,k])) |>
    as.data.frame()
colnames(Mu_) <- colnames(y)
z_ <- identified_Kplus$class
COLS <- apply(rbind(col2rgb(c("darkred", "blue", "darkgreen")),
                    alpha = 0.2 * 255) / 255,
              2, function(x) do.call("rgb", as.list(x)))
pairs(Mu_, col = COLS[z_], pch = 19)

Step 6: Characterization of the cluster distributions

We use the relabeled draws to characterize the cluster distribution. We estimate the cluster specific parameters (e.g., posterior means and cluster sizes) and determine the final partition by assigning each observation to the cluster where it was assigned most frequently. The final partition is stored in z_sp. Finally, the estimated clustering solution is visualized.

colMeans(identified_Kplus$Mu)
colMeans(identified_Kplus$Eta)

z_sp <- apply(identified_Kplus$S, 2,
              function(x) which.max(tabulate(x, Kplus_hat)))
table(z_sp)
table(z, z_sp)

library("mclust")
1 - classError(z_sp, z)$errorRate
adjustedRandIndex(z, z_sp)

pairs(y, col = c("darkred", "blue", "darkgreen")[z_sp], pch = 19)