```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```
```{css, echo=FALSE}
p.caption {
  font-style: italic;
}
```
We first load the required packages.

```r
library(future) # allows parallel processing in greed()
library(greed)
library(dplyr)
library(ggplot2)
set.seed(2134)
future::plan("multisession", workers = 2) # may be increased
```
Gaussian Mixture Models (GMMs) count among the most widely used DLVMs for continuous data clustering. The greed package handles this family of models and implements efficient visualization tools for the clustering results that we detail below.
Without any constraints, the Bayesian formulation of GMMs leading to a tractable exact ICL expression uses a Normal and inverse-Wishart conjugate prior on the means and covariances $\mathbf{\theta} = (\mathbf{\mu}_k, \mathbf{\Sigma}_k)_k$. This prior is defined with hyperparameters $\mathbf{\beta} = (\mathbf{\mu}, \tau, n_0, \mathbf{\varepsilon})$ and the hierarchical model is written as follows:

```{=tex}
\begin{equation}
\label{eq:gmm}
\begin{aligned}
\pi &\sim \textrm{Dirichlet}_K(\alpha), \\
Z_i &\sim \mathcal{M}(1, \pi), \\
\mathbf{\Sigma}_k^{-1} &\sim \textrm{Wishart}(\mathbf{\varepsilon}^{-1}, n_0), \\
\mathbf{\mu}_k &\sim \mathcal{N}\left(\mathbf{\mu}, \tfrac{1}{\tau} \mathbf{\Sigma}_k\right), \\
X_{i} \mid Z_{ik} = 1 &\sim \mathcal{N}(\mathbf{\mu}_k, \mathbf{\Sigma}_{k}).
\end{aligned}
\end{equation}
```
Contrary to other models, these priors are informative and may therefore have a noticeable impact on the obtained results. By default, the prior parameters are set as follows:

- $\alpha=1$
- $\mu=\bar{X}$
- $\tau=0.01$
- $n_0=d$
- $\mathbf{\varepsilon}=0.1\,\textrm{diag}(\mathbf{\hat{\Sigma}}_{\mathbf{X}})$

These default values were chosen to accommodate a variety of situations but may easily be modified. For more details, we refer to the class documentation `?Gmm`. Finally, it is possible to specify constraints on the covariance matrices, as discussed below.

# An introductory example: the diabetes dataset

Let us describe a first use case on the `diabetes` dataset from the **mclust** package. The data describe $p=3$ biological variables for $n=145$ patients with $K=3$ different types of diabetes: normal, overt and chemical. We are interested in the clustering of this dataset to see whether the three biological variables can discriminate between the three diabetes types.

```r
data(diabetes, package = "mclust")
X = diabetes[,-1]
X = data.frame(scale(X)) # centering and scaling
```
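The prior defaults listed above may also be overridden explicitly when constructing the model object. The following sketch only uses the `epsilon` and `tau` arguments, which are the ones used later in this vignette; the `model_custom` name is purely illustrative, and the remaining prior parameters are documented in `?Gmm`.

```r
# Sketch: passing prior hyperparameters explicitly to the Gmm constructor.
# These values simply mirror the defaults described above.
model_custom = Gmm(epsilon = 0.1 * diag(diag(cov(X))), tau = 0.01)
```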
We apply the `greed()` function with a `Gmm` model object and default hyperparameters. The optimization algorithm used by default is the hybrid genetic algorithm of Come et al. (2021).
```r
sol = greed(X, model = Gmm())
```
The optimal clustering found by the hybrid GA has $K^\star=3$ clusters. You may then easily extract the found clustering with the `clustering()` method and compare it with the known classes:
```r
table(diabetes$cl, clustering(sol)) %>% knitr::kable()
```
It roughly corresponds to the known groups, with clusters 1, 2 and 3 recovering the overt, chemical and normal diabetes types respectively, with some misclassifications that are expected with GMM clustering on this dataset.
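As a complementary check, this agreement may be quantified with the adjusted Rand index; here is a small sketch using the `adjustedRandIndex()` function from **mclust**, which is already available since it provides the data:

```r
# Adjusted Rand index between the known diabetes types and the found partition
mclust::adjustedRandIndex(diabetes$cl, clustering(sol))
```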
To get an overview of the clustering results, you may also use the `gmmpairs()` plot function, which displays the scatter plot matrix colored by cluster membership, together with the Gaussian ellipses estimated in each cluster by maximum a posteriori:
```r
gmmpairs(sol, X)
```
Furthermore, users may use the generic `coef()` function to get the mixture component parameters, or rather their maximum a posteriori estimates conditionally on the partition. The result is a simple list with a field for each parameter (means: `muk`, covariance matrices: `Sigmak`, proportions: `pi`). Here is an example for the $3$-component GMM we just fitted, where we extract the mean of the first component.
```r
params = coef(sol)
names(params)
params$muk[[1]]
```
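The other fields of the list are accessed in the same way; for instance, the covariance matrix of the first component and the mixture proportions:

```r
params$Sigmak[[1]] # covariance matrix of the first component
params$pi          # estimated mixture proportions
```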
If a user wants to experiment with other values of the prior hyperparameters, the most important one is $\mathbf{\varepsilon}$, which controls the Wishart prior on the clusters' covariance matrices. This amounts to specifying the prior on the dispersion inside each cluster. For instance, one may encode the a priori belief that the variance is small inside clusters, which amounts to diminishing the $0.1$ coefficient in front of $\hat{\mathbf{\Sigma}}_{\mathbf{X}}$. In this case, it makes sense to decrease $\tau$ in the same proportion to keep a flat prior on the clusters' means. For the diabetes data, such a choice leads to an interesting solution, where the strong prior constraint leads to one cluster being created to fit a single outlier in the "chemical" diabetes group.
```r
sol_dense = greed(X, model = Gmm(epsilon = 0.01 * diag(diag(cov(X))), tau = 0.001))
gmmpairs(sol_dense, X)
```
When the number of variables $p$ becomes large, it can be of interest to reduce the flexibility of a GMM by adding a specific set of constraints. The diagonal covariance model, with constraint $\mathbf{\Sigma}_k = \textrm{diag}(\sigma_{k1}, \ldots, \sigma_{kp})$, is implemented in the greed package through its `DiagGmm` S4 class. Its hierarchical formulation is similar to the unconstrained GMM, except that the Wishart prior on $\mathbf{\Sigma}_k$ is replaced by a Gamma prior on each $\sigma_{kj}$, with shape $\kappa$ and rate $\beta$ hyperparameters defaulted to $1$ and the mean of the empirical column variances, respectively.
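In the notation of the unconstrained model above, this part of the prior reads as follows; this is only a sketch based on the description above, and `?DiagGmm` gives the exact parameterization and defaults.

```{=tex}
\begin{equation}
\sigma_{kj} \sim \textrm{Gamma}(\kappa, \beta), \quad j = 1, \ldots, p.
\end{equation}
```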
Fitting this model on the diabetes dataset, `greed()` with a hybrid GA finds a $4$-component diagonal GMM.
```r
soldiag = greed(X, model = DiagGmm())
gmmpairs(soldiag, X)
```
As previously, the clustering is quite aligned with the known groups, except that two clusters are now needed to fit the "overt" class, which is strongly elongated.
```r
table(diabetes$cl, clustering(soldiag)) %>% knitr::kable()
```
Moreover, we may investigate coarser partitions when inspecting a clustering. To plot the dendrogram, one may simply use the `plot()` function with the `type` option set to `'tree'`:
```r
plot(soldiag, type = 'tree')
```
Then, one may use the `cut()` method to extract a new clustering at any level of the hierarchy. Here is the solution at $K=3$:
```r
solK3 = cut(soldiag, K = 3)
table(diabetes$cl, clustering(solK3)) %>% knitr::kable()
```
Again, the clusters are consistent with the known groups.
Finally, we may compare the ICL values of the full and diagonal models using the `ICL()` method, and compute their difference to get a quantity similar to a Bayes factor, with the difference that the partition was not marginalized out but optimized.
```r
cat('Full GMM has an ICL = ', ICL(sol), '.\n')
cat('Diagonal GMM has an ICL = ', ICL(soldiag), '.\n')
bayesfactor = ICL(sol) - ICL(soldiag)
cat('Bayes factor = ', bayesfactor, '.\n')
```
The full model achieves a substantially better fit on this dataset.
Note: a word of caution on these statistics: they take the chosen hyperparameters into account and may therefore be influenced by their values, as a classical Bayes factor would. These values must therefore be chosen with care and their impact possibly investigated with a sensitivity analysis.
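A minimal sketch of such a sensitivity check is shown below; the scaling factor is arbitrary and the `sol_loose` name is only illustrative.

```r
# Refit with a looser covariance prior (arbitrary factor) and compare ICL values
sol_loose = greed(X, model = Gmm(epsilon = 0.5 * diag(diag(cov(X)))))
c(default = ICL(sol), loose = ICL(sol_loose))
```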
As explained above, diagonal models are attractive when working in high-dimensional settings. Their interest is twofold. First, the number of free parameters in the generative model is reduced, which helps reduce the computational time even though they are integrated out in the ICL. Second, the prior may be defined such that it is less informative. We illustrate this with a subset of the `fashion-mnist` data provided with the package, which contains $p=784$-dimensional vectors (corresponding to 28x28 flattened images of clothing). In this setting, the optimization algorithm is switched to `Seed()` (see the `seed-class` documentation), which, while a bit less efficient than the hybrid algorithm used by default in `greed()`, has a lower computational burden since it relies on a single, carefully chosen seeded initialization. In addition, we also increase the initial value of $K$ in order to compensate for the `Seed` method's inability to find a value of $K$ larger than its initialization.
data("fashion") dim(fashion) sol_fashion=greed(fashion,model=DiagGmm(),alg=Seed(),K=60)
On this more complex dataset, the clustering found by the algorithm has `r K(sol_fashion)` clusters, and the dendrogram proves very informative, highlighting the complex structure of these data.
```r
plot(sol_fashion, type = 'tree')
```
Finally, the cluster centers look coherent and the hierarchical ordering underlines a form of "geometric proximity" between cluster objects.
```r
plot_clust_centers <- function(sol) {
  clust_centers = coef(sol)$muk
  im_list = lapply(1:sol@K, function(k) {
    data.frame(i = rep(28:1, each = 28), j = rep(1:28, 28),
               v = t(clust_centers[[k]]), k = k)
  })
  ims = do.call(rbind, im_list)
  gg = ggplot(ims) +
    geom_tile(aes(y = i, x = j, fill = v)) +
    scale_fill_gradientn(colors = c("#ffffff", "#000000"), guide = "none") +
    scale_x_continuous(breaks = c()) + scale_y_continuous(breaks = c()) +
    facet_wrap(~k) +
    coord_equal() + theme_minimal() +
    theme(axis.title.x = element_blank(), axis.title.y = element_blank())
  return(gg)
}
plot_clust_centers(sol_fashion)
```
As already demonstrated, the clustering can be cut at a coarser level if one wants a more synthetic result. We may, for example, look at the centers of the $7$-component clustering:
```r
sol_fashion_K7 = cut(sol_fashion, K = 7)
plot_clust_centers(sol_fashion_K7)
```