```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```
```{css, echo=FALSE}
p.caption {
  font-style: italic;
}
```
Loads the necessary packages for the vignette.

```r
library(future)   # allows parallel processing in greed()
library(greed)
library(aricode)  # for ARI/NMI computation
library(dplyr)    # data wrangling
library(tidyr)
library(purrr)
library(ggpubr)   # advanced plotting
library(ggplot2)
library(ggwordcloud)
library(careless) # survey pre-processing
set.seed(2134)
future::plan("multisession", workers = 5) # may be increased
```
The `greed` package and its DLVM framework allow the clustering of categorical data. This vignette describes typical use cases of the `greed()` function in this context and illustrates its use on real datasets.
We are interested in the clustering of categorical datasets, which are typically found in survey data or item response theory (IRT). In this context, we observe $n$ individuals described by $p$ variables, each variable $j$ taking one among $d_j$ modalities. Such datasets are typically represented using a one-hot encoding of each factor in a design matrix $\mathbf{X} \in \{0,1\}^{n \times d}$, where $d = \sum_{j=1}^p d_j$. Latent class analysis (LCA) is a generative model for categorical data clustering which posits conditional independence of the factor variables given the (unknown) partition. Below is a description of its Bayesian formulation with the use of proper conjugate priors:
$$
\begin{aligned}
\pi &\sim \textrm{Dirichlet}(\alpha), \\
\forall k, \forall j, \quad \theta_{kj} &\sim \textrm{Dirichlet}_{d_j}(\beta), \\
Z_i &\sim \mathcal{M}_K(1, \pi), \\
\forall j = 1, \ldots, p, \quad X_{ij} \mid Z_{ik} = 1 &\sim \mathcal{M}_{d_j}(1, \theta_{kj}).
\end{aligned}
$$
For each cluster $k$ and variable $j$, the vector $\theta_{kj}$ represents the probability of each of the $d_j$ modalities. With the above choice of priors, the LCA model admits an exact ICL expression similar to the mixture of multinomials derived in the supplementary materials of Côme et al. (2021, Section 3).
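To make this generative process concrete, here is a minimal, self-contained R sketch that simulates a small dataset from the LCA model above. All names (`n`, `p`, `K`, `d`, `theta`, `Xsim`) are illustrative and not part of the `greed` API.

```r
# minimal sketch: simulate from the LCA generative model above
# (illustrative only, not part of the greed package)
n <- 100; p <- 3; K <- 2
d <- c(3, 2, 4)                 # number of modalities d_j for each variable j
pi <- c(0.4, 0.6)               # cluster proportions
# theta[[k]][[j]]: modality probabilities for cluster k and variable j,
# drawn from a flat Dirichlet (normalized Gamma(1) draws)
theta <- lapply(1:K, function(k) {
  lapply(1:p, function(j) {
    w <- rgamma(d[j], shape = 1)
    w / sum(w)
  })
})
Z <- sample(1:K, n, replace = TRUE, prob = pi)   # latent cluster labels
Xsim <- as.data.frame(lapply(1:p, function(j) {
  factor(sapply(Z, function(k) sample(1:d[j], 1, prob = theta[[k]][[j]])))
}))
names(Xsim) <- paste0("V", 1:p)  # a factor-only data.frame, ready for greed()
```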
The `greed` package implements this model via the `Lca` model-class, with default values for the hyper-parameters $\alpha$ and $\beta$. These defaults may be changed, and we refer to the `?Lca` documentation for further details.
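For instance, a model object with the default priors can be instantiated and printed directly (a minimal sketch; the exact hyper-parameter names and slots are documented in `?Lca`):

```r
# build an Lca model object with its default hyper-parameters and print it;
# see ?Lca for the exact names and meanings of the prior parameters
model <- Lca()
model
```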
We now illustrate the use of the `greed` package on two real datasets:

- the `mushroom` dataset, and
- the `Youngpeoplesurvey` dataset.

Both datasets come attached with the `greed` package.
data("mushroom")
We begin by forming the necessary vectors for the analysis. The data has $n=$ `r nrow(mushroom)` rows and $p=$ `r ncol(mushroom)` columns. The first column contains the poisonous status of each mushroom, with two possible values: "p" for "poisonous" and "e" for "edible". It will serve as the ground-truth clustering we seek to recover.
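The balance of the two classes can be checked directly (a one-line sketch using base R):

```r
# distribution of edible ("e") versus poisonous ("p") mushrooms
table(mushroom$edibility)
```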
```r
mushroom[, 1:10] %>%
  head() %>%
  knitr::kable()
```
The remaining variables are used for clustering. Note that we only use a subset of the data for illustration purposes here.

```r
X = mushroom[, -1]
subset = sample(1:nrow(X), size = 1000)
label = mushroom$edibility[subset]
```
The clustering is done via the main function `greed()`, with the `model` argument set to `Lca()` and the hybrid genetic algorithm for ICL maximization.
Note: the `Lca` model may only be used with datasets stored in a `data.frame` containing factors only. When such data are provided to the `greed()` function, an `Lca` model is picked by default. To perform the clustering, it is therefore sufficient to call `greed()` with the prepared `data.frame`.
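As a quick sanity check (a minimal sketch), one may verify that every column is a factor, the condition under which `greed()` selects an `Lca` model automatically:

```r
# all columns must be factors for the Lca model to be picked by default
all(sapply(X, is.factor))
```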
sol_mush<-greed(X[subset,], model=Lca())
The hybrid genetic algorithm found a solution with $K=$ `r sol_mush@K` clusters, which is quite over-segmented. Still, it displays a good separation between edible and poisonous mushrooms:
```r
table(label, clustering(sol_mush)) %>% knitr::kable()
```
The partition's NMI is `r round(NMI(label, clustering(sol_mush)), 2)`, which is explained by the over-segmentation of the solution compared to the $2$-class problem.
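For comparison, the adjusted Rand index can be computed in the same way; this short sketch uses the `ARI()` function from the `aricode` package loaded above:

```r
# ARI of the over-segmented solution against the 2-class labels
round(ARI(label, clustering(sol_mush)), 2)
```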
Exploring the dendrogram provided by the hierarchical algorithm is quite useful in this case.
```r
plot(sol_mush, type = 'tree')
```
We clearly see a hierarchical structure appearing, with $K=2$ or $K=4$ main clusters. We can thus cut the tree at $K=2$ and look at the resulting solution.
```r
sol2 = cut(sol_mush, 2)
table(label, clustering(sol2)) %>% knitr::kable()
```
Here, we clearly see that the order of the merges is consistent with the labels, and the final NMI is `r round(NMI(clustering(sol2), label), 2)`. While some poisonous mushrooms have been categorized as edible, this is a consequence of the way the labels were set: mushrooms for which the edibility status was unknown were classified as poisonous by default, a reasonable choice from a strict health perspective. Furthermore, as the data documentation specifies, "*there is no simple rule for determining the edibility of a mushroom*". The unsupervised problem is thus hard, and the obtained clustering is already satisfactory. Moreover, this illustrates the value of the hierarchical algorithm for accessing coarser partitions.
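For completeness, the coarser $K=4$ partition visible in the dendrogram can be extracted with the same `cut()` mechanism (a short sketch mirroring the $K=2$ case above):

```r
# the intermediate 4-cluster solution suggested by the dendrogram
sol4 = cut(sol_mush, 4)
table(label, clustering(sol4)) %>% knitr::kable()
```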
data("Youngpeoplesurvey")
We begin by preprocessing the data, keeping only the categorical variables. The original dataset has $n=$ `r nrow(Youngpeoplesurvey)` respondents and $p=$ `r ncol(Youngpeoplesurvey)` variables. We keep only the features related to the musical tastes of the respondents and remove respondents with long streaks of identical responses. Finally, the data are cast to factors, with an explicit level for missing responses.
```r
# flag respondents without long streaks of identical answers
# (careless::longstring measures the longest run of identical responses)
selected = Youngpeoplesurvey %>%
  mutate(string = longstring(.)) %>%
  mutate(sel = if_else(string <= 10, TRUE, FALSE)) %>%
  pull(sel)
Xnum = Youngpeoplesurvey %>% filter(selected)
# cast every column to a factor with an explicit "NA" level
X = Xnum %>%
  mutate_all(function(x) {
    x[is.na(x)] = "NA"
    factor(x, levels = c(1:5, "NA"))
  }) %>%
  droplevels()
```
As previously, the clustering is obtained with a call to `greed()`, and the `Lca` generative model is picked by default since the dataset is a `data.frame` containing only factors. We demonstrate here how to increase the exploration capabilities of the optimization algorithm a little by increasing the population size and the mutation probability.
```r
sol = greed(X, alg = Hybrid(pop_size = 50, prob_mutation = 0.5))
```
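Before exploring the solution, the cluster sizes can be checked with a one-line sketch using `greed`'s `clustering()` accessor:

```r
# tabulate the cluster sizes of the selected partition
table(clustering(sol))
```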
The algorithm found $K=$ `r K(sol)` clusters, which are quite balanced. To explore the results, we plot the dendrogram found.
```r
plot(sol, type = 'tree')
```
We may also use the `marginals` plot to depict the conditional probabilities of the different responses given the cluster assignments.
```r
plot(sol, type = 'marginals')
```
The clusters appear to be quite different and coherent. For example, the seventh group (second row, in brown) displays a strong tendency to like Opera and Classical music and to dislike more modern genres (Techno, Trance, Punk or Hip-Hop), whereas the fourth group (top row, in pink) corresponds to music lovers with quite eclectic tastes across the genres proposed in the survey.
In addition, the conditional maximum a posteriori estimates of $\theta_{kj}$ are available through the `coef()` method.
```r
params = coef(sol)
```
We may then look, for example, at the cluster distributions for the Country music style:
```r
params$Thetak$Country %>% knitr::kable(digits = 2)
```
Cluster structures may also be visualized with word clouds of feature names. In this representation, the color encodes whether a feature's average score in the cluster is greater or smaller than the average score of the whole population: a big blue feature corresponds to a music genre that scores higher than the population mean in this cluster, while a big red feature corresponds to a music genre that scores lower than the mean in the corresponding group.
```r
# get the Lca MAP parameter estimates
params = coef(sol)
# compute the mean score per cluster and feature
means_scores = lapply(params$Thetak, function(x) {
  apply(x[, 1:5], 1, function(r) {
    sum(r * 1:5)
  })
})
# move to long format
means_scores_long = do.call(
  rbind,
  map2(means_scores, names(means_scores), function(x, y) {
    tibble(cluster = 1:K(sol), mean = x, var = y)
  })
) %>%
  mutate(var = gsub("\\.", " ", var))
# compute the average score for the whole population
means_scores_glob = Xnum %>%
  summarise_all(function(x) { mean(x, na.rm = TRUE) }) %>%
  tidyr::pivot_longer(all_of(1:ncol(Xnum)), names_to = "var") %>%
  mutate(var = gsub("[,-/]", " ", var))
# compute the differences between cluster scores and average scores
gg = means_scores_long %>%
  left_join(means_scores_glob, by = "var") %>%
  mutate(dm = mean - value)
# create the word clouds (keeping only the music genres that depart
# from the average in each cluster)
ggplot(gg %>% filter(abs(dm) > 0.1),
       aes(label = var, size = abs(dm), color = dm)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 7) +
  theme_minimal() +
  scale_color_gradient2(guide = "none") +
  facet_wrap(~cluster)
```
Using such a visualization allows us to easily describe the different groups. The first cluster almost exclusively likes Pop and Dance music, the second Opera and Classical music. The third cluster has higher scores for almost all music genres, while another cluster stays close to the global average score, and so on.