knitr::opts_chunk$set( collapse = TRUE, fig.path = "man/figures/" )
Cluster any count data matrix with a fixed number of variables. Implements the branch & bound Classification-Variational Expectation-Maximisation (BBCVEM) algorithm of the paper referenced below (to appear in Computational Statistics).
MoMPCA is available on CRAN, and the development version is available on GitHub.
MoMPCA needs the following R packages, so check that they are installed on your computer.
required_CRAN <- c("methods", "topicmodels", "tm", "Matrix", "slam", "magrittr",
                   "dplyr", "stats", "doParallel", "foreach", "ggplot2",
                   "reshape2", "tidytext")
not_installed_CRAN <- setdiff(required_CRAN, rownames(installed.packages()))
if (length(not_installed_CRAN) > 0) install.packages(not_installed_CRAN)
# CRAN release
install.packages("MoMPCA")
# development version from GitHub
remotes::install_github("nicolasJouvin/MoMPCA")
The package comes with the BBCmsg data set and a simulate_BBC() function, which reproduces the simulation setting of the paper.
library(MoMPCA)
simu <- simulate_BBC(N = 400, L = 200, epsilon = 0, lambda = 1)
dtm <- simu$dtm.full
Ytruth <- simu$Ytruth # true clustering
The dtm object is a tm::DocumentTermMatrix(). The main fitting function is mmpca_clust(), which allows for a parallel backend via its mc.cores argument. A simple wrapper around this function, mmpca_clust_modelselect(), performs model selection of (Q, K) with an ICL criterion. Please be aware that the greedy nature of the algorithm can make computations quite intensive.
res <- mmpca_clust(simu$dtm.full, Q = 6, K = 4, Yinit = 'random', method = 'BBCVEM', max.epochs = 7, keep = 1, verbose = 2, nruns = 2, mc.cores = 1)
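The call above runs on a single core (mc.cores = 1). As a minimal sketch, assuming you want to use several cores while leaving one free for the rest of the system, a value for mc.cores could be chosen with the base parallel package. The variable names are illustrative and not part of MoMPCA.

```r
# Sketch only: pick a conservative core count to pass as mc.cores.
# `available` and `n_cores` are illustrative names, not MoMPCA objects.
available <- parallel::detectCores()   # may return NA on some platforms
if (is.na(available)) available <- 1L  # fall back to sequential execution
n_cores <- max(1L, available - 1L)     # leave one core free, never below 1
```

The resulting n_cores can then be passed as mc.cores = n_cores in the call to mmpca_clust().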
The top words of the topic matrix beta can then be plotted (if working with text):
ggtopics <- plot(res, type = 'topics', n_words = 5)
print(ggtopics)
The evolution of the bound across epochs can be plotted as well:
ggbound <- plot(res, type = 'bound')
print(ggbound)
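Since the simulation returns the true clustering Ytruth, an estimated partition can be compared against it. The sketch below is a base R implementation of the adjusted Rand index, shown on toy label vectors. The helper adjusted_rand_index() is hypothetical and not part of MoMPCA, and how to extract the estimated labels from res is not covered here (see the package documentation).

```r
# Hypothetical helper (not part of MoMPCA): adjusted Rand index in base R.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b), length(labels_a) >= 2)
  n_pairs <- function(n) n * (n - 1) / 2       # number of unordered pairs
  tab <- table(labels_a, labels_b)             # contingency table
  sum_cells <- sum(n_pairs(tab))               # pairs grouped together in both
  sum_rows <- sum(n_pairs(rowSums(tab)))       # pairs grouped together in a
  sum_cols <- sum(n_pairs(colSums(tab)))       # pairs grouped together in b
  expected <- sum_rows * sum_cols / n_pairs(length(labels_a))
  max_index <- (sum_rows + sum_cols) / 2
  if (max_index == expected) return(1)         # degenerate identical partitions
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels standing in for a true and an estimated clustering.
truth <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabelled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)     # same partition, other labels
one_mistake <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)    # one observation misassigned

ari_perfect <- adjusted_rand_index(truth, relabelled)
ari_noisy <- adjusted_rand_index(truth, one_mistake)
```

The index is invariant to label permutation, which matters here because cluster numbering in a fitted model is arbitrary.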
res <- mmpca_clust_modelselect(simu$dtm.full, Qs = 5:7, Ks = 3:5,
                               Yinit = 'kmeans_lda', init.beta = 'lda',
                               method = 'BBCVEM', max.epochs = 7,
                               nruns = 3, verbose = 1)
best_model = res$models
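To get a rough idea of the cost of the model selection call above, the following sketch counts the candidate (Q, K) pairs in the grid. It assumes, without confirmation from the package documentation, that every pair is fitted nruns times; the actual scheduling inside mmpca_clust_modelselect() may differ.

```r
# Sketch only: size of the (Q, K) grid used in the call above.
Qs <- 5:7
Ks <- 3:5
nruns <- 3

grid <- expand.grid(Q = Qs, K = Ks)  # one row per candidate (Q, K) pair
n_models <- nrow(grid)               # 9 candidate models
total_fits <- n_models * nruns       # 27 fits, ASSUMING nruns fits per pair
```

Because the number of fits grows with the product of the grid dimensions, narrowing Qs and Ks is the most direct way to keep computation manageable.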
Please cite our work using the following reference:
@article{jouvin:hal-02278224,
  TITLE = {{Greedy clustering of count data through a mixture of multinomial PCA}},
  AUTHOR = {Jouvin, Nicolas and Latouche, Pierre and Bouveyron, Charles and Bataillon, Guillaume and Livartowski, Alain},
  URL = {https://hal.archives-ouvertes.fr/hal-02278224},
  NOTE = {31 pages, 10 figures},
  JOURNAL = {{Computational Statistics}},
  PUBLISHER = {{Springer Verlag}},
  YEAR = {2020},
  KEYWORDS = {Dimension reduction ; Topic modeling ; Count data ; Mixture models ; Clustering ; Variational inference},
  HAL_ID = {hal-02278224},
  HAL_VERSION = {v1},
}
Please also consider citing the package itself:
citation('MoMPCA')