MoMPCA - A package for the clustering of count data
In MoMPCA: Inference and Clustering for Mixture of Multinomial Principal Component Analysis

set.seed(42)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(MoMPCA)
library(aricode)

Description

MMPCA is a package to perform clustering of count data based on the mixture of multinomial PCA model. It integrates a dimension reduction aspect by factorizing the multinomial parameters in a latent space, like Latent Dirichlet Allocation of Blei et. al. It specially conceived for low sample high-dimensional data. Due to the intensive nature of the greedy algorithm, it is not suited for large sample size.

Dataset

The package contains attached data in BBCmsg. It consists in 4 text document already preprocessed with the tm package. It is mostly useful for the simulate_BBC() function.

data("BBCmsg")

Demonstration for document clustering

Start by generating data from the MMPCA model with a particular $\theta^\star$ and $\beta^\star$. For more detail, check experimental section of the paper.

N = 200
L = 250
simu <- simulate_BBC(N, L, epsilon = 0, lambda = 1)
Ytruth <- simu$Ytruth

Then perform clustering

t0 <- system.time(res <- mmpca_clust(simu$dtm.full, Q = 6, K = 4,
                          Yinit = 'random',
                          method = 'BBCVEM',
                          max.epochs = 7,
                          keep = 1,
                          verbose = 2,
                          nruns = 2,
                          mc.cores = 2)
               )
print(t0)

Results analysis

tab <- knitr::kable(table(res@clustering, Ytruth), format = 'markdown')
print(tab)
cat('Final ARI is ', aricode::ARI(res@clustering, Ytruth))

Other visualization are also accessible from the plot function. Which takes several arguments

cond = requireNamespace("ggplot2", quietly = TRUE) & requireNamespace("dplyr", quietly = TRUE) & requireNamespace("tidytext", quietly = TRUE)

ggtopics <- plot(res, type = 'topics')
print(ggtopics)

ggbound <- plot(res, type = 'bound')
print(ggbound)

Model selection

The package contains a convenient wrapper around mmpca_clust() which performs model selection over a grid of values for $(K,Q)$. Here is the results for Qs = 5:7 and Ks = 3:5.

t1 <- system.time(res <- mmpca_clust_modelselect(simu$dtm.full, Qs = 5:7, Ks = 3:5,
                          Yinit = 'kmeans_lda',
                          init.beta = 'lda',
                          method = 'BBCVEM',
                          max.epochs = 7,
                          nruns = 3,
                          verbose = 1)
               )
print(t1)
best_model = res$models
print(best_model)

Any scripts or data that you put into this service are public.

MoMPCA documentation built on Jan. 21, 2021, 5:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MoMPCA
Inference and Clustering for Mixture of Multinomial Principal Component Analysis

MoMPCA - A package for the clustering of count data
In MoMPCA: Inference and Clustering for Mixture of Multinomial Principal Component Analysis

Description

Dataset

Demonstration for document clustering

Results analysis

Model selection

Try the MoMPCA package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

MoMPCA Inference and Clustering for Mixture of Multinomial Principal Component Analysis

MoMPCA - A package for the clustering of count data In MoMPCA: Inference and Clustering for Mixture of Multinomial Principal Component Analysis

Description

Dataset

Demonstration for document clustering

Results analysis

Model selection

Try the MoMPCA package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

MoMPCA
Inference and Clustering for Mixture of Multinomial Principal Component Analysis

MoMPCA - A package for the clustering of count data
In MoMPCA: Inference and Clustering for Mixture of Multinomial Principal Component Analysis