Topic modelling is an established and useful unsupervised learning technique for detecting topics in large collections of text. Among the many variants of topic modelling and innovations that have been tried, the classic 'Latent Dirichlet Allocation' (LDA) algorithm remains a powerful tool that yields good results. Yet computing LDA topic models requires significant system resources, particularly when corpora grow large. The biglda package addresses three specific issues that may be outright obstacles to using LDA topic models productively and in a state-of-the-art fashion.
The package addresses issues of computing time and RAM limitations at the different relevant stages when working with topic models:
Data preparation and import: Issues of performance and memory efficiency start to arise when preparing input data for a training algorithm.
Fitting topic models: Computing time for a single topic model may be extensive. It is good practice to evaluate a set of topic models for hyperparameter optimization. So you need a fast implementation of LDA topic modelling to be able to fit a set of topic models within reasonable time.
The cost of interfacing: R is very productive for developing your analysis, but the best implementations of LDA topic modelling are written in Python, Java and C. Transferring data between programming languages is a potential bottleneck and can be quite slow.
Evaluating topic models: Issues of performance and memory efficiency are also relevant when computing indicators to assess topic models. These computations require mathematical operations with very large matrices, and memory may be exhausted easily.
Put more vividly: As you work with increasingly large data, fitting and evaluating topic models can mean hours or days of waiting for a result to emerge, only to see the process crash because memory has been exhausted.
The biglda package addresses performance and memory efficiency issues for R users at all stages. The ParallelTopicModel class from the Machine Learning for Language Toolkit 'mallet' offers a fast implementation of an LDA topic model that yields good results. The purpose of the 'biglda' package is to offer a seamless and fast interface to the Java classes of 'mallet' so that the multicore implementation of the LDA algorithm can be used. The as_LDA() function can be used to map the mallet model on the LDA_Gibbs class from the widely used topicmodels package.
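As a sketch of this mapping (assuming a fitted mallet model object `BTM`, as created in the examples further below):

```r
# Sketch: map a fitted mallet model (`BTM`, created further below in this
# document) onto the S4 topic model class used by the topicmodels ecosystem.
library(biglda)
lda <- as_LDA(BTM, verbose = FALSE)

# Generics from the topicmodels package can then be used as usual,
# e.g. the top five terms per topic:
topicmodels::terms(lda, 5)
```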
The biglda package is a GitHub-only package at this time. Install the stable version as follows.
remotes::install_github("PolMine/biglda")
Install the development version of the package as follows.
remotes::install_github("PolMine/biglda", ref = "dev")
The focus of the biglda package is to offer a seamless interface to Mallet. The biglda::mallet_install() function installs Mallet (v202108) in a directory within the package. The big disadvantage of this installation mechanism is that the Mallet installation is overwritten whenever you install a new version of biglda, so Mallet needs to be installed anew. This is why installing Mallet in the storage location used by your system for add-on software (such as "/opt" on Linux/macOS) is recommended.
We strongly recommend installing the latest version of Mallet (v202108 or higher). Among other things, it includes a new $printDenseDocumentTopics() method of the ParallelTopicModel class used by biglda::save_document_topics(), significantly improving the efficiency of moving data from Java/Mallet to R.
Note that v202108 is a "serialization-breaking release". Instance files and binary models prepared using previous versions cannot be processed with this version and may have to be rebuilt.
On Linux and macOS machines, you may use the following lines of code for installing Mallet. Note that admin privileges ("sudo") may be required.
```{sh sh_install_mallet, eval = FALSE}
mkdir /opt/mallet
cd /opt/mallet
wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
tar xzfv Mallet-202108-bin.tar.gz
rm Mallet-202108-bin.tar.gz
```
Using Mallet with the interface offered by biglda requires a working installation of the rJava package. Unfortunately, installing rJava can cause headaches. A solution that works very often is to reconfigure the R-Java interface by running the following command on the command line.

```{sh javareconf, eval = FALSE}
R CMD javareconf
```
It goes beyond this introduction to list all potential solutions for rJava problems. The prospect that Mallet is a very good and efficient tool may serve as a motivation.
What rJava is as an interface to Java, the reticulate package is as an interface for running Python commands from R. The reticulate package needs to be installed and loaded.
install.packages("reticulate")
library(reticulate)
The reticulate package typically runs Python within a Conda environment. The
easiest way is to use install_miniconda()
and to install Gensim as follows.
reticulate::install_miniconda()
reticulate::conda_install(packages = "gensim")
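A quick way to check that reticulate can see the Gensim installation (assuming the miniconda setup above) is to import it and query its version:

```r
# Import gensim via reticulate and report the installed version;
# this fails early if the Conda environment is not set up correctly.
library(reticulate)
gensim <- import("gensim")
gensim$`__version__`
```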
Note that it is not possible to use the R packages "biglda" and "mallet" at the same time. If "mallet" is loaded, it will put its Java Archive on the classpath of the Java Virtual Machine (JVM), making the latest version of the ParallelTopicModel class inaccessible and causing errors that may be difficult to understand. Therefore, a warning is issued when "biglda" detects that the JAR included in the "mallet" package is on the classpath.
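If you are unsure whether a conflicting mallet JAR is already present, you can inspect the classpath yourself, a minimal check using rJava's .jclassPath():

```r
# Inspect the JVM classpath for a JAR from the "mallet" R package;
# rJava's .jclassPath() returns the current classpath entries.
library(rJava)
.jinit()
any(grepl("mallet", .jclassPath(), ignore.case = TRUE))
```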
options(java.parameters = "-Xmx8g")
Sys.setenv(MALLET_DIR = "/opt/mallet/Mallet-202108")
library(biglda)
When using Mallet for topic modelling, the first step is to prepare a so-called
"instance list" with information on the input documents. In this example, we
use the AssociatedPress
dataset included in the topicmodels package.
data("AssociatedPress", package = "topicmodels")
instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)
We then instantiate a BigTopicModel class object with hyperparameters. This class inherits from the ParallelTopicModel class, the efficient workhorse of Mallet topic modelling, and adds methods that greatly speed up transferring data at the R-Java interface.
BTM <- BigTopicModel(n_topics = 100L, alpha_sum = 5.1, beta = 0.1)
The result (the object BTM) is a Java object reference of class "jobjRef". Methods of this (Java) class are accessible from R and are used to configure the topic modelling engine. The first step is to add the instance list.
BTM$addInstances(instance_list)
We then set the number of iterations to 100 and use just one thread. Note that using multiple threads speeds up topic modelling - see the performance assessment below.
BTM$setNumIterations(100L)
BTM$setNumThreads(1L)
Finally, we control the verbosity of the engine. By default, Mallet issues a status message every 10 iterations and reports the top words for topics every 100 iterations. For the purpose of rendering this document, we turn these progress reports off.
BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
We now fit the topic model and report the time that has elapsed for fitting the model.
started <- Sys.time()
BTM$estimate()
Sys.time() - started
The package includes optimized functionality for evaluating the topic model. Metrics are computed as follows.
lda <- as_LDA(BTM, verbose = FALSE)
N <- BTM$getDocLengthCounts()
data.frame(
  arun2010 = BigArun2010(beta = B(lda), gamma = G(lda), doclengths = N),
  cao2009 = BigCao2009(X = B(lda)),
  deveaud2014 = BigDeveaud2014(beta = B(lda))
)
To use Gensim for topic modelling, first load the reticulate package and import gensim into the Python session.
library(reticulate)
gensim <- reticulate::import("gensim")
The bottleneck for using Gensim for topic modelling from R is the interface
between the R and the Python session. We assume that a DocumentTermMatrix
has been prepared. The biglda package offers the functions dtm_as_bow()
and
dtm_as_dictionary()
to prepare the input data structures required by Gensim
topic modelling.
bow <- dtm_as_bow(AssociatedPress)
dict <- dtm_as_dictionary(AssociatedPress)
We use the multicore implementation of LDA topic modelling with all cores but two.
threads <- parallel::detectCores() - 2L
This is the genuine Python part - running Gensim.
gensim_model <- gensim$models$ldamulticore$LdaMulticore(
  corpus = bow,
  id2word = dict,
  num_topics = 100L,
  iterations = 100L,
  per_word_topics = FALSE,
  workers = as.integer(threads) # required to be an integer
)
Use the as_LDA() function to get the data back from Python/Gensim to R.
lda <- as_LDA(gensim_model, dtm = AssociatedPress, verbose = FALSE)
To convey that biglda is fast, we fit topic models with Mallet using different numbers of cores and compare computing time with topic modelling using topicmodels::LDA(). The corpus used is AssociatedPress as represented by a DocumentTermMatrix included in the topicmodels package. Here, we just run 100 iterations for a limited set of k topics. In "real life", you would have more iterations (typically 1000-2000), but this setup is sufficiently informative on performance.
So we start with the general settings.
k <- 100L
iterations <- 100L
n_cores_max <- parallel::detectCores() - 2L # use all cores but two
data("AssociatedPress", package = "topicmodels")
report <- list()
We then fit topic models with Mallet using an increasing number of cores.
library(biglda)
instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)
for (cores in 1L:n_cores_max){
  mallet_started <- Sys.time()
  BTM <- BigTopicModel(n_topics = k, alpha_sum = 5.1, beta = 0.1)
  BTM$addInstances(instance_list)
  BTM$setNumThreads(cores)
  BTM$setNumIterations(iterations)
  BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
  BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
  BTM$estimate()
  run <- sprintf("mallet_%d", cores)
  report[[as.character(cores)]] <- data.frame(
    tool = run,
    time = as.numeric(difftime(Sys.time(), mallet_started, units = "mins"))
  )
}
And we run the classic implementation of LDA topic modelling in the topicmodels package.
library(topicmodels)
topicmodels_started <- Sys.time()
lda_model <- LDA(
  AssociatedPress,
  k = k,
  method = "Gibbs",
  control = list(iter = iterations)
)
report[["topicmodels"]] <- data.frame(
  tool = "topicmodels",
  time = as.numeric(difftime(Sys.time(), topicmodels_started, units = "mins"))
)
The following chart reports the elapsed time for LDA topic modelling.
library(ggplot2)
ggplot(data = do.call(rbind, report), aes(x = tool, y = time)) +
  geom_bar(stat = "identity")
These are the takeaways from the benchmarks:
Topic modelling with Mallet is fast, and even more so with several cores. So for large data, biglda can make the difference between a week and a day to fit the topic model.
Note that we did not include an evaluation of stm::stm() (structural topic model) here, which has become a widely used state-of-the-art algorithm. The stm package offers very rich analytical possibilities, and the ability to include document metadata goes significantly beyond classic LDA topic modelling. But stm::stm() is significantly slower than topicmodels::LDA(). The stm package does not address big data scenarios very well; that is the specialization of the biglda package.
The great reputation of Gensim notwithstanding, benchmarks for Gensim are significantly weaker than what we see for mallet and topicmodels. The reticulate interface may be the bottleneck; we do not yet know.
The package implements the metrics for topic models of the ldatuning package (Griffiths2004 is still missing) using RcppArmadillo. Based on the following code used for benchmarking, the following chart conveys that the biglda functions BigArun2010(), BigCao2009() and BigDeveaud2014() are significantly faster than their counterparts in the ldatuning package.
library(ldatuning)
metrics_timing <- data.frame(
  pkg = c(rep("ldatuning", times = 3), rep("biglda", times = 3)),
  metric = rep(c("Arun2010", "CaoJuan2009", "Deveaud2014"), times = 2),
  time = c(
    system.time(Arun2010(models = list(lda_model), dtm = AssociatedPress))[3],
    system.time(CaoJuan2009(models = list(lda_model)))[3],
    system.time(Deveaud2014(models = list(lda_model)))[3],
    system.time(BigArun2010(
      beta = B(lda_model),
      gamma = G(lda_model),
      doclengths = BTM$getDocLengthCounts()))[3],
    system.time(BigCao2009(B(lda_model)))[3],
    system.time(BigDeveaud2014(B(lda_model)))[3]
  )
)
ggplot(metrics_timing, aes(fill = pkg, y = time, x = metric)) +
  geom_bar(position = "dodge", stat = "identity")
Note that apart from speed, the RcppArmadillo/C++ implementation is much more memory efficient. Computations of metrics that may fail with ldatuning functionality because memory is exhausted will often work fast and successfully with the biglda package.
The examples here and in the package documentation (including vignettes) need to be minimal working examples to keep computing time low when building the package. In real-life scenarios, we recommend using scripts for topic modelling on large datasets and executing them from the command line. R Markdown documents are an ideal alternative to plain R scripts that can be run with a shell command such as ``.
The R package "mallet" has been around for a while and is the traditional interface to the "mallet" Java package for R users. However, the RTopicModel class is used as the interface to the ParallelTopicModel class, hiding from the R user the method to define the number of threads to be used, thus limiting the potential to speed up the computation of a topic model using multiple threads. What is more, functionality for full access to the computed model is hidden, inhibiting the extraction of the information from the mallet LDA topic model that is required to map the topic model on the LDA_Gibbs class.
The biglda package is designed for heavy-duty work with large data sets. If you work with smaller data, other approaches and implementations that are established and that may be easier to use and to install (the rJava blues!) do a very good job indeed. But biglda is not merely a nice-to-have addition: jobs with big data may just take too long or simply fail because memory is exhausted. We think that biglda pushes the frontier of productively using topic modelling in the R realm.
(Alternative: Command line use of Mallet. Quarto.)
David Mimno, "The Details: Training and Validating Big Models on Big Data": http://journalofdigitalhumanities.org/2-1/the-details-by-david-mimno/