Why

The purpose of the biglda package is to expose the fast, parallelized topic model estimation of the mallet Java toolkit, so that big corpora can be handled efficiently. At this stage, this vignette includes only a minimal example to convey the fundamental workflow.

Installation

The biglda package is a GitHub-only package at this stage. You can install it from the GitHub repository of the PolMine Project:

devtools::install_github("PolMine/biglda")

Getting started

Computing a topic model may require a lot of memory. The memory available to the Java Virtual Machine (JVM) used to interface with mallet must be defined before the JVM is started. As the biglda package initializes a JVM when it is loaded, the respective parameter needs to be set before loading biglda.

options(java.parameters = "-Xmx4g")
library(biglda)
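
To check that the memory setting has taken effect, you can query the maximum heap size of the running JVM via rJava (a minimal sketch, assuming rJava is available; the value reported is typically slightly below the -Xmx value).

rt <- rJava::.jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
rJava::.jcall(rt, "J", "maxMemory") / 1024^3 # maximum heap size in GB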

The mallet Java package is not shipped with the R package biglda, to keep the size of the R package minimal and to avoid licensing complications. Install the mallet Java package if it is not yet installed.

if (!mallet_is_installed()) mallet_install()

To check the installation, get the version of mallet.

mallet_get_version()

Sample Workflow

The biglda package works seamlessly with the polmineR package. In our example, we compute a topic model for the speeches in the GERMAPARLMINI sample corpus, which is included in the polmineR package.

library(polmineR)
use("polmineR")
speeches <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date")

Before training the topic model, the input data need to be converted into mallet's input format, an InstanceList.

instance_list <- as.instance_list(speeches)
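
As a quick sanity check, the number of documents in the InstanceList should match the number of speeches (a sketch, assuming the size() method of mallet's InstanceList class is accessible via rJava).

instance_list$size() # number of documents passed to mallet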

We now instantiate the BigTopicModel class.

lda <- BigTopicModel(n_topics = 25L, alpha_sum = 5.1, beta = 0.1)

The methods of this class are not defined by the biglda package and are not documented there, but in the documentation of the mallet API.
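
If you want to see which Java methods can be called on the object from R, you can list them via rJava (a sketch, assuming the object returned by BigTopicModel() is a jobjRef).

rJava::.jmethods(lda) # list the Java methods of the underlying class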

The next step is to feed the data to be modelled, i.e. the instance_list, into the class.

lda$addInstances(instance_list)

Then we set additional parameters, including the number of threads to be used.

lda$setNumThreads(1L) # number of threads used for estimation
lda$setTopicDisplay(50L, 10L) # report the top 10 words per topic every 50 iterations
lda$setNumIterations(150L) # number of sampling iterations

Use the estimate method to trigger the training of the topic model.

lda$estimate()
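
After estimation, the mallet API can be queried directly, for instance to obtain the log-likelihood of the fitted model (a sketch, assuming the modelLogLikelihood() method of mallet's ParallelTopicModel class, which BigTopicModel builds on).

lda$modelLogLikelihood() # log-likelihood of the fitted model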

There is a broad range of tools available for LDA topic models trained using the topicmodels package. To be able to use them, use the as_LDA() function for type conversion.

topicmodels_lda <- as_LDA(lda)
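
The converted object can then be handled with the generics of the topicmodels package. For instance (a sketch, assuming the terms() method applies to the converted object), inspect the top terms per topic.

library(topicmodels)
terms(topicmodels_lda, 10) # top 10 terms for each topic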

Evaluating topic models

This is a minimal example to demonstrate a workflow for fitting and evaluating a set of LDA topic models. Note that given the stochastic nature of LDA, results differ each time this vignette is built.

library(parallel)
library(stringi)

We use the REUTERS corpus included in the RcppCWB package here to provide a minimal working example.

use("RcppCWB", corpus = "REUTERS")

Given the very limited size of the corpus, results obtained here are not robust.
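
The corpus dimensions (number of tokens and documents) can be checked with polmineR's size() and s_attributes() functions.

size(corpus("REUTERS")) # total number of tokens
length(s_attributes("REUTERS", "id")) # number of documents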

cores_to_use <- detectCores() - 1L # use all available cores but one
discard <- c(tm::stopwords("en"), capitalize(tm::stopwords("en"))) # stopwords in lowercase and capitalized form

instance_list <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word", subset = {!word %in% discard}) %>%
  lapply(tolower) %>%
  sapply(stri_c, collapse = "\n") %>% # stri_c() is faster than paste0()
  as.instance_list()
reuters_metrics_list <- lapply(
  seq(from = 6, to = 30, by = 2),
  function(k){
    message("... fitting model for k: ", k)
    BTM <- BigTopicModel(
      instances = instance_list,
      n_topics = k,
      alpha_sum = 5.1,
      beta = 0.1,
      iterations = 1000L,
      threads = cores_to_use,
      silent = TRUE
    )
    BTM$estimate()
    lda <- as_LDA(BTM, verbose = FALSE)
    metrics(lda, verbose = FALSE)
  }
)
reuters_metrics <- do.call(rbind, reuters_metrics_list)

plot(reuters_metrics)

Perspectives
