knitr::opts_chunk$set(collapse = TRUE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

Distributed LDA (Latent Dirichlet Allocation) can dramatically speed up your analysis by using multiple processors on your computer. The number of topics is small in this example, but the distributed algorithm is highly effective in identifying many topics (k > 100) in a large corpus.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)

Distributed LDA

When batch_size = 0.01, the distributed algorithm allocates 1% of the documents in the corpus to each processor. It uses all available processors by default, but you can limit the number through options(seededlda_threads).
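For example, to restrict the package to a fixed number of threads, set the option before fitting the model (2 threads here is only an illustrative value; adjust it to your machine):

```r
# Limit seededlda to 2 threads; by default it uses all available processors
options(seededlda_threads = 2)
```
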

lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)

Despite the much shorter execution time, it identifies topic terms very similar to those from the standard LDA.

knitr::kable(terms(lda_dist))
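Beyond the topic terms, you can also inspect which topic each document is most strongly associated with using topics(); a minimal sketch, assuming lda_dist is the fitted model from above:

```r
# Most likely topic for each document in the corpus
head(topics(lda_dist))
```
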

Distributed LDA with convergence detection

By default, the algorithm fits LDA through as many as 2000 iterations for reliable results, but we can minimize the number of iterations using the convergence detection mechanism to further speed up the analysis. When auto_iter = TRUE, the algorithm stops inference on convergence (delta < 0) and returns the result.

lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE,
                          verbose = TRUE)
knitr::kable(terms(lda_auto))


koheiw/seededlda documentation built on Jan. 23, 2025, 3:14 p.m.