knitr::opts_chunk$set(collapse = TRUE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

Distributed LDA (Latent Dirichlet Allocation) can dramatically speed up your analysis by using multiple processors on your computer. The number of topics is small in this example, but the distributed algorithm is highly effective in identifying many topics (k > 100) in a large corpus.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)

Distributed LDA

When batch_size = 0.01, the distributed algorithm allocates 1% of the documents in the corpus to each processor. It uses all available processors by default, but you can limit the number through options(seededlda_threads).
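For example, to restrict the package to a fixed number of threads, set the option before fitting the model (2 threads here is only an illustrative value; adjust it to your machine):

```r
# Limit seededlda to 2 threads; by default it uses all available processors
options(seededlda_threads = 2)
```
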

lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)

Despite the much shorter execution time, it identifies topic terms very similar to those from the standard LDA.

knitr::kable(terms(lda_dist))
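Beyond the topic terms, you can also inspect which topic each document is most strongly associated with using topics(); a minimal sketch, assuming lda_dist is the fitted model from above:

```r
# Most likely topic for each document in the corpus
head(topics(lda_dist))
```
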

Distributed LDA with convergence detection

By default, the algorithm fits LDA through as many as 2000 iterations for reliable results, but we can minimize the number of iterations using the convergence detection mechanism to further speed up the analysis. When auto_iter = TRUE, the algorithm stops inference on convergence (delta < 0) and returns the result.

lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE,
                          verbose = TRUE)
knitr::kable(terms(lda_auto))


koheiw/seededlda documentation built on Jan. 23, 2025, 3:14 p.m.