knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)
Distributed LDA (Latent Dirichlet Allocation) can dramatically speed up your analysis by using multiple processors on your computer. The number of topics is small in this example, but the distributed algorithm is highly effective in identifying many topics (k > 100) in a large corpus.
We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.
library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
When batch_size = 0.01, the distributed algorithm allocates 1% of the documents in the corpus to each processor. It uses all the processors by default, but you can limit the number through options(seededlda_threads).
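For example, a minimal sketch of capping the number of threads before fitting; the value 4 is an arbitrary choice for illustration, not a recommendation from the package.

# Limit seededlda to 4 threads; by default it uses all available processors.
# The value 4 is only an example.
options(seededlda_threads = 4)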
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
Despite the much shorter execution time, it identifies topic terms very similar to those of the standard LDA.
knitr::kable(terms(lda_dist))
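If you have also fitted the standard (non-distributed) model from the introduction, one rough way to check this similarity is to measure how many of the distributed model's top terms also appear among the standard model's top terms. The object lda_std below is hypothetical and the overlap measure is only illustrative (topics are not aligned between the two models).

# Hypothetical comparison with a standard LDA fit (not run in this vignette):
# lda_std <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
# Proportion of the distributed model's top terms that also appear
# among the standard model's top terms:
# mean(terms(lda_dist) %in% terms(lda_std))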
By default, the algorithm fits LDA through as many as 2000 iterations for reliable results, but we can minimize the number using the convergence detection mechanism to further speed up the analysis. When auto_iter = TRUE, the algorithm stops inference on convergence (delta < 0) and returns the result.
lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE, verbose = TRUE)
knitr::kable(terms(lda_auto))
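To see how much work the convergence detection saved, you can inspect the fitted object. The element name last_iter is an assumption about how the package records the final iteration count, so check the object's structure for your installed version of seededlda.

# Inspect the fitted object's top-level elements.
str(lda_auto, max.level = 1)
# lda_auto$last_iter  # final iteration count, if your seededlda version stores it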