knitr::opts_chunk$set(collapse = TRUE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

Sequential LDA (Latent Dirichlet Allocation) can identify the topics of sentences more accurately than standard LDA because it takes the topics of the preceding sentence into account when classifying the current sentence. The topics of preceding sentences not only make topic transitions smoother but also help to mitigate data sparsity.
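
Schematically, the collapsed Gibbs sampler of standard LDA draws the topic of word $i$ in sentence $d$ with probability

$$P(z_{d,i} = k \mid \cdot) \propto (n_{d,k} + \alpha)\,\frac{n_{k,w_{d,i}} + \beta}{n_{k,\cdot} + V\beta},$$

where $n_{d,k}$ is the number of words in sentence $d$ currently assigned to topic $k$. The sequential algorithm can be thought of as adding the preceding sentence's topic counts, weighted by $\gamma$, to the first factor, roughly $(n_{d,k} + \gamma\,n_{d-1,k} + \alpha)$, so that a short sentence borrows information from its neighbour instead of relying on its few words alone. This is only a schematic sketch, not the exact formula implemented in the package.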

Preparation

We use the Sputnik corpus on Ukraine, but the preparation in this example differs slightly from the introduction. Since Sequential LDA is designed to classify sentences, we apply corpus_reshape() to segment the texts of the corpus into sentences.

library(seededlda)
library(quanteda)
library(ggmatplot)

corp <- readRDS("data_corpus_sputnik2022.rds") |>
    corpus_reshape(to = "sentences")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

As a result, the DFM has many more documents (`r ndoc(dfmt)`), with the sentence number appended to each document name. Splitting the texts into sentences makes the DFM very sparse (`r round(sparsity(dfmt) * 100, 2)`%).

print(dfmt)
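
The sentence numbers are visible in the document names. A quick check, not part of the original example:

# first few document names: article ID followed by the sentence number
head(docnames(dfmt))
# number of sentence-level documents and overall sparsity
ndoc(dfmt)
sparsity(dfmt)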

Sequential LDA

You can enable the sequential algorithm simply by setting gamma = 0.5. The smaller the value of gamma, the less the topics of the previous sentence affect the classification of the current sentence.

lda_seq <- textmodel_lda(dfmt, k = 10, gamma = 0.5, 
                         batch_size = 0.01, auto_iter = TRUE,
                         verbose = TRUE)
knitr::kable(terms(lda_seq))
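
If you also want the most likely topic of each sentence, rather than the top terms of each topic, topics() extracts it from the fitted model. A small check, not in the original vignette:

# most likely topic of the first ten sentences
head(topics(lda_seq), 10)
# how the sentences are distributed across the topics
table(topics(lda_seq))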

Seeded Sequential LDA

You can also enable the sequential algorithm in Seeded LDA simply by setting gamma > 0.

dict <- dictionary(file = "dictionary.yml")
print(dict)
lda_seed <- textmodel_seededlda(dfmt, dict, residual = 2, gamma = 0.5, 
                                batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)
knitr::kable(terms(lda_seed))

The plot shows the topic probabilities of the first 100 sentences from an article in the corpus. Thanks to the sequential algorithm, adjacent sentences tend to be classified into the same topics.

ggmatplot(lda_seed$theta[paste0("s1104078647.", 1:100),], 
          plot_type = "line", linetype = 1, 
          xlab = "Sentence", ylab = "Probability", legend_title = "Topic")
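
The same pattern can be inspected without plotting by taking the hard topic assignment of each sentence directly from the theta matrix used above. A small illustration with the same article:

# most probable topic for the first 20 sentences of the article
theta_sub <- lda_seed$theta[paste0("s1104078647.", 1:20), ]
colnames(theta_sub)[max.col(theta_sub)]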
