In koheiw/seededlda: Seeded Sequential LDA for Topic Modeling

knitr::opts_chunk$set(collapse = FALSE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

Seeded LDA (Latent Dirichlet Allocation) can identify pre-defined topics in the corpus with a small number of seed words. Seeded LDA is useful when you want to match topics with theoretical concepts in deductive analysis.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)

We will use seed words in a dictionary to define the topics.

dict <- dictionary(file = "dictionary.yml")
print(dict)
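
The contents of dictionary.yml are not shown here. As a rough sketch of what a seed dictionary looks like, the same kind of object can be built inline with quanteda's dictionary(). The keys and seed words below (economy, military and their patterns) are hypothetical and chosen only for illustration; the actual file may define different topics.

```r
library(quanteda)

# Hypothetical keys and seed words, for illustration only;
# the real dictionary.yml may contain different topics.
dict_toy <- dictionary(list(
    economy = c("market*", "money", "bank*"),
    military = c("army", "troop*", "missile*")
))

# Each key becomes one seeded topic in the model
print(names(dict_toy))
```

Glob patterns such as "bank*" match several word forms, which keeps the seed lists short.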

Seeded LDA

textmodel_seededlda() has no k argument because the number of topics is determined by the keys of the dictionary. To speed up the analysis, you can enable the distributed algorithm with batch_size = 0.01 and convergence detection with auto_iter = TRUE.

lda_seed <- textmodel_seededlda(dfmt, dict, batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)

knitr::kable(terms(lda_seed))
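
To illustrate how the dictionary keys determine the topics, here is a minimal sketch on a toy corpus. The documents, the seed words, and the object names (txt, dfmt_toy, dict_toy, lda_toy) are assumptions made for this example and are not part of the Sputnik analysis. A corpus this small will not give meaningful topics; the point is only the shape of the output.

```r
library(quanteda)
library(seededlda)

# Toy documents and hypothetical seed words, for illustration only
txt <- c(d1 = "the bank raised interest rates as the market fell",
         d2 = "troops and tanks moved toward the border",
         d3 = "the market rallied after the bank cut rates",
         d4 = "the army deployed more troops and missiles")
dfmt_toy <- dfm(tokens(txt)) |>
    dfm_remove(stopwords("en"))
dict_toy <- dictionary(list(economy = c("market*", "bank*"),
                            military = c("army", "troop*")))

# No k argument: the two keys define two topics
lda_toy <- textmodel_seededlda(dfmt_toy, dict_toy, max_iter = 100)

# Topic names are taken from the dictionary keys
print(colnames(terms(lda_toy)))

# Most likely topic for each document
print(topics(lda_toy))
```

topics() returns one label per document, so it can be tabulated with table() to see how the documents are distributed across the seeded topics.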

Seeded LDA with residual topics

Seeded LDA can have both seeded and unseeded topics. If residual = 2, two unseeded topics are added to the model. You can change the name of these topics through options(seededlda_residual_name).

lda_res <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01,
                               auto_iter = TRUE, verbose = TRUE)

knitr::kable(terms(lda_res))
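
The sketch below shows how the residual topics and the naming option might be used together on a toy corpus. The documents, the seed words, and the label "unseeded" are assumptions made for this example; only the residual argument and the seededlda_residual_name option come from the text above.

```r
library(quanteda)
library(seededlda)

# Toy documents and hypothetical seed words, for illustration only
txt <- c(d1 = "the bank raised interest rates as the market fell",
         d2 = "troops and tanks moved toward the border",
         d3 = "the market rallied after the bank cut rates",
         d4 = "the army deployed more troops and missiles")
dfmt_toy <- dfm(tokens(txt)) |>
    dfm_remove(stopwords("en"))
dict_toy <- dictionary(list(economy = c("market*", "bank*"),
                            military = c("army", "troop*")))

# "unseeded" is an arbitrary label chosen for this sketch;
# keep the previous option value so it can be restored afterwards
old <- options(seededlda_residual_name = "unseeded")
lda_toy_res <- textmodel_seededlda(dfmt_toy, dict_toy, residual = 2,
                                   max_iter = 100)
options(old)

# Two seeded topics plus two unseeded topics
print(colnames(terms(lda_toy_res)))
```

Restoring the option after fitting keeps the custom label from leaking into other models fitted in the same session.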


koheiw/seededlda documentation built on Jan. 23, 2025, 3:14 p.m.
