In koheiw/seededlda: Seeded Sequential LDA for Topic Modeling

knitr::opts_chunk$set(collapse = FALSE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

seededlda was created mainly for semi-supervised topic modeling, but it can also perform unsupervised topic modeling. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.

Preparation

We use a corpus of Sputnik articles about Ukraine in the examples. In preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*"), and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).

if (!file.exists("data_corpus_sputnik2022.rds")) {
    download.file("https://www.dropbox.com/s/abme18nlrwxgmz8/data_corpus_sputnik2022.rds?dl=1",
                  "data_corpus_sputnik2022.rds", mode = "wb")
}

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
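To see what the document-frequency filter does, here is a minimal sketch on a toy corpus. The texts and the 0.75 threshold are invented for illustration and are not part of the Sputnik example; only dfm(), docfreq(), ndoc(), dfm_trim() and featnames() from quanteda are used.

```r
library(quanteda)

# Toy corpus (hypothetical): "ukraine" occurs in all four documents,
# "talks" in two, and every other word in only one.
txt <- c(d1 = "ukraine talks gas",
         d2 = "ukraine talks grain",
         d3 = "ukraine sanctions",
         d4 = "ukraine nato")
dfmt_toy <- dfm(tokens(txt))

# Proportion of documents that contain each feature
prop <- docfreq(dfmt_toy) / ndoc(dfmt_toy)
print(prop)

# Drop features that occur in more than 75% of the documents
dfmt_trimmed <- dfm_trim(dfmt_toy, max_docfreq = 0.75, docfreq_type = "prop")
kept <- featnames(dfmt_trimmed)
print(kept)
```

Words that appear in nearly every document, such as "ukraine" here, carry little information for separating topics, which is why the real example applies the stricter 10% threshold.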

Standard LDA

You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, the function shows the progress of the inference through iterations. Fitting LDA on a large corpus takes a long time, but the distributed algorithm can speed up the analysis considerably.

lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
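As a rough sketch of how the distributed algorithm might be switched on, the example below fits a small model on quanteda's built-in inaugural corpus rather than the Sputnik data. It assumes that the installed version of seededlda provides the batch_size argument (the proportion of documents in each batch); the values of k, max_iter and batch_size are arbitrary choices for a quick run, not recommendations.

```r
library(seededlda)
library(quanteda)

# Small built-in corpus so the example runs quickly
toks_toy <- tokens(data_corpus_inaugural, remove_punct = TRUE,
                   remove_numbers = TRUE)
dfmt_toy <- dfm(toks_toy) |>
    dfm_remove(stopwords("en"))

# Assumption: batch_size < 1 splits the corpus into batches that are
# processed in parallel; batch_size = 1 (all documents) disables this.
lda_toy <- textmodel_lda(dfmt_toy, k = 5, max_iter = 200, batch_size = 0.25)

# One column of top terms per topic
print(dim(terms(lda_toy)))
```

On a corpus this small the batches bring no real benefit; the setting only pays off when there are many documents.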

Topic terms

Once the model is fitted, you can interpret the topics by reading their most salient words. terms() returns a matrix with the most frequent words in each topic at the top.

knitr::kable(terms(lda))
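The idea behind ranking words within topics can be illustrated in base R. The matrix below is a made-up topic-word distribution, and the ranking code is only a sketch of the concept, not the package's actual implementation of terms().

```r
# Hypothetical topic-word probabilities (rows: topics, columns: words)
phi <- rbind(topic1 = c(gas = 0.5, nato = 0.1, grain = 0.3, talks = 0.1),
             topic2 = c(gas = 0.1, nato = 0.6, grain = 0.1, talks = 0.2))

# For each topic, sort words by probability and keep the top two
top_terms <- apply(phi, 1, function(p) {
    names(sort(p, decreasing = TRUE))[1:2]
})
print(top_terms)
```

Each column of the result lists one topic's words from most to least probable, which is the same layout as the table above.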

Document topics

You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM stored in the fitted object (lda$data) and saving the topics in that data.frame.

dat <- docvars(lda$data)
dat$topic <- topics(lda)

knitr::kable(head(dat[, c("date", "topic", "head")], 10))
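Conceptually, assigning one topic per document amounts to picking the most likely topic from each document's topic distribution. The base R sketch below uses an invented document-topic matrix to show that step and how the result can be tabulated; it illustrates the idea and is not the code behind topics().

```r
# Hypothetical document-topic probabilities (rows: documents, columns: topics)
theta <- rbind(doc1 = c(topic1 = 0.7, topic2 = 0.3),
               doc2 = c(topic1 = 0.2, topic2 = 0.8),
               doc3 = c(topic1 = 0.6, topic2 = 0.4))

# Pick the most likely topic for each document
best <- max.col(theta, ties.method = "first")
topic <- factor(colnames(theta)[best], levels = colnames(theta))
names(topic) <- rownames(theta)

# Count the documents assigned to each topic
freq <- table(topic)
print(freq)
```

Applying table() to dat$topic in the same way gives a quick overview of how the articles are distributed across topics.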

References

Heinrich, G. (2008). Parameter estimation for text analysis. http://www.arbylon.net/publications/text-est.pdf
Watanabe, K., & Baturo, A. (2023). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review. https://doi.org/10.1177/08944393231178605

koheiw/seededlda documentation built on Jan. 23, 2025, 3:14 p.m.
