knitr::opts_chunk$set(collapse = FALSE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

seededlda was created mainly for semi-supervised topic modeling, but it can also perform unsupervised topic modeling. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.

Preparation

We use the corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*"), and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).

if (!file.exists("data_corpus_sputnik2022.rds")) {
    download.file("https://www.dropbox.com/s/abme18nlrwxgmz8/data_corpus_sputnik2022.rds?dl=1",
                  "data_corpus_sputnik2022.rds", mode = "wb")
}
library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
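
To make the effect of the document-frequency threshold concrete, here is a minimal sketch on a made-up corpus of four documents. The texts and the object names (toy, dfmt_toy, prop_toy, dfmt_trim) are illustrative and not part of the Sputnik data, and the threshold is loosened to 0.5 so that the tiny example has something to remove.

```r
library(quanteda)

# Hypothetical toy corpus: "ukraine" occurs in 3 of 4 documents
toy <- c(d1 = "ukraine russia talks",
         d2 = "ukraine gas prices",
         d3 = "ukraine grain export",
         d4 = "nato summit")
dfmt_toy <- dfm(tokens(toy))

# Share of documents in which each word occurs
prop_toy <- docfreq(dfmt_toy) / ndoc(dfmt_toy)
print(prop_toy)

# Drop words that occur in more than 50% of the documents
dfmt_trim <- dfm_trim(dfmt_toy, max_docfreq = 0.5, docfreq_type = "prop")
print(featnames(dfmt_trim))
```

Only "ukraine" exceeds the threshold here, so it is the only word dropped. Words that appear in most documents tend to contribute little to separating topics, which is presumably why the preprocessing above trims them.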

Standard LDA

You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, it shows the progress of the inference through iterations. Fitting LDA on a large corpus takes a long time, but the distributed algorithm will speed up your analysis dramatically.

lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
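
The fitted object can also be inspected directly. The following is only a sketch under stated assumptions: it fits a small model on quanteda's built-in data_corpus_inaugural instead of the Sputnik corpus, with k = 5 and a reduced max_iter so that it runs quickly, and the names toks_toy, dfmt_toy and lda_toy are made up for the example. It looks at theta (documents by topics) and phi (topics by words), the two matrices stored in the model.

```r
library(quanteda)
library(seededlda)

# Built-in corpus used here only so that the sketch runs without a download
toks_toy <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
dfmt_toy <- dfm(toks_toy) |>
    dfm_remove(stopwords("en")) |>
    dfm_trim(min_termfreq = 5)

# Fewer iterations than the default, chosen for speed rather than model quality
lda_toy <- textmodel_lda(dfmt_toy, k = 5, max_iter = 200)

# theta has one row per document and one column per topic
print(dim(lda_toy$theta))
print(round(head(lda_toy$theta, 3), 2))

# phi has one row per topic and one column per word
print(dim(lda_toy$phi))
```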

Topic terms

Once the model is fit, you can interpret the topics by reading their most salient words. terms() returns a matrix with one column per topic, in which the most frequent words of each topic appear at the top.

knitr::kable(terms(lda))
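
If fewer or more words per topic are needed, terms() takes an n argument. The sketch below assumes a small model fitted on quanteda's built-in data_corpus_inaugural (the names dfmt_toy, lda_toy and top_toy are illustrative), and compares the result with the largest values in the first row of phi, which is where the word ranking comes from.

```r
library(quanteda)
library(seededlda)

# Small illustrative model, not the Sputnik model from the text
dfmt_toy <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE)) |>
    dfm_remove(stopwords("en")) |>
    dfm_trim(min_termfreq = 5)
lda_toy <- textmodel_lda(dfmt_toy, k = 5, max_iter = 200)

# Top 5 words per topic: one column per topic, one row per rank
top_toy <- terms(lda_toy, n = 5)
print(top_toy)

# The words with the highest probability in the first topic
print(head(sort(lda_toy$phi[1, ], decreasing = TRUE), 5))
```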

Document topics

You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM stored in the fitted object (lda$data) and saving the topics in the resulting data.frame.

dat <- docvars(lda$data)
dat$topic <- topics(lda)
knitr::kable(head(dat[, c("date", "topic", "head")], 10))
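
Once the topics are stored next to the document variables, ordinary data.frame tools apply. As a minimal sketch, again assuming a small model fitted on quanteda's built-in data_corpus_inaugural rather than the Sputnik corpus (dfmt_toy, lda_toy and dat_toy are made-up names, and Year and President are document variables of that corpus), the documents per topic can be counted with table():

```r
library(quanteda)
library(seededlda)

# Small illustrative model, not the Sputnik model from the text
dfmt_toy <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE)) |>
    dfm_remove(stopwords("en")) |>
    dfm_trim(min_termfreq = 5)
lda_toy <- textmodel_lda(dfmt_toy, k = 5, max_iter = 200)

# Attach the predicted topic to the document variables
dat_toy <- docvars(lda_toy$data)
dat_toy$topic <- topics(lda_toy)

# Number of documents assigned to each topic
print(table(dat_toy$topic))
print(head(dat_toy[, c("Year", "President", "topic")], 5))
```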


koheiw/seededlda documentation built on Jan. 23, 2025, 3:14 p.m.