knitr::opts_chunk$set(collapse = FALSE, comment = "#>", fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)
seededlda was created mainly for semi-supervised topic modeling, but it can perform unsupervised topic modeling too. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; I discuss semi-supervised LDA on a separate page.
We use the corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*") and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).
if (!file.exists("data_corpus_sputnik2022.rds")) {
    download.file("https://www.dropbox.com/s/abme18nlrwxgmz8/data_corpus_sputnik2022.rds?dl=1",
                  "data_corpus_sputnik2022.rds", mode = "wb")
}
library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
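To see what trimming by document frequency does, here is a minimal sketch on a toy corpus. The texts, the object names (txt, dfmt_toy, dfmt_trim) and the threshold of 0.6 are made up for illustration and are not part of the Sputnik example.

```r
library(quanteda)

# Toy corpus (hypothetical): "ukraine" appears in every document
txt <- c(d1 = "ukraine grain export",
         d2 = "ukraine energy sanctions",
         d3 = "ukraine grain harvest",
         d4 = "ukraine peace talks")
dfmt_toy <- dfm(tokens(txt))

# Share of documents in which each feature occurs
print(docfreq(dfmt_toy) / ndoc(dfmt_toy))

# Drop features that occur in more than 60% of the documents
dfmt_trim <- dfm_trim(dfmt_toy, max_docfreq = 0.6, docfreq_type = "prop")
print(featnames(dfmt_trim))
```

Here "ukraine" occurs in all four documents and is dropped, while "grain" occurs in half of them and is kept. Very common words like this carry little information for separating topics, which is why the real example uses a stricter threshold.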
You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, the function shows the progress of the inference through iterations. It takes a long time to fit LDA on a large corpus, but the distributed algorithm will speed up your analysis dramatically.
lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
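If you want to try the function before fitting the full corpus, a small run like the following sketch finishes almost instantly. The toy texts and the names txt, dfmt_toy and lda_toy are hypothetical; I also assume that the fitted object stores the document-topic and topic-word distributions in theta and phi, as recent versions of seededlda do.

```r
library(seededlda)
library(quanteda)

# Toy documents (hypothetical) that mix two themes: energy and military
txt <- c("gas oil price energy market",
         "oil gas pipeline energy export",
         "troops army front attack forces",
         "army forces attack troops border",
         "energy price gas market export",
         "front border troops army attack")
dfmt_toy <- dfm(tokens(txt))

# Fewer topics and iterations than in the real example to keep it fast
lda_toy <- textmodel_lda(dfmt_toy, k = 2, max_iter = 200, verbose = FALSE)

print(dim(lda_toy$theta))  # documents x topics
print(dim(lda_toy$phi))    # topics x features
```

With so few documents the topics themselves are not meaningful; the point is only to check the shape of the fitted object.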
Once the model is fit, you can interpret the topics by reading their most salient words. terms() shows the words that are most frequent in each topic at the top of the matrix.
knitr::kable(terms(lda))
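Conceptually, a table of top terms ranks the words within each topic by their weight. The base R sketch below illustrates the idea with a made-up topic-word matrix; phi and top_terms() are hypothetical names, and this is not how seededlda implements terms().

```r
# Hypothetical topic-word probabilities (rows: topics, columns: words)
phi <- rbind(
  topic1 = c(gas = 0.50, oil = 0.30, troops = 0.05, talks = 0.15),
  topic2 = c(gas = 0.05, oil = 0.05, troops = 0.60, talks = 0.30)
)

# For each topic, sort words by decreasing probability and keep the top n
top_terms <- function(phi, n = 2) {
  apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

res <- top_terms(phi, n = 2)
print(res)
```

The result has one column per topic, with the highest-ranked word in the first row, which is the same layout as the matrix shown above.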
You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM in the fitted object (lda$data) and saving the topics in the data.frame.
dat <- docvars(lda$data)
dat$topic <- topics(lda)
knitr::kable(head(dat[,c("date", "topic", "head")], 10))
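Assigning one topic per document amounts to picking the topic with the largest share in each document. The sketch below shows this with a made-up document-topic matrix in base R; theta and top_topic are hypothetical names, and the code only illustrates the idea rather than reproducing topics().

```r
# Hypothetical document-topic proportions (rows: documents, columns: topics)
theta <- rbind(
  doc1 = c(topic1 = 0.8, topic2 = 0.2),
  doc2 = c(topic1 = 0.3, topic2 = 0.7),
  doc3 = c(topic1 = 0.1, topic2 = 0.9)
)

# Pick the most likely topic in each row and keep it as a factor
idx <- max.col(theta, ties.method = "first")
top_topic <- factor(colnames(theta)[idx], levels = colnames(theta))
names(top_topic) <- rownames(theta)

# Count how many documents fall into each topic
print(table(top_topic))
```

Once the topics are stored in a data.frame, as in dat above, the same kind of tabulation gives a quick overview of how the documents are distributed across topics.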