Seeded LDA (Latent Dirichlet Allocation) can identify pre-defined topics in the corpus with a small number of seed words. Seeded LDA is useful when you want to match topics with theoretical concepts in deductive analysis.
We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.
library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
We will use seed words in a dictionary to define the topics.
dict <- dictionary(file = "dictionary.yml")
print(dict)
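The contents of dictionary.yml are not shown here, so the following is only a minimal sketch of what a seed dictionary might look like when built inline with quanteda's dictionary(). The object name dict_sketch, the keys, and the seed words are all hypothetical and are not the ones used in this analysis.

```r
library(quanteda)

# Hypothetical keys and seed words, for illustration only;
# they are NOT the contents of dictionary.yml.
dict_sketch <- dictionary(list(
    economy  = c("market*", "money", "bank*"),
    politics = c("parliament*", "election*", "vote*")
))

# Each key becomes one seeded topic, so the number of keys
# is the number of seeded topics in the model.
names(dict_sketch)
length(dict_sketch)
```

Glob patterns such as "bank*" let a single seed entry match several word forms.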
The function textmodel_seededlda() does not have a k argument because it determines the number of topics from the dictionary keys. You can enable the distributed algorithm (batch_size = 0.01) and convergence detection (auto_iter = TRUE) to speed up the analysis.
lda_seed <- textmodel_seededlda(dfmt, dict, batch_size = 0.01, auto_iter = TRUE, verbose = TRUE)
knitr::kable(terms(lda_seed))
Seeded LDA can have both seeded and unseeded topics. If residual = 2, two unseeded topics are added to the model. You can change the name of these topics through options(seededlda_residual_name).
lda_res <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01, auto_iter = TRUE, verbose = TRUE)
knitr::kable(terms(lda_res))
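As a small sketch of the renaming step, the option can be set with base R's options() before fitting. The value "unseeded" is an arbitrary choice for illustration, and the assumption here is that the option is read when the model is fitted, so an existing model would need to be refitted to pick up the new name.

```r
# "unseeded" is an illustrative value (an assumption), not a required name.
old <- options(seededlda_residual_name = "unseeded")

# Confirm the option is set.
getOption("seededlda_residual_name")

# Refit after setting the option so the new name is used, e.g.:
# lda_res <- textmodel_seededlda(dfmt, dict, residual = 2)

# To restore the previous setting afterwards:
# options(old)
```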