knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# mods <- readRDS("../mod.RDS")
This vignette shows how to use topicl to find stable (reproducible) topics with Jaccard similarity. The method is presented in (Koltsov et al., 2018) and can be briefly summarized as follows.
A stable topic is a topic that persists across fits of a topic model to the data. Thus, keeping model parameters fixed, a stable topic must appear in a majority of fits. Across fits, we compare topics to each other to test whether a given pair is the same. For the comparison, we compute Jaccard similarity on the N top words of each topic. Papers by Koltsov and colleagues advise fitting at least five models and considering a given pair of topics identical if their Jaccard similarity is >= 90%. Then, if a topic persists across at least 3 out of 5 model runs, you can consider it stable.
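To make the comparison step concrete, here is a minimal base R sketch of Jaccard similarity between two sets of top terms. The function name `jaccard_similarity()` and the example terms are made up for illustration; this is not topicl's internal code.

```r
# Jaccard similarity between two sets of top terms:
# size of the intersection divided by size of the union.
jaccard_similarity <- function(terms_a, terms_b) {
  terms_a <- unique(terms_a)
  terms_b <- unique(terms_b)
  length(intersect(terms_a, terms_b)) / length(union(terms_a, terms_b))
}

# Hypothetical top-5 terms for two topics from different fits
topic_a <- c("tax", "budget", "economy", "spending", "deficit")
topic_b <- c("tax", "budget", "economy", "jobs", "deficit")

# 4 shared terms out of 6 distinct terms
jaccard_similarity(topic_a, topic_b)
```

Note that when both topics contribute the same number of top terms n and share k of them, the similarity equals k / (2n - k). With n = 10, nine shared terms give 9/11 (about 0.82), so a 90% threshold on 10 top terms is only met by identical term sets.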
In this vignette, we will fit five STM models to poliblog5k (see ?stm::poliblog5k for details) and consider how many topics persist using topicl.
library(topicl)
library(stm)
library(tictoc)
library(furrr)
library(dplyr)
fit_stm specifies the STM parameters. We increase the maximum number of iterations for the EM algorithm and ensure that every iteration is performed by setting max_it = 1000 and emtol = 0 (see ?stm for details). In my experience, with an increased number of iterations on these data, topic similarity across fits also increases. I cannot say if K = 25 is optimal for these data, but this number of topics is used in the STM documentation.
With https://random.org, we generated five random seeds, one for each model.
fit_stm <- function(seed) {
  K <- 25        # number of topics
  max_it <- 1000 # maximum number of EM iterations
  emtol <- 0     # convergence tolerance of 0, so EM runs all iterations
  mod <- stm(poliblog5k.docs, poliblog5k.voc,
             K = K,
             data = poliblog5k.meta,
             max.em.its = max_it,
             emtol = emtol,
             init.type = "Random",
             seed = seed,
             verbose = FALSE)
  return(mod)
}

seeds <- c(9934, 9576, 1563, 3379, 8505)
With furrr::future_map(), we fit five models with different seeds in parallel. Given the high number of iterations, it can take up to 40 minutes.
tic()
plan(multisession, workers = 5) # toggle multiple parallel processes
mods <- future_map(seeds, fit_stm, .options = furrr_options(seed = TRUE))
plan(sequential) # toggle back to sequential
toc()
We use topicl::compare_solutions() to calculate topic similarity on the depth = 10 top terms of each topic. By default, top terms are produced by ranking term probabilities. Please note that (Koltsov et al., 2018) advise using 100 top terms.
results <- compare_solutions(mods, depth = 10)
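As an illustration of what "ranking term probabilities" means, the base R sketch below picks the top terms of a single topic from a toy probability vector. The probabilities and the helper name `top_n_terms()` are invented for this example and do not reflect how topicl extracts terms from a fitted model.

```r
# Toy topic-term probabilities for one topic (made-up numbers)
term_probs <- c(tax = 0.30, budget = 0.25, war = 0.05,
                economy = 0.20, vote = 0.12, court = 0.08)

# Rank terms from most to least probable and keep the first `depth` of them
top_n_terms <- function(probs, depth) {
  ranked <- sort(probs, decreasing = TRUE)
  names(ranked)[seq_len(min(depth, length(ranked)))]
}

top_n_terms(term_probs, depth = 3)
```

A larger depth brings lower-probability terms into the comparison, which is one reason similarity scores can differ between depth = 10 and depth = 100.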
compare_solutions() produces a table showing the results of the comparisons. Here is how to read this output: each row shows a comparison of topic A (column topic_id_A) from model A (column model_id_A) to topic B from model B (columns topic_id_B and model_id_B).
results |> arrange(desc(jaccard)) |> head(10)
If model and topic IDs from the table above are confusing, they can be matched back to mods, which is a list containing the models in the order of the model IDs used by compare_solutions().
print(mods)
Topic IDs match the IDs inside each model. To manually check the terms used for the comparison of a given pair, you can use top_terms(). Let's compare the terms from topic 13 of model 1 to topic 1 of model 4, which showed 100% Jaccard similarity in the output of compare_solutions().
top_terms(mods[[1]], topic_id = 13, n_terms = 10)
top_terms(mods[[4]], topic_id = 1, n_terms = 10)
It is helpful to visualize similar topics that persist across fits with a network plot. Given the output of topicl::compare_solutions() and a threshold, topicl::viz_comparisons() will plot a topic-topic network. Here, we plot connections between topics with >= 80% similarity.
viz_comparisons(results, 0.8)
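To show how a threshold turns pairwise comparisons into network edges, here is a small base R sketch on a toy table. The table reuses the column names described earlier, but its values are made up, and the filtering shown is an assumption about the general idea rather than the internals of viz_comparisons().

```r
# Toy comparison table (made-up values, same column names as above)
toy_results <- data.frame(
  model_id_A = c(1, 1, 2, 1),
  topic_id_A = c(13, 13, 7, 2),
  model_id_B = c(4, 2, 4, 3),
  topic_id_B = c(1, 7, 1, 9),
  jaccard    = c(1.00, 0.82, 0.82, 0.33)
)

# Keep only pairs at or above the similarity threshold;
# each remaining row is one edge between two topics
threshold <- 0.8
edges <- toy_results[toy_results$jaccard >= threshold, ]
edges
```

In this toy example, three of the four pairs survive the 0.8 threshold, and together they connect the same topic across models 1, 2, and 4.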
With topicl::filter_topics(), we can get a table with topic IDs and counts of their distinct pairs in other models.
filter_topics(results, 0.5, 2)
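The idea of counting, for each topic, in how many other models it has a match can be sketched in base R as follows. The toy table, the 0.8 threshold, and the column name `n_models_matched` are assumptions for illustration only; this is not the implementation of filter_topics().

```r
# Toy comparison table (made-up values)
toy_results <- data.frame(
  model_id_A = c(1, 1, 2, 3, 1),
  topic_id_A = c(13, 13, 7, 5, 2),
  model_id_B = c(4, 2, 4, 5, 3),
  topic_id_B = c(1, 7, 1, 8, 9),
  jaccard    = c(1.00, 0.82, 0.82, 0.90, 0.33)
)

# Keep pairs that pass the similarity threshold
matched <- toy_results[toy_results$jaccard >= 0.8, ]

# List every matched pair in both directions so each topic appears once as the focal topic
pairs <- rbind(
  data.frame(model = matched$model_id_A, topic = matched$topic_id_A,
             other_model = matched$model_id_B),
  data.frame(model = matched$model_id_B, topic = matched$topic_id_B,
             other_model = matched$model_id_A)
)

# Count the distinct other models in which each topic has a match
counts <- aggregate(other_model ~ model + topic, data = pairs,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "n_models_matched"

# A topic matched in at least 2 other models is present in 3 of the 5 fits
stable <- counts[counts$n_models_matched >= 2, ]
stable
```

Here the topic shared by models 1, 2, and 4 is kept, while the pair found only in models 3 and 5 falls short of the 3-out-of-5 criterion.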
sessionInfo()