In paskn/topicl: Tools for Finding Stable Topics and Analyzing them

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

#mods <- readRDS("../mod.RDS")

This vignette shows how to use topicl to find stable (reproducible) topics with Jaccard similarity. The method is presented in Koltsov et al. (2018) and can be briefly summarized as follows.

A stable topic is a topic that persists across repeated fits of a topic model to the same data. Thus, keeping model parameters fixed, a stable topic must appear in a majority of fits. Across fits, we compare topics to each other to test whether a given pair is the same. For the comparison, we compute Jaccard similarity on the top N words of each topic. Papers by Koltsov and colleagues advise fitting at least five models and considering a given pair of topics identical if their Jaccard similarity is >= 90%. Then, if a topic persists across at least 3 out of 5 model runs, you can consider it stable.
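To make the comparison concrete, here is a minimal base R sketch of Jaccard similarity between two sets of top terms. The function name jaccard_similarity() and the term lists are made up for illustration; this is not topicl's internal implementation.

```r
# Jaccard similarity between two sets of top terms:
# size of the intersection divided by size of the union
jaccard_similarity <- function(terms_a, terms_b) {
  a <- unique(terms_a)
  b <- unique(terms_b)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical top-5 terms for two topics from different fits
topic_a <- c("obama", "mccain", "campaign", "vote", "poll")
topic_b <- c("obama", "mccain", "campaign", "vote", "senat")

# 4 shared terms out of 6 distinct terms
jaccard_similarity(topic_a, topic_b)
```

Note that the threshold interacts with N: two lists of 10 terms that differ in a single term share 9 of 11 distinct terms, which is about 82%, so a 90% threshold on 10 terms only accepts identical term sets.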

In this vignette, we will fit five STM models to poliblog5k (see ?stm::poliblog5k for details) and use topicl to check how many topics persist.

library(topicl)
library(stm)
library(tictoc)
library(furrr)
library(dplyr)

fit_stm() wraps the call to stm() and fixes the model parameters. We raise the maximum number of EM iterations to max_it = 1000 and set the convergence tolerance to emtol = 0 so that each iteration is performed (see ?stm for details). In my experience, increasing the number of iterations on these data also increases topic similarity across fits. I cannot say whether K = 25 is optimal for these data, but this number of topics is used in the STM documentation.

With https://random.org, we generated five random seeds, one for each model.

fit_stm <- function(seed) {
  K <- 25        # number of topics
  max_it <- 1000 # maximum number of EM iterations
  emtol <- 0     # convergence tolerance

  mod <- stm(poliblog5k.docs,
             poliblog5k.voc, K = K,
             data = poliblog5k.meta,
             max.em.its = max_it,
             emtol = emtol,
             init.type = "Random",
             seed = seed,
             verbose = FALSE)
  return(mod)
}


seeds <- c(9934,9576,1563,3379,8505)

With furrr::future_map(), we fit five models with different seeds in parallel. Given the high number of iterations, it can take up to 40 minutes.

tic()
plan(multisession, workers = 5) # toggle multiple parallel processes
mods <- future_map(seeds, fit_stm,
                   .options = furrr_options(seed = TRUE))
plan(sequential) # toggle back to sequential
toc()
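Because fitting can take a long time, it may be convenient to store the fitted models on disk and reload them in later sessions. The sketch below uses base R's saveRDS() and readRDS(). The helper cached_fit() and the file path are hypothetical, and a cheap stand-in replaces the real fitting step so that the example runs quickly.

```r
# Hypothetical helper: reuse a saved result if the file exists,
# otherwise run the fitting function and save what it returns.
cached_fit <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fit_fun()
  saveRDS(result, path)
  result
}

# Stand-in for the expensive fitting step (an assumption for this sketch)
cache_path <- tempfile(fileext = ".RDS")
first  <- cached_fit(cache_path, function() list(seed = 9934))

# The second call reads from disk, so its fitting function is never run
second <- cached_fit(cache_path, function() stop("should not refit"))
identical(first, second)
```

In practice, fit_fun would wrap the future_map() call above, and path would point to a permanent location rather than a temporary file.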

We use topicl::compare_solutions() to calculate topic similarity using the depth = 10 top terms of each topic. By default, top terms are produced by ranking term probabilities. Please note that Koltsov et al. (2018) advise using 100 top terms.

results <- compare_solutions(mods, depth = 10)
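As an illustration of what ranking term probabilities means, the base R sketch below picks the most probable terms from a toy topic-word probability matrix. The matrix values and the helper top_terms_by_prob() are invented for this example and are not part of topicl or stm.

```r
# Toy topic-word probability matrix: rows are topics, columns are terms.
# Values are made up for illustration; each row sums to 1.
beta <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.20, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("tax", "vote", "war", "iraq"))
)

# Return the `depth` most probable terms for one topic
top_terms_by_prob <- function(beta, topic_id, depth) {
  ranked <- order(beta[topic_id, ], decreasing = TRUE)
  colnames(beta)[ranked[seq_len(depth)]]
}

top_terms_by_prob(beta, topic_id = 1, depth = 2)
top_terms_by_prob(beta, topic_id = 2, depth = 2)
```

A larger depth makes the comparison less sensitive to a single term changing rank, which is one reason to try more than 10 terms.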

compare_solutions() produces a table showing the results of the comparisons. Here is how to read this output: each row shows a comparison of topic A (column topic_id_A) from model A (column model_id_A) to topic B from model B (columns topic_id_B and model_id_B).

results |> 
  arrange(desc(jaccard)) |> 
  head(10)

If the model and topic IDs in the table above are confusing, they can be matched back to mods, a list that contains the models in the order of the model IDs used by compare_solutions().

print(mods)

Topic IDs match IDs inside each model. To manually check the terms used for comparison of a given pair, you can use top_terms(). Let's compare the terms from topic 13 of model 1 to topic 1 from model 4, which showed 100% Jaccard similarity in the output of compare_solutions().

top_terms(mods[[1]], topic_id = 13, n_terms = 10)

top_terms(mods[[4]], topic_id = 1, n_terms = 10)

It is helpful to visualize similar topics that persist across fits as a network plot. Given the output of topicl::compare_solutions() and a threshold, topicl::viz_comparisons() plots a topic-topic network. Here, we plot connections between topics with Jaccard similarity >= 80%.

viz_comparisons(results, 0.8)

With topicl::filter_topics(), we can get a table with topic IDs and counts of their distinct pairs in other models.

filter_topics(results, 0.5, 2)
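The idea behind such counts can be sketched in base R on a toy comparison table. The data below are invented, only the column names follow the compare_solutions() output described above, and the logic is an assumption about how the 3-out-of-5 rule could be applied, not a description of how filter_topics() works internally.

```r
# Toy comparison table with the same column names as the output above;
# all values are made up for illustration.
comparisons <- data.frame(
  model_id_A = c(1, 1, 1, 2),
  topic_id_A = c(13, 13, 7, 4),
  model_id_B = c(2, 4, 3, 3),
  topic_id_B = c(5, 1, 2, 9),
  jaccard    = c(0.95, 1.00, 0.40, 0.92)
)

threshold <- 0.9
matched <- comparisons[comparisons$jaccard >= threshold, ]

# Each comparison row links two topics, so count it from both sides
links <- rbind(
  data.frame(model = matched$model_id_A, topic = matched$topic_id_A,
             other_model = matched$model_id_B),
  data.frame(model = matched$model_id_B, topic = matched$topic_id_B,
             other_model = matched$model_id_A)
)

# Number of distinct other models in which each topic has a match
counts <- aggregate(other_model ~ model + topic, data = links,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "n_other_models"

# Present in at least 3 of 5 fits = its own model plus at least 2 others
counts$stable <- counts$n_other_models + 1 >= 3
counts
```

In this toy table only topic 13 of model 1 is matched in two other models, so it is the only topic flagged as stable. The sketch counts direct matches only and does not follow chains of matches between third-party topics.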

References

Koltsov, S., Pashakhin, S., & Dokuka, S. (2018). A Full-Cycle Methodology for News Topic Modeling and User Feedback Research. In S. Staab, O. Koltsova, & D. I. Ignatov (Eds.), Social Informatics (Vol. 11185, pp. 308–321). Springer International Publishing. https://doi.org/10.1007/978-3-030-01129-1_19