In MansMeg/tidytopics: Tidy Topic Modeling Toolbox for R

# Should the full vignette be run or a fast version.
# The full version run all code and store the results to run faster later.
run_full_vignette <- TRUE

Introduction

Bayesian checking of topic models has been proposed by Mimno and Blei (2011) as a way of study how well a model fits data and to identify potential areas of model improvements. For more details, see Mimno and Blei (2011) here.

We start out by loading a trained topic model with 50 topics on the State of the Union Addresses.

library(tidytopics)
library(dplyr)
library(ggplot2)
data(sotu50)
sotu50

The next step is to extract the top terms. In this example we use the standard metric, the probability of a type given a topic, to identify the top 10 terms by topic. See ?top_term for details and other key term extractions methods.

top_keys <- top_terms(sotu50, "type_probability")
top_keys

We start out by calculate the mutual information between types and documents for each topic $MI(W,g|k)$ using the mi() function. The default is to use documents as grouping variable ($MI(W,D|k)$).

observed_mi <- mi(sotu50)
ggplot(data = observed_mi, aes(x = mi)) + geom_histogram(bins = 20)

To calculate the instantaneous mutual information ($IMI(w,g|k)$) we use the imi() function. We can calulate the IMI for all terms, but subseting to the most interesting terms speed up the computations. As with MI, the default grouping is documents ($IMI(w,D|k)$).

observed_imi <- imi(state = sotu50, w = top_keys)
observed_imi

We can plot the IMI values for the top key terms using the ggplot_imi_type() function.

plt <- ggplot_imi_type(observed_imi, topic = 1)
plt

The objects is a ggplot2 objects so we can easily change it however we want. Allthough laying boxplots needs to invert the coordinates so we need to label the y axis with xlab() and vice versa.

Simulate draws of tokens from the posterior

As the next step we want to study how well these values would fit given the actual model. Here this is implemented the same way as in Mimno and Blei (2011), based on one Gibbs draw and using only the sufficient statistic $n_{kw}$ for $\Phi$ to simulate the model parameters (not using $\beta$).

We simulate 10 draws from the model this way by using the function sample_tokens_given_topics() and calculate the imi() and mi() for each draw. This is mainly to save memory, so the calculation take some time to run.

if(run_full_vignette){
  no_draws <- 25
  replicated_imi <- list()
  replicated_mi <- list()
  for(i in 1:no_draws){
    sim_tokens <- sample_tokens_given_topics(sotu50)
    replicated_imi[[i]] <- imi(sim_tokens, w = top_keys)
    replicated_mi[[i]] <- mi(sim_tokens)
  }
  replicated_imi <- bind_rows(replicated_imi)
  replicated_mi <- bind_rows(replicated_mi)
  save(replicated_imi, replicated_mi, file = "bayesian_checking_1.Rdata")
} else {
  load(file = "bayesian_checking_1.Rdata")
}

no_draws <- 25
replicated_imi <- list()
replicated_mi <- list()
for(i in 1:no_draws){
  sim_tokens <- sample_tokens_given_topics(sotu50)
  replicated_imi[[i]] <- imi(sim_tokens, w = top_keys)
  replicated_mi[[i]] <- mi(sim_tokens)
}
replicated_imi <- bind_rows(replicated_imi)
replicated_mi <- bind_rows(replicated_mi)

Using the simulated tokens we can plot the IMI for data and compare it to the IMI of the simulated draws.

plt <- ggplot_imi_type(observed_imi, replicated_imi, topic = 1)
plt

We can now also use the replicated MI to study how the similated deviance is different from the observed value. We look at the MI for topic 1 and can compare it with replicated values.

ggplot(data = filter(replicated_mi, topic == 1), aes(x = mi)) +
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = filter(observed_mi, topic == 1)[["mi"]])

Based on this large difference between observed and replicated MI we can calculate the deviance of each topic as the standardized distance between the replicated values and the true value using the function mi_deviance().

mi_dev <- mi_deviance(observed_mi, replicated_mi)
mi_dev

Searching for systematic deviations

With these tools we can search for systematic discrepancies in the data that is not caught by the model. In this example we study the grouping of the State of the Union Addresses by president and by year. We start by reading in this meta data and add it to the topic model state file.

To compare the results of the presidential grouping we will also create a random grouping into 43 groups.

data(sotu)
doc_pres <- 
  transmute(sotu, doc, president, 
            random_group = sample(1:43, nrow(sotu), replace = TRUE))
sotu50 <- left_join(sotu50, y = doc_pres, by = "doc")

We start out by calculate the grouped IMI and MI by using presidents and the random group as the grouping variable $g$ instead of documents that we used earlier.

observed_mi_president <- mi(sotu50, g = "president")
observed_mi_random_group <- mi(sotu50, g = "random_group")
observed_imi_president <- imi(sotu50, g = "president", w = top_keys)
observed_imi_random_group <- imi(sotu50, g = "random_group", w = top_keys)

We use this grouping to calculate MI on simulated data from the model. Once again this may take some time to calculate.

if(run_full_vignette){
  no_draws <- 25
  replicated_mi_president <- list()
  replicated_mi_random_group <- list()
  replicated_imi_president <- list()
  replicated_imi_random_group <- list()
  for(i in 1:no_draws){
    sim_tokens <- sample_tokens_given_topics(sotu50)
    replicated_mi_president[[i]] <- 
      mi(sim_tokens, g = "president")
    replicated_imi_president[[i]] <- 
      imi(sim_tokens, g = "president", w = top_keys)
    replicated_mi_random_group[[i]] <- 
      mi(sim_tokens, g = "president")
    replicated_imi_random_group[[i]] <- 
      imi(sim_tokens, g = "president", w = top_keys)
  }
  replicated_mi_president <- bind_rows(replicated_mi_president)
  replicated_mi_random_group <- bind_rows(replicated_mi_president)
  replicated_imi_president <- bind_rows(replicated_imi_president)
  replicated_imi_random_group <- bind_rows(replicated_imi_random_group)
  save(replicated_mi_president, replicated_mi_random_group, replicated_imi_president, replicated_imi_random_group, file = "bayesian_checking_2.Rdata")
} else {
  load("bayesian_checking_2.Rdata")
}

no_draws <- 25
replicated_mi_president <- list()
replicated_mi_random_group <- list()
replicated_imi_president <- list()
replicated_imi_random_group <- list()
for(i in 1:no_draws){
  sim_tokens <- sample_tokens_given_topics(sotu50)
  replicated_mi_president[[i]] <- 
    mi(sim_tokens, g = "president")
  replicated_imi_president[[i]] <- 
    imi(sim_tokens, g = "president", w = top_keys)
  replicated_mi_random_group[[i]] <- 
    mi(sim_tokens, g = "president")
  replicated_imi_random_group[[i]] <- 
    imi(sim_tokens, g = "president", w = top_keys)
}
replicated_mi_president <- bind_rows(replicated_mi_president)
replicated_mi_random_group <- bind_rows(replicated_mi_president)
replicated_imi_president <- bind_rows(replicated_imi_president)
replicated_imi_random_group <- bind_rows(replicated_imi_random_group)

Based on the grouped MI calculations we can study how the deviance is distributed and compare it to documents and the random group as the grouping variable.

mi_dev_pres <- 
  mi_deviance(observed_mi_president, replicated_mi_president) %>%
  mutate(g = "president")
mi_dev_rg <- 
  mi_deviance(observed_mi_random_group, replicated_mi_random_group) %>%
  mutate(g = "random_group")
mi_dev <- 
  mi_deviance(observed_mi, replicated_mi) %>%
  mutate(g = "doc")  %>%
  bind_rows(mi_dev_pres, mi_dev_rg)

ggplot(data = mi_dev, aes(x = g, y = deviance)) + 
  geom_boxplot() + 
  coord_flip()

As we can see, the deviance of grouping by president is much higher than a random grouping with equal number of groups. Indicating that presidents is a grouping variable that we should try to account for in modeling.

We can also study the IMI of, as an example, topic 1 using president as the grouping variable compared to using each document and a random grouping.

ggplot_imi_type(observed_imi = observed_imi_president, replicated_imi = replicated_imi_president, topic = 1)

ggplot_imi_type(observed_imi = observed_imi_random_group, replicated_imi = replicated_imi_random_group, topic = 1)

This indicates that the usage of "panama", "canal", "isthmus", "nicaragua" is obviously more common in some preseident addresses, indicating a temporal pattern in these words for this topic. While more general terms such as "central", "america" do not change over time and presidents.

References

Mimno, D., & Blei, D. (2011). Bayesian checking for topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 227-237). Association for Computational Linguistics.