knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 
set.seed(params$id)

This vignette applies topic alignment to data generated from a true LDA model. This corresponds to the first simulation in the manuscript accompanying this package. The arguments to this vignette (which can be modified in the original rmarkdown's YAML) are,

The packages used in this vignette are given below. Note that we also load simulation functions from an external github repository (not this package). This provides the simulate_lda function used to generate the data for this example.

library(MCMCpack)
library(alto)
library(tidyverse)
source("https://raw.githubusercontent.com/krisrs1128/topic_align/main/simulations/simulation_functions.R")
my_theme()

The block below simulates data x according to an LDA models with parameters specified above. The topics are relatively sparse, with $\lambda_{\beta} = 0.1$ and $\lambda_{\gamma} = 0.5$. Each sample has 10,000 counts.

attach(params)
lambdas <- list(beta = 0.1, gamma = .5, count = 1e4)
betas <- rdirichlet(K, rep(lambdas$beta, V))
gammas <- rdirichlet(N, rep(lambdas$gamma, K))
x <- simulate_lda(betas, gammas, lambda = lambdas$count)

Next, we run the LDA models and compute the alignment.

lda_params <- map(1:n_models, ~ list(k = .))
names(lda_params) <- str_c("K", 1:n_models)
alignment <- x %>%
  run_lda_models(lda_params, reset = TRUE, dir = "./fits/lda_", seed = as.integer(id)) %>%
  align_topics(method = method)

We can plot the flow diagram and compare the height-weight words across topics.

plot(alignment)
plot_beta(alignment, c(2, 4))

We can plot measures of topic quality across $m$. The introduction of a low quality topic at $K = 4$ is consistent with the fact that there are only 3 true topics in this data.

ggplot(topics(alignment), aes(m, coherence)) +
  geom_point(alpha = 0.5)

ggplot(topics(alignment), aes(m, refinement)) +
  geom_point(alpha = 0.5)

We can also compute the number of key topics associated with the alignment.

key_topics <- compute_number_of_paths(alignment) %>%
  mutate(id = id)

Finally, we save all the results (if wanted). The distinct IDs allow us to gather alignments from across many replicates, and these are what are shown in the simulation section of the manuscript.

id_vars <- params[c("out_dir", "method", "id", "N", "V", "K")]
scores <- topics(alignment) %>%
  mutate(id = id)

if (save) {
  dir.create(out_dir, recursive = TRUE)
  write_csv(key_topics, save_str("key_topics", id_vars))
  write_csv(scores, save_str("topics", id_vars))
  exper <- list(x, betas, gammas, alignment)
  save(exper, file = save_str("exper", id_vars, "rda"))
}
if (perplexity & save) {
  x_new <- simulate_lda(betas, gammas, lambda = lambdas$count)
  perplexities <- matrix(nrow = n_models - 1, ncol = 2, dimnames = list(NULL, c("train", "test")))
  for (k in seq(2, n_models)) {
    load(str_c("fits/lda_K", k, ".Rdata"))
    perplexities[k - 1, 1] <- topicmodels::perplexity(tm, x)
    perplexities[k - 1, 2] <- topicmodels::perplexity(tm, x_new)
  }

  cbind(K = seq(2, n_models), perplexities) %>%
    as_tibble() %>%
    write_csv(save_str("perplexity", id_vars))
}


lasy/alto documentation built on June 23, 2024, 6:45 a.m.