knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(params$id)

This vignette demonstrates the use of alignment to characterize topics that are contaminated by a kind of background noise. It corresponds to the second simulation in the manuscript accompanying this package. The arguments to this vignette (which can be modified in the original rmarkdown's YAML) are,

We load packages. The sim_gradient function that generates the contaminated LDA model is sourced from the link below.

library(alto)
library(tidyverse)
library(MCMCpack)
source("https://raw.githubusercontent.com/krisrs1128/topic_align/main/simulations/simulation_functions.R")
my_theme()

The block below simulates from the contaminated LDA model. The $\lambda_{\beta}$ and $\lambda_{\gamma}$ parameters are as in the LDA simulation in the sim_lda.Rmd vignette. The $\lambda_{\nu}$ parameter specifies the hyperparameter of the Dirichlet distribution used in the background noise.

attach(params)
lambdas <- list(beta = 0.1, gamma = 0.5, nu = 0.5, count = 1e4)
sim_data_ <- simulate_gradient(2 * N, K, V, lambdas, alpha = params$alpha)
sim_data <- sim_data_
sim_data$x <- sim_data$x[1:N, ]
sim_data$gamma <- sim_data$gamma[1:N, ]

Next, we run a sequence of LDA models and compute an alignment. This code is identical across vignettes.

lda_params <- map(1:n_models, ~ list(k = .))
names(lda_params) <- str_c("K", 1:n_models)
alignment <- sim_data$x %>%
  run_lda_models(lda_params, reset = TRUE, dir = "./fits/background_",  seed = as.integer(id)) %>%
  align_topics(method = params$method)

Next, we extract the data that summarize the quality of topics emerging from the alignment.

scores <- topics(alignment)  %>%
  mutate(id = params$id)
key_topics <- compute_number_of_paths(alignment)

By saving these data into different directories, we can gather results across a variety of $\alpha$'s.

id_vars <- params[c("out_dir", "method", "alpha", "id", "N", "V", "K")]
if (params$save) {
  dir.create(params$out_dir, recursive = TRUE)
  write_csv(scores, save_str("scores", id_vars))
  write_csv(key_topics, save_str("key_topics", id_vars))
  exper <- list(sim_data, alignment)
  save(exper, file = save_str("exper", id_vars, "rda"))
}
if (params$perplexity && params$save) {
  perplexities <- matrix(nrow = params$n_models - 1, ncol = 2, dimnames = list(NULL, c("train", "test")))
  for (k in seq(2, params$n_models)) {
    load(str_c("fits/background_K", k, ".Rdata"))
    perplexities[k - 1, 1] <- topicmodels::perplexity(tm, sim_data$x)
    perplexities[k - 1, 2] <- topicmodels::perplexity(tm, sim_data_$x[(N + 1):(2 * N), ])
  }

  cbind(K = seq(2, params$n_models), perplexities) %>%
    as_tibble() %>%
    write_csv(save_str("perplexity", id_vars))
}


lasy/alto documentation built on June 23, 2024, 6:45 a.m.