This vignette illustrates the use of alignment on a toy strain switching simulation. The arguments to this vignette are
K
: The true number of topics underlying the simulated data. In the
manuscript, $K = 5$.N
: The number of samples (i.e., documents) to simulate. In the manuscript, this is set to 250.S
: The number of species that are subject to strain switching. Larger $S$
makes strain switching easier to identify.V
: The number of dimensions (i.e. vocabulary size) per sample. In the manuscript, this is set to 1000.method
: The alignment strategy to pass to align_topics
.n_models
: The total number of models to fit to the simulated data. In the
manuscript, this is set to 10.out_dir
: If results are saved, where should they be saved to?save
: Should any results be saved?knitr::opts_chunk$set(collapse = TRUE, comment = "#>") set.seed(params$rep + params$S)
The packages used in this simulation are given below. The strain switching data
is generated by the equivalence_data
function below. This function is sourced
by the external R script linked below.
library(alto) library(tidyverse) source("https://raw.githubusercontent.com/krisrs1128/topic_align/main/simulations/simulation_functions.R") my_theme()
We simulate the strain switching data below. The parameters for the underlying LDA model are identical to those in the other vignettes. The parameter $\lambda_{\alpha}$ controls the sparsity of the topics simulated on the subset $S$ of switched species.
attach(params) lambdas <- list(beta = 0.1, gamma = 0.5, count = 1e4, alpha = 0.3) sim_data_ <- equivalence_data(2 * N, V, K, lambdas, S) # train and test sim_data <- sim_data_ sim_data$x <- sim_data_$x[1:N, ] sim_data$gammas <- sim_data_$gammas[1:N, ]
The block below runs a sequence of models and aligns their topics. This is identical to the parallel blocks in other simulations.
lda_params <- map(1:n_models, ~ list(k = .)) names(lda_params) <- str_c("K", 1:n_models) alignment <- sim_data$x %>% run_lda_models(lda_params, reset = TRUE, dir = "fits/switching_", seed = as.integer(rep)) %>% align_topics(method = params$method)
Next, we compute the cosine similarity between the true and the estimated topics. These measurements are used to evaluate the specificity of learned topics at different resolutions. They are shown as a sequence of increasingly fine-grained heatmaps in the accompanying manuscript.
beta_hat <- alto:::plot_beta_layout(alignment)$betas perturbed_topics <- do.call(rbind, sim_data$perturbed_topics) similarities <- cosine_similarity(dplyr::select(beta_hat, -m), perturbed_topics) %>% as_tibble() %>% mutate( m = beta_hat$m, rep = params$rep, S = params$S )
We can visualize the learned topics at several levels of granularity. The similarities between topics 2-4 and 3-5 in model 5 suggest that they emerge from switched versions of the same functional community.
plot_beta(alignment) plot_beta(alignment, c("K3", "K5"))
This interpretation is reinforced by the branching structure of the flow diagram. Strain switched topics tend to share a substantial fraction of their ancestor topic weight.
plot(alignment)
To ease interpretation of strain specificity across a large number of simulations, we compute a summary of the difference in cosine similarities between pairs of switched topics. This data frame is merged across simulations to provide the summary figure in the manuscript.
sensitivity <- data.frame( diff = sum(abs(similarities[, 2] - similarities[, 1]) + abs(similarities[, 4] - similarities[, 3])), subset_size = params$S, rep = params$rep )
Finally, we save both the raw experimental output and the derived summaries.
id_vars <- params[c("out_dir", "method", "S", "rep", "N", "V", "K")] if (params$save) { dir.create(params$out_dir, recursive = TRUE) write_csv(similarities, save_str("similarities", id_vars)) write_csv(sensitivity, save_str("sensitivity", id_vars)) exper <- list(sim_data, alignment) save(exper, file = save_str("exper", id_vars, "rda")) }
if (params$perplexity && params$save) { x_new <- sim_data_$x[(N + 1):(2 * N), ] perplexities <- matrix(nrow = params$n_models - 1, ncol = 2, dimnames = list(NULL, c("train", "test"))) for (k in seq(2, params$n_models)) { load(str_c("fits/switching_K", k, ".Rdata")) perplexities[k - 1, 1] <- topicmodels::perplexity(tm, sim_data$x) perplexities[k - 1, 2] <- topicmodels::perplexity(tm, x_new) } cbind(K = seq(2, params$n_models), perplexities) %>% as_tibble() %>% write_csv(save_str("perplexity", id_vars)) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.