knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
This vignette describes the main functions of the alto package on a simple
simulated dataset. We will introduce:

- run_lda_models for fitting a sequence of LDA models to a fixed dataset
- align_topics for aligning topics across a set of fitted models
- plot and plot_beta for displaying the results

First, we load some packages that are used in this vignette.
library(alto)
library(purrr)
library(stringr)
set.seed(123)
Next, we simulate a dataset (50 samples with 20 dimensions each) and fit
a sequence of LDA models to it. The data, x, are independent multinomial draws,
so there is no low-rank structure for any model to learn. Nonetheless, it is sufficient for illustrating the package.
The arguments to run_lda_models are the matrix x and a list of lists,
lda_params, which specifies how hyperparameters are to be varied across the
sequence of models. The $i^{th}$ element of the list contains hyperparameters
that will be used in the $i^{th}$ model fit. Any hyperparameters that are
accepted by topicmodels::LDA can be passed into these internal lists. We use
map to construct a list whose $i^{th}$ element specifies that the $i^{th}$ LDA
model should fit $k = i$ topics.
x <- rmultinom(20, 500, rep(0.1, 50))
colnames(x) <- seq_len(ncol(x))
lda_params <- setNames(map(1:10, ~ list(k = .)), 1:10)
lda_models <- run_lda_models(x, lda_params)
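As a quick check of the inputs, the sketch below rebuilds x and lda_params using base R only, so it runs without alto or purrr. The lapply call is an assumed stand-in for the map call above and should produce a list with the same structure.

```r
set.seed(123)

# rmultinom(n, size, prob) returns a length(prob) x n matrix: here 50 x 20
x <- rmultinom(20, 500, rep(0.1, 50))
colnames(x) <- seq_len(ncol(x))
dim(x)

# Base-R stand-in for the purrr::map() call: one hyperparameter list per model
lda_params <- setNames(lapply(1:10, function(k) list(k = k)), 1:10)
str(lda_params[1:2])
```

Each column of x is one multinomial draw of size 500, while the 50 rows play the role of samples in the model fits.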
The output of run_lda_models is a list of two-element lists, one per model. The first element is the matrix of mixed-membership estimates $\gamma$. This is an $N \times K$ matrix specifying the membership of each sample across the $K$ topics. The second element is the matrix of estimated topic log-probabilities, $\log\beta$. This is a $K \times D$ matrix whose rows give the log probabilities for each dimension within that topic.
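To make these shapes concrete, here is a small sketch that builds a toy model entry by hand. The values are random and are not produced by alto. The element names gamma and beta match how the entries are accessed later in this vignette, and the sizes N, K, and D are arbitrary choices for illustration.

```r
set.seed(1)
N <- 6; K <- 3; D <- 4

# Toy mixed memberships: each row is one sample's distribution over K topics
gamma <- matrix(rexp(N * K), N, K)
gamma <- gamma / rowSums(gamma)

# Toy topics stored as log-probabilities: each row is one topic over D dimensions
beta <- matrix(rexp(K * D), K, D)
log_beta <- log(beta / rowSums(beta))

toy_model <- list(gamma = gamma, beta = log_beta)

# Rows of gamma sum to 1, and rows of exp(beta) sum to 1
rowSums(toy_model$gamma)
rowSums(exp(toy_model$beta))
```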
Given this output, we can compute a topic alignment. This is done by the
align_topics function. By default, the product method is used to compute
weights (below, we show how to use the transport or more custom methods).
This function returns an object of class alignment. Its print method shows the
topic weights between pairs of topics across models. The first two columns give
the index $m$ of the models from the input. The next two columns give the index
$k$ of topics within those models. For example, the first row gives the
alignment between the first topics estimated in the $K = 1$ and $K = 2$ models.
Since there is no low-rank structure in this model, all outgoing weights from a
given topic are equal.
result <- align_topics(lda_models)
result
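The sketch below illustrates one way a product-style weight could be computed between two models. It assumes that the weight between topic $k$ in one model and topic $k'$ in the next is the inner product of the corresponding columns of $\gamma$. This is an assumption made for illustration, not a statement of how align_topics is implemented, and random_gamma is a hypothetical helper.

```r
set.seed(2)
N <- 50

# Hypothetical helper: random N x K membership matrix with rows summing to 1
random_gamma <- function(N, K) {
  g <- matrix(rexp(N * K), N, K)
  g / rowSums(g)
}

gamma_2 <- random_gamma(N, 2)  # stand-in for memberships from a K = 2 model
gamma_3 <- random_gamma(N, 3)  # stand-in for memberships from a K = 3 model

# Assumed product weights: entry (k, k') is sum_i gamma_2[i, k] * gamma_3[i, k']
w <- crossprod(gamma_2, gamma_3)
dim(w)

# Outgoing weights from each K = 2 topic add up to that topic's column sum
rowSums(w)
colSums(gamma_2)
```

Under this assumption, the weights leaving a topic always sum to that topic's total membership, because each row of the second membership matrix sums to 1.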
We can access the full weights (not just the first few rows) by using the
weights method.
weights(result)
The align_topics function also computes a few per-topic measures. These can be
accessed using topics. For example, mass describes the total mass $\sum_{i =
1}^{N}\gamma_{ik}^{m}$ for topic $k$ in model $m$. The prop column normalizes
the mass column according to the total mass within that model. branch
specifies which overall branch each topic belongs to (this corresponds to colors
in the flow diagrams below). Coherence and refinement are complementary measures
of topic quality. For details on their properties, please refer to the
manuscript accompanying this package.
topics(result)
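The mass and prop definitions can be reproduced by hand. The sketch below uses a toy membership matrix, not alto output, and computes both quantities directly from the formulas above.

```r
set.seed(3)
N <- 8; K <- 3

# Toy N x K membership matrix with rows summing to 1 (stand-in for one model's gamma)
gamma <- matrix(rexp(N * K), N, K)
gamma <- gamma / rowSums(gamma)

mass <- colSums(gamma)     # total membership assigned to each topic
prop <- mass / sum(mass)   # share of the model's total mass held by each topic

data.frame(k = seq_len(K), mass = mass, prop = prop)
```

Because every row of the membership matrix sums to 1, the masses within a model add up to the number of samples.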
An alternative measure is the key_topics measure. This looks at how many
topics have similar descendants across resolution levels, and it can be accessed
using the compute_number_of_paths function:
compute_number_of_paths(result, plot=TRUE)
The information in the weights and topics functions can be displayed
visually using the alignment class' plot method. Each column in the flow
diagram corresponds to one model, and each rectangle gives a topic. The height
of each rectangle gives the topic mass defined above. The size of the edges
between rectangles corresponds to the weight of that edge in the alignment. In
this multinomial dataset, the perfect "fanning" structure suggests that there is
no topic structure -- there are no emergent branches. This makes sense, since
there is no low-rank structure when simulating a 50-dimensional multinomial.
plot(result)
By default, plot shades each topic and edge according to its branch
membership. We could alternatively color by refinement, coherence, or topic ID.
plot(result, color_by = "refinement")
plot(result, color_by = "coherence")
plot(result, color_by = "topic")
To understand the content of the topics, we can use plot_beta. This shows the
probabilities $\beta_{kd}^{m}$ across topics and models. Dimensions are sorted
from those with the highest distinctiveness across topics to those with the
lowest. By default, all dimensions $d$ with at least one $\beta_{kd}^{m} >
0.001$ will be displayed. This can be adjusted using the threshold parameter.
For clarity, circles with $\beta_{kd}^{m}$ below the threshold are not shown.
plot_beta(result)
plot_beta(result, threshold = 0.05)
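The filtering rule described above can be sketched in base R. This is only an illustration on a made-up topic matrix, not the code plot_beta runs internally. The variable name cutoff is a placeholder.

```r
# Toy topic matrix: 2 topics over 4 named dimensions (values are made up)
beta <- rbind(
  c(0.6000, 0.3995, 0.0003, 0.0002),
  c(0.0004, 0.5000, 0.4990, 0.0006)
)
colnames(beta) <- paste0("d", 1:4)
log_beta <- log(beta)  # stored as log-probabilities, like the fitted topics

cutoff <- 0.001

# Keep a dimension if at least one topic gives it probability above the cutoff
keep <- apply(exp(log_beta) > cutoff, 2, any)
names(keep)[keep]
```

Here d4 is dropped because neither topic assigns it a probability above the cutoff, while d3 is kept on the strength of the second topic alone.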
Alternatively, we can filter the number of dimensions shown using the
n_features parameter.
plot_beta(result, n_features = 5)
Finally, we can visualize topics associated with subsets of models, either by
specifying the name of that model in the lda_models list or giving the model
index.
plot_beta(result, c(2, 5, 10))
plot_beta(result, c("2", "5", "10"))
plot_beta(result, "last")
We can compute an alignment with the transport approach by setting the method
argument.
result <- align_topics(lda_models, method = "transport")
plot(result)
align_topics computes an alignment sequentially across the input gamma and
beta lists. In principle, an alignment can be computed between arbitrary pairs
of topics. This functionality is supported by the align_graph function. For
example, we can fit an alignment between all pairs of topics, across all models.
edges <- alto:::setup_edges("all", names(lda_models))
gamma <- map(lda_models, ~ .$gamma)
beta <- map(lda_models, ~ .$beta)
align_graph(edges, gamma, beta, transport_weights)
Finally, arbitrary weight methods can be passed to align_graph. The only
requirement is that, given pairs of matrices of mixed-memberships and log-topics
between two models, the function must return a data.frame of weights between all
pairs of topics. The transport_weights function above is an example:
transport_weights(gamma[1:2], beta[1:2])
Here is a dummy example that always returns a weight of 0 between all pairs of topics.
dummy_weights <- function(gamma, beta) {
  # one row per topic in the first model, one column per topic in the second
  zeros <- matrix(0, ncol(gamma[[1]]), ncol(gamma[[2]]))
  alto:::.lengthen_weights(data.frame(zeros))
}
align_graph(edges, gamma, beta, weight_fun = dummy_weights)