imi_check | R Documentation |
This function provides a way to check the fit of the topic model at the level of individual words in topics by comparing the observed instantaneous mutual information (IMI) for those words to scores derived from simulations from the posterior. Large deviations from the simulated values may indicate a poorer fit. In particular, large negative deviations indicate words which are more uniformly distributed across documents than the model expects (e.g., boilerplate text appearing in every document), and large positive deviations indicate words which are more sharply localized than the model expects.
imi_check(m, k, words, groups = NULL, n_reps = 20)
m |
|
k |
topic number (calculations are only done for one topic at a time) |
words |
vector of words to calculate IMI values for |
groups |
optional grouping factor for documents. If supplied, the IMI values will be for words over groups rather than over individual documents |
n_reps |
number of simulations |
For a given topic k, a simulation draws a new term-document matrix from
the posterior for d. Since a topic is simply a multinomial distribution
over the words, for a given document d we simply draw the same number
of samples from this multinomial as there were words allocated to topic
k in d in the model we are checking. Under the assumptions of the
model, this is how the distribution p(w, d|k) arises. With this
simulated topic-specific term-document matrix in hand, we recalculate the IMI
scores for the given words. The process is replicated to obtain a
reference distribution against which to compare the values from imi_topic.
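The simulation step described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the vocabulary, the topic distribution phi_k, the per-document counts n_dk, and the helper names simulate_tdm, entropy, and imi_word are all hypothetical. The IMI formula used here, H(D|k) - H(D|w,k), is an assumption of the sketch rather than something taken from the package source.

```r
set.seed(1)

# Hypothetical topic k: a multinomial distribution over a toy vocabulary
vocab <- c("whale", "ship", "sea", "the")
phi_k <- c(0.4, 0.3, 0.2, 0.1)

# Hypothetical number of words allocated to topic k in each of 5 documents
n_dk <- c(12, 5, 30, 7, 18)

# Draw one simulated topic-specific term-document matrix: for each document,
# sample as many words from the topic as were allocated to topic k there
simulate_tdm <- function(phi, n_d, words) {
  tdm <- vapply(n_d, function(n) rmultinom(1, size = n, prob = phi)[, 1],
                numeric(length(phi)))
  rownames(tdm) <- words
  tdm
}

# Shannon entropy in bits, skipping zero-probability cells
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Assumed IMI definition: H(D | k) - H(D | w, k)
imi_word <- function(tdm, w) {
  n_w <- sum(tdm[w, ])
  if (n_w == 0) return(NA_real_)  # word never drawn in this simulation
  entropy(colSums(tdm) / sum(tdm)) - entropy(tdm[w, ] / n_w)
}

# Replicate to build a reference distribution for one word
n_reps <- 20
tdm_sim <- simulate_tdm(phi_k, n_dk, vocab)
imi_sim <- replicate(n_reps, imi_word(simulate_tdm(phi_k, n_dk, vocab), "whale"))
```

Because every simulated matrix keeps the per-document totals fixed at n_dk, only the allocation of words within each document varies from one replication to the next.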
A reasonable way to make the comparison is to standardize the "actual" IMI values by the mean and standard deviation of the simulated values. Mimno and Blei (2011) call this the "deviance" measure, recommending it over p values because the latter are likely to vanish.
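As a minimal sketch of this standardization, assuming a vector of observed IMI values and a matrix of simulated values with one row per word (all numbers below are made up for illustration):

```r
# Hypothetical observed IMI values for three words
imi_actual <- c(whale = 1.2, ship = 0.4, the = 0.1)

# Hypothetical simulated IMI values: one row per word, one column per replication
imi_sim <- rbind(
  whale = c(0.9, 1.1, 1.0, 1.2, 0.8),
  ship  = c(0.5, 0.3, 0.4, 0.6, 0.2),
  the   = c(0.6, 0.7, 0.5, 0.8, 0.4)
)

# Standardize each observed value by the mean and standard deviation
# of that word's simulated values
deviance <- (imi_actual - rowMeans(imi_sim)) / apply(imi_sim, 1, sd)
deviance
```

In this toy input, "whale" comes out positive (more localized than simulated), "ship" is essentially zero, and "the" is strongly negative (more uniformly spread than simulated).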
A data frame with word, imi, and deviance columns. The latter is the IMI
standardized by the mean and standard deviation of the simulated values. The
matrix of simulated values (one row per word) is available as the "simulated"
attribute of the returned data frame.
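The simulated values can be retrieved with attr(). The sketch below uses a hand-built mock object in place of a real imi_check result, so the words and numbers are hypothetical; only the column names and the "simulated" attribute follow the description above.

```r
# Mock object standing in for the value returned by imi_check()
res <- data.frame(
  word = c("whale", "ship", "the"),
  imi = c(1.2, 0.4, 0.1),
  deviance = c(1.3, 0.0, -3.2),
  stringsAsFactors = FALSE
)
attr(res, "simulated") <- matrix(
  c(0.9, 1.1, 1.0,
    0.5, 0.3, 0.4,
    0.6, 0.7, 0.5),
  nrow = 3, byrow = TRUE
)

# Retrieve the matrix of simulated values (one row per word)
sims <- attr(res, "simulated")

# Find the word with the largest absolute deviance and inspect
# the range of its simulated IMI values
worst <- which.max(abs(res$deviance))
res$word[worst]
range(sims[worst, ])
```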
Mimno, D., and Blei, D. 2011. Bayesian Checking for Topic Models. Empirical Methods in Natural Language Processing. http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf.
imi_simulate for just the simulation results, mi_check, imi_topic