imi_topic (R Documentation)
Calculates the instantaneous mutual information (IMI) for words and documents within a given topic. This measures the degree to which words assigned to that topic are independently distributed over documents. With a specified document grouping groups, it instead measures the degree to which words are distributed independently over those groups of documents.
Usage

imi_topic(m, k, words = vocabulary(m), groups = NULL)
Arguments

m

k
topic number (calculations are only done for one topic at a time)

words
vector of words to calculate IMI values for

groups
optional grouping factor with one element for each document. If not NULL, IMIs are calculated over document groups rather than over individual documents.
Details

In ordinary LDA, the distribution of words over topics is independent of documents: that is, in the model's assignment of words to topics, knowing which document a word is in shouldn't tell you anything more about that word than knowing its topic does. In practice, this independence assumption is always violated by the estimated topics. For a given topic k, the IMI measures a given word w's contribution to this violation as
H(D|K=k) - H(D|W=w, K=k)
where H denotes the entropy; i.e., the IMI is calculated as
-∑_d p(d|k) \log p(d|k) + ∑_d p(d|w, k) \log p(d|w, k)
The probabilities are simply found from the counts of word tokens within documents d assigned to topic k, as these are recorded in the final Gibbs sampling state.
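As a concrete sketch of this calculation (the toy count matrix and word names below are assumptions for illustration only; in the package, the counts come from the final Gibbs sampling state), the per-word IMI can be computed from a word-by-document matrix of topic-k token counts:

```r
# Hypothetical toy counts: rows are words, columns are documents,
# entries are tokens of each word assigned to topic k in each document
n <- matrix(c(4, 1, 0,
              2, 2, 2,
              0, 3, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("river", "bank", "money"), NULL))

entropy <- function(p) {
  p <- p[p > 0]              # take 0 * log(0) as 0
  -sum(p * log(p))
}

# H(D | K = k): entropy of documents given the topic
h_d_k <- entropy(colSums(n) / sum(n))

# IMI(w) = H(D | k) - H(D | w, k), one value per word
imi <- apply(n, 1, function(counts) h_d_k - entropy(counts / sum(counts)))
```

Words concentrated in fewer documents than the topic as a whole receive high IMI; words spread across documents much like the topic overall receive IMI near zero.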
The overall independence violation for topic k is the expectation of this quantity over words in that topic,
∑_w p(w|k) (H(D|k) - H(D|w, k))
For this sum, see mi_topic.
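A self-contained sketch of this expectation, again using an assumed toy count matrix rather than the package's internal representation:

```r
# Toy word-by-document counts of tokens assigned to topic k (hypothetical)
n <- matrix(c(4, 1, 0,
              2, 2, 2,
              0, 3, 3), nrow = 3, byrow = TRUE)

entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

h_d_k <- entropy(colSums(n) / sum(n))                        # H(D | k)
imi   <- apply(n, 1, function(ct) h_d_k - entropy(ct / sum(ct)))

# Topic-level violation: expectation of per-word IMI under p(w | k)
p_w_given_k <- rowSums(n) / sum(n)
mi <- sum(p_w_given_k * imi)
```

Because this expectation is a mutual information, mi is always nonnegative, even though the IMI of an individual word can dip below zero when that word is spread more evenly over documents than the topic overall.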
If a grouping factor groups is given, the IMI is taken not over documents but over groups of documents. For example, suppose the documents are articles drawn from three different periodicals; we might measure the degree to which knowing which periodical a document comes from tells us about which words have been assigned to the topic. Sampled word counts are simply summed over the document groups, and the calculation then proceeds with groups in place of documents d in the formulas above.
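The grouped variant can be sketched the same way (toy counts and grouping are hypothetical; the real function takes groups as a factor with one element per document):

```r
# Toy word-by-document counts of tokens assigned to topic k (hypothetical)
n <- matrix(c(4, 1, 0,
              2, 2, 2,
              0, 3, 3), nrow = 3, byrow = TRUE)

# Hypothetical grouping: documents 1-2 from periodical A, document 3 from B
groups <- factor(c("A", "A", "B"))

# Sum counts over document groups: result is word-by-group
n_grp <- t(rowsum(t(n), groups))

entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
h_g_k   <- entropy(colSums(n_grp) / sum(n_grp))              # H(G | k)
imi_grp <- apply(n_grp, 1, function(ct) h_g_k - entropy(ct / sum(ct)))
```

The only change from the ungrouped calculation is the aggregation step; the entropies are then taken over groups instead of documents.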
Value

A vector of IMI scores, in the same order as words.
References

Mimno, D., and Blei, D. 2011. "Bayesian Checking for Topic Models." Empirical Methods in Natural Language Processing. http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf
See Also

mi_topic, and calc_imi_topic for the calculation.
Examples

## Not run: 
# obtain imi scores for a topic's top words
library(dplyr)
k <- 15
top_words(m, n=10) %>%
    filter(topic == k) %>%
    mutate(imi=imi_topic(m, k, word))
## End(Not run)