imi_topic: Instantaneous mutual information of words and documents in a topic

imi_topic    R Documentation

Instantaneous mutual information of words and documents in a topic

Description

Calculates the instantaneous mutual information (IMI) for words and documents within a given topic. This measures the degree to which the words assigned to that topic deviate from being independently distributed over documents. If a document grouping groups is specified, the IMI instead measures this deviation over those groups of documents.

Usage

imi_topic(m, k, words = vocabulary(m), groups = NULL)

Arguments

m

mallet_model object with sampling state loaded via load_sampling_state

k

topic number (calculations are only done for one topic at a time)

words

vector of words to calculate IMI values for.

groups

optional grouping factor with one element for each document. If not NULL, IMIs are calculated over document groups rather than documents.

Details

In ordinary LDA, the distribution of words over topics is independent of documents: that is, in the model's assignment of words to topics, knowing which document a word is in shouldn't tell you anything more about that word than knowing its topic does. In practice, this independence assumption is always violated by the estimated topics. For a given topic k, the IMI measures a word w's contribution to this violation as

H(D|K=k) - H(D|W=w, K=k)

where H denotes the entropy; i.e., the IMI is calculated as

-∑_d p(d|k) log p(d|k) + ∑_d p(d|w, k) log p(d|w, k)

The probabilities are found directly from the counts of word tokens assigned to topic k within each document d, as recorded in the final Gibbs sampling state.
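
As a minimal sketch of this step (an illustration, not the package's internal code), the per-word IMI can be computed from those counts as follows, where n_dk is an assumed vector of counts of all tokens assigned to topic k in each document and n_dwk the corresponding counts for a single word w:

imi_word <- function(n_dwk, n_dk) {
    p_d_k <- n_dk / sum(n_dk)         # p(d|k)
    p_d_wk <- n_dwk / sum(n_dwk)      # p(d|w, k)
    # entropy, with the convention 0 log 0 = 0
    H <- function(p) -sum(ifelse(p > 0, p * log(p), 0))
    H(p_d_k) - H(p_d_wk)              # H(D|K=k) - H(D|W=w, K=k)
}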

The overall independence violation for topic k is the expectation of this quantity over words in that topic,

∑_w p(w|k) (H(D|k) - H(D|w, k))

To obtain this sum, see mi_topic.
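
Continuing the sketch above: given a hypothetical word-by-document matrix counts of topic-k token counts, with one row per word and each row assumed to contain at least one token, the aggregate is the per-word IMI weighted by p(w|k):

mi_from_counts <- function(counts) {
    p_w_k <- rowSums(counts) / sum(counts)     # p(w|k)
    n_dk <- colSums(counts)                    # topic-k tokens per document
    imis <- apply(counts, 1, imi_word, n_dk = n_dk)
    sum(p_w_k * imis)
}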

If a grouping factor groups is given, the IMI is taken not over documents but over groups of documents. For example, suppose the documents are articles drawn from three different periodicals; we might measure the degree to which knowing which periodical an article comes from tells us which words have been assigned to the topic. Sampled word counts are simply summed over the document groups, and the calculation then proceeds with groups in place of documents d in the formulas above.
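
In terms of the sketch above, grouping amounts to collapsing the columns of the counts matrix by group before the same computation, with grp standing for an assumed factor with one element per document:

counts_grouped <- t(rowsum(t(counts), group = grp))   # word-by-group matrix
mi_from_counts(counts_grouped)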

Value

a vector of scores, in the same order as words

References

Mimno, D., and Blei, D. 2011. Bayesian Checking for Topic Models. Empirical Methods in Natural Language Processing. http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf.

See Also

mi_topic for the aggregate score; calc_imi_topic for the calculation itself

Examples

## Not run: 
# obtain imi scores for a topic's top words
library(dplyr)
k <- 15
top_words(m, n=10) %>%
    filter(topic == k) %>%
    mutate(imi=imi_topic(m, k, word))
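
# the same scores, but with IMI taken over groups of documents;
# here journals stands for a hypothetical factor with one element
# per document (e.g., the periodical each article comes from)
top_words(m, n=10) %>%
    filter(topic == k) %>%
    mutate(imi=imi_topic(m, k, word, groups=journals))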

## End(Not run)


