mi_topic: Mutual information of words and documents in a topic

mi_topic R Documentation

Mutual information of words and documents in a topic

Description

Calculates the mutual information of words and documents within a given topic. This measures the degree to which the topic's estimated word distribution violates the assumption that words are allocated to the topic independently of the documents they occur in.

Usage

mi_topic(m, k, groups = NULL)

Arguments

m

a mallet_model object with the sampling state loaded via load_sampling_state

k

topic number (calculations are only done for one topic at a time)

groups

optional grouping factor for documents. If omitted, the MI over documents is calculated.

Details

The mutual information is given by

MI(W, D|K=k) = \sum_{w, d} p(w, d|k) \log\frac{p(w, d|k)}{p(w|k) p(d|k)}

In the limit of true independence, the fraction in the log is one and the MI is zero. In general, we can rewrite the sum as

\sum_d p(d|k) \sum_w p(w|d, k) \log\frac{p(w|d, k)}{p(w|k)}

which is E_D(KL(W|d, W)), the expected Kullback-Leibler divergence of the conditional word distribution p(w|d, k) from the marginal p(w|k). It can be shown with some algebra that

MI(W, D|k) = \sum_{w} p(w|k) IMI(w|k)

where IMI is defined as in the Details section of imi_topic. This last formula is the one used for the calculation here.
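The algebra is a single step: condition the joint on w rather than d. In the sketch below, IMI(w|k) is taken to be the KL divergence of p(d|w, k) from p(d|k), consistent with imi_topic:

\begin{aligned}
MI(W, D|k) &= \sum_{w, d} p(w|k)\, p(d|w, k) \log\frac{p(d|w, k)}{p(d|k)} \\
&= \sum_{w} p(w|k) \sum_{d} p(d|w, k) \log\frac{p(d|w, k)}{p(d|k)} \\
&= \sum_{w} p(w|k)\, IMI(w|k).
\end{aligned}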

If a grouping of documents is supplied via groups, D is replaced by the grouping factor and the formulas carry over unchanged, now expressing the mutual information between words and document groups within the topic.
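The decomposition and the independence limit can both be verified numerically on a toy joint distribution. The following is a self-contained sketch, shown in Python for a quick standalone check; none of the names below belong to the dfrtopics API:

```python
# Numeric check of MI(W, D|k) = sum_w p(w|k) IMI(w|k) on a small
# synthetic joint distribution p(w, d|k). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3))          # unnormalized p(w, d|k): 4 words x 3 docs
joint /= joint.sum()                # normalize to a joint distribution

p_w = joint.sum(axis=1)             # marginal p(w|k)
p_d = joint.sum(axis=0)             # marginal p(d|k)

# Direct definition: sum_{w,d} p(w,d) log[ p(w,d) / (p(w) p(d)) ]
mi_direct = np.sum(joint * np.log(joint / np.outer(p_w, p_d)))

# IMI route: IMI(w) = sum_d p(d|w) log[ p(d|w) / p(d) ], weighted by p(w)
p_d_given_w = joint / p_w[:, None]
imi = np.sum(p_d_given_w * np.log(p_d_given_w / p_d), axis=1)
mi_from_imi = np.sum(p_w * imi)

assert np.isclose(mi_direct, mi_from_imi)

# In the limit of true independence, the joint factors and the MI is zero:
indep = np.outer(p_w, p_d)
mi_indep = np.sum(indep * np.log(indep / np.outer(p_w, p_d)))
assert np.isclose(mi_indep, 0.0)
```

The same arithmetic applies unchanged when the document axis is replaced by a grouping factor: collapse the columns of the joint by group before computing the marginals.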

Value

A single numeric value, the estimated mutual information.

See Also

imi_topic, imi_check, mi_check


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.