imi_check: Posterior predictive checking for individual words

imi_check R Documentation

Posterior predictive checking for individual words

Description

This function checks the fit of the topic model at the level of individual words within a topic. It compares the observed instantaneous mutual information (IMI) for those words to IMI scores derived from simulations from the posterior. Large deviations from the simulated values may indicate a poorer fit. In particular, large negative deviations indicate words that are more uniformly distributed across documents than the model expects (e.g., boilerplate text appearing in every document), and large positive deviations indicate words that are more sharply localized than the model expects.

Usage

imi_check(m, k, words, groups = NULL, n_reps = 20)

Arguments

m

mallet_model object with sampling state loaded via load_sampling_state

k

topic number (calculations are only done for one topic at a time)

words

vector of words to calculate IMI values for

groups

optional grouping factor for documents. If supplied, the IMI values will be for words over groups rather than over individual documents

n_reps

number of simulations

Details

For a given topic k, a simulation draws a new topic-specific term-document matrix from the posterior. Since a topic is simply a multinomial distribution over the words, for a given document d we draw the same number of samples from this multinomial as there were words allocated to topic k in d in the model we are checking. Under the assumptions of the model, this is how the distribution p(w, d|k) arises. With this simulated topic-specific term-document matrix in hand, we recalculate the IMI scores for the given words. The process is replicated n_reps times to obtain a reference distribution against which the values from imi_topic can be compared.
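The simulation described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: vocab, phi_k, n_dk, simulate_tdm, entropy, and imi_from_tdm are hypothetical names, the numbers are invented, and the IMI formula used here (entropy of documents within the topic minus entropy of documents given the word) is an assumption about what imi_topic computes.

```r
set.seed(42)  # for a reproducible illustration

# Hypothetical inputs (not taken from any fitted model):
vocab <- c("whale", "ship", "sea", "the")
phi_k <- c(0.1, 0.2, 0.3, 0.4)   # topic k's distribution over the vocabulary
n_dk  <- c(40L, 10L, 25L)        # tokens assigned to topic k in each of 3 documents

# One simulated topic-specific term-document matrix: for each document,
# draw as many words from the topic as the model assigned to it there.
simulate_tdm <- function(phi, n) {
  vapply(n, function(n_d) rmultinom(1, n_d, phi)[, 1], integer(length(phi)))
}

# Entropy in bits, skipping zero cells.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Assumed IMI of each word: H(D | k) - H(D | w, k).
# Words that never occur in the simulated matrix get NA.
imi_from_tdm <- function(tdm) {
  h_d <- entropy(colSums(tdm) / sum(tdm))
  h_d_given_w <- apply(tdm, 1, function(x) {
    if (sum(x) == 0) NA_real_ else entropy(x / sum(x))
  })
  h_d - h_d_given_w
}

# Replicate to build a reference distribution for each word.
n_reps <- 20
sims <- replicate(n_reps, imi_from_tdm(simulate_tdm(phi_k, n_dk)))
rownames(sims) <- vocab
dim(sims)  # one row per word, one column per replicate
```

Each simulated matrix keeps the per-document token counts of the model being checked, so any difference from the observed IMI reflects how words are spread over documents rather than how long the documents are.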

A reasonable way to make the comparison is to standardize the "actual" IMI values by the mean and standard deviation of the simulated values. Mimno and Blei (2011) call this the "deviance" measure, recommending it over p values because the latter are likely to vanish.
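As a small worked example of this standardization, the sketch below computes a deviance for two words from invented numbers. The names imi_obs, sims, and deviance are hypothetical, and the use of sd (the sample standard deviation) is an assumption rather than a statement about the package's internals.

```r
# Invented observed IMI values for two words in one topic.
imi_obs <- c(the = 0.02, whale = 0.97)

# Invented simulated IMI values: one row per word, one column per replicate.
sims <- rbind(
  the   = c(0.10, 0.12, 0.08, 0.11),
  whale = c(0.85, 0.95, 0.88, 0.92)
)

# Standardize each observed value by the mean and standard deviation
# of that word's simulated values.
deviance <- (imi_obs - rowMeans(sims)) / apply(sims, 1, sd)
deviance
```

Here "the" falls well below its simulated values (more evenly spread over documents than the model expects), while "whale" falls above them (more localized than the model expects).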

Value

A data frame with word, imi, and deviance columns. The deviance column is the IMI standardized by the mean and standard deviation of the simulated values. The matrix of simulated values (one row per word) is available as the "simulated" attribute of the returned data frame.
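The following sketch shows one way to work with a result of this shape. Because a real call needs a fitted mallet_model, the object res is a mock built by hand to match the description above; every value in it is invented.

```r
# Mock object with the documented shape; all values are invented.
res <- data.frame(
  word = c("the", "whale", "ship"),
  imi = c(0.02, 0.97, 0.40),
  deviance = c(-6.1, 1.8, 0.3),
  stringsAsFactors = FALSE
)
attr(res, "simulated") <- matrix(runif(3 * 20), nrow = 3)

# Words ordered by how far they fall from the simulated reference values.
res[order(abs(res$deviance), decreasing = TRUE), c("word", "deviance")]

# Simulated IMI values behind the first word (row 1 of the matrix).
attr(res, "simulated")[1, ]
```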

References

Mimno, D., and Blei, D. 2011. Bayesian Checking for Topic Models. Empirical Methods in Natural Language Processing. http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf.

See Also

imi_simulate for just the simulation results, mi_check, imi_topic


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.