imi_check: Posterior predictive checking for individual words
In agoldst/dfrtopics: Tools for exploring topic models of text

imi_check

R Documentation

Posterior predictive checking for individual words

Description

This function provides a way to check the fit of the topic model at the level of individual words in topics by comparing the observed instantaneous mutual information (IMI) for those words to scores derived from simulations from the posterior. Large deviations from the simulated values may indicate a poorer fit. In particular, large negative deviations indicate words which are more uniformly distributed across documents than the model expects (e.g., boilerplate text appearing in every document), and large positive deviations indicate words which are more sharply localized than the model expects.

Usage

imi_check(m, k, words, groups = NULL, n_reps = 20)

Arguments

`m`	`mallet_model` object with sampling state loaded via `load_sampling_state`
`k`	topic number (calculations are only done for one topic at a time)
`words`	vector of words to calculate IMI values for
`groups`	optional grouping factor for documents. If supplied, the IMI values will be for words over groups rather than over individual documents
`n_reps`	number of simulations

Details

For a given topic k, each simulation draws a new topic-specific term-document matrix from the posterior, one document at a time. Since a topic is simply a multinomial distribution over the words, for a given document d we draw the same number of samples from this multinomial as there were words allocated to topic k in d in the model we are checking. Under the assumptions of the model, this is how the distribution p(w, d|k) arises. With this simulated topic-specific term-document matrix in hand, we recalculate the IMI scores for the given words. The process is repeated n_reps times to obtain a reference distribution against which to compare the values from imi_topic.
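The simulation described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the counts are invented, the empirical word proportions stand in for the topic's estimated multinomial, and the helper names `entropy`, `imi`, and `simulate_tdm` are hypothetical. IMI is computed as H(D|k) - H(D|W = w, k), following the definition in Mimno and Blei (2011).

```r
set.seed(1)

# Toy "observed" topic-specific term-document matrix (hypothetical counts of
# tokens assigned to topic k; rows are words, columns are documents).
observed <- rbind(
    whale = c(20,  0,  0,  0),   # sharply localized in one document
    sea   = c( 0, 12, 10,  8),
    ship  = c( 0,  8, 10, 12),
    the   = c( 5,  5,  5,  5)    # perfectly even across documents
)
colnames(observed) <- paste0("d", 1:4)

# Assumption: the empirical word proportions stand in for the topic's
# multinomial over words; a real check would use the model's estimate.
phi_k <- rowSums(observed) / sum(observed)
n_dk <- colSums(observed)   # tokens allocated to topic k in each document

# Entropy in bits of a vector of counts (zero cells contribute nothing).
entropy <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log2(p))
}

# IMI(w, D | k) = H(D | k) - H(D | W = w, k)
imi <- function(tdm, words) {
    h_d <- entropy(colSums(tdm))
    vapply(words, function(w) h_d - entropy(tdm[w, ]), numeric(1))
}

# One simulated matrix: per document, draw as many words from the topic's
# multinomial as that document has tokens in topic k.
simulate_tdm <- function(phi, n_d) {
    tdm <- matrix(0, nrow = length(phi), ncol = length(n_d),
                  dimnames = list(names(phi), names(n_d)))
    for (j in seq_along(n_d)) {
        tdm[, j] <- rmultinom(1, size = n_d[j], prob = phi)
    }
    tdm
}

words <- c("whale", "the")
imi_obs <- imi(observed, words)

# Replicate to build the reference distribution (one row per word).
n_reps <- 20
imi_sim <- replicate(n_reps, imi(simulate_tdm(phi_k, n_dk), words))
```

In this toy setup the observed IMI is 2 bits for `whale` (all of its tokens fall in one of four equally sized documents) and 0 for `the` (spread perfectly evenly), so the simulated values for `whale` can be no larger than the observed one and those for `the` no smaller.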

A reasonable way to make the comparison is to standardize the "actual" IMI values by the mean and standard deviation of the simulated values. Mimno and Blei (2011) call this the "deviance" measure and recommend it over p values because the latter are likely to vanish.
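As a small worked illustration of this standardization, the sketch below uses made-up IMI numbers (they are not output from any model) and hypothetical variable names. It also computes a simple one-sided empirical p value to show how it can hit zero when the observed value lies outside every simulated draw.

```r
# Hypothetical inputs: observed IMI for two words and a matrix of simulated
# IMI values with one row per word and one column per replication.
imi_obs <- c(whale = 1.20, the = 0.02)
imi_sim <- rbind(
    whale = c(0.30, 0.35, 0.25, 0.40, 0.30),
    the   = c(0.10, 0.12, 0.08, 0.11, 0.09)
)

# Standardize each observed value by the mean and standard deviation of its
# own row of simulated values.
dev_score <- (imi_obs - rowMeans(imi_sim)) / apply(imi_sim, 1, sd)

# One-sided empirical p value: share of simulations at least as large as the
# observed value. With few replications this easily collapses to 0.
p_emp <- vapply(names(imi_obs),
                function(w) mean(imi_sim[w, ] >= imi_obs[[w]]),
                numeric(1))
```

Here `dev_score` is positive for `whale` and negative for `the`, while `p_emp` for `whale` is exactly 0, which says nothing about how far outside the simulated range the observed value falls.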

Value

A data frame with word, imi, and deviance columns. The deviance column is the IMI standardized by the mean and standard deviation of the simulated values. The matrix of simulated values (one row per word) is available as the "simulated" attribute of the returned data frame.
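To show how a result of this shape might be inspected, the sketch below builds a mock data frame by hand with the documented columns and attribute. The numbers are hypothetical and the object is not produced by imi_check; only the base R accessor `attr` is relied on. The attribute is read before the data frame is reordered, on the assumption that row subsetting may not carry custom attributes along.

```r
# Mock result with the documented shape (hypothetical numbers, not real output).
sim <- rbind(whale = c(0.30, 0.35, 0.25), the = c(0.10, 0.12, 0.08))
res <- data.frame(word = c("whale", "the"), imi = c(1.20, 0.02),
                  stringsAsFactors = FALSE)
res$deviance <- (res$imi - rowMeans(sim)) / apply(sim, 1, sd)
attr(res, "simulated") <- sim

# Pull out the simulated values first, then rank words by absolute deviance.
sims <- attr(res, "simulated")
ranked <- res[order(-abs(res$deviance)), ]

# Simulated draws for the word that deviates most from the model's expectation.
worst <- ranked$word[1]
sims[worst, ]
```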

References

Mimno, D., and Blei, D. 2011. Bayesian Checking for Topic Models. Empirical Methods in Natural Language Processing. http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf.
