tw_smooth_normalize: Scoring methods for words in topics

tw_smooth_normalize    R Documentation

Scoring methods for words in topics

Description

The "raw" final sampling state of words in topics may be transformed into either estimated probabilities or other kinds of salience scores. These methods produce functions that operate on a topic-word matrix. They can be passed as the weighting parameter to top_words.

Usage

tw_smooth_normalize(m)

tw_smooth(m)

tw_blei_lafferty(m)

tw_sievert_shirley(m, lambda = 0.6)

Arguments

m

a mallet_model object

lambda

For tw_sievert_shirley, the weighting parameter λ, by default 0.6.

Details

The basic method (tw_smooth_normalize) is to recast the sampled word counts as probabilities by adding the estimated hyperparameter β and then normalizing rows so they add to 1. This is equivalent to mallet.topic.words with smoothed and normalized set to TRUE. Because the same β is added to every count and each row is divided by a single positive constant, this does not change the relative ordering of words within topics.

tw_smooth simply adds β but does not normalize.
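As a rough illustration of the arithmetic behind these two methods, here is a minimal base R sketch on a toy matrix. The counts and the value of beta are invented, and the code does not call the package functions themselves; it only mirrors the smoothing and normalization steps described above.

```r
# Toy topic-word count matrix: 2 topics (rows) x 3 words (columns).
# The counts and beta are invented for illustration only.
tw <- matrix(c(8, 1, 0,
               0, 3, 6),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("t1", "t2"), c("apple", "bread", "chalk")))
beta <- 0.5  # stand-in for the estimated hyperparameter

# Smoothing only: add beta to every count (the idea behind tw_smooth)
smoothed <- tw + beta

# Smoothing plus row normalization (the idea behind tw_smooth_normalize);
# dividing by a length-nrow vector divides each row by its own sum
probs <- smoothed / rowSums(smoothed)

rowSums(probs)  # each row now sums to 1

# The ranking of words within a topic is the same before and after
identical(order(tw["t1", ], decreasing = TRUE),
          order(probs["t1", ], decreasing = TRUE))
```

Note that smoothing also guarantees that no probability is exactly zero, which matters for the log-based scores.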

A method that can re-rank words has been given by Blei and Lafferty: the score for word v in topic t is

p(t,v) log(p(t,v) / ∏_k p(k,v)^(1/K))

where K is the number of topics. The denominator is the geometric mean of the word's probability across all topics, so the score gives more weight to words that are ranked highly in fewer topics.
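The Blei and Lafferty score can be sketched in a few lines of base R. The probabilities below are invented, and the code is only an illustration of the formula, not the package's own implementation; it assumes every probability is strictly positive (as it is after smoothing), since the score takes logarithms.

```r
# Toy topic-word probabilities: rows are topics, columns are words.
# The values are invented for illustration; each row sums to 1.
p <- matrix(c(0.5, 0.4, 0.1,
              0.5, 0.1, 0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("the", "whale", "ship")))

# Log of the geometric mean of each word's probability over the K topics,
# i.e. log of prod_k p(k, v)^(1/K), computed as the column mean of log p
log_gm <- colMeans(log(p))

# score(t, v) = p(t, v) * (log p(t, v) - log geometric mean for v)
score <- p * sweep(log(p), 2, log_gm)

# "the" is equally probable in both topics, so its score is 0
score[, "the"]

# In topic t1, "whale" now outranks "the" despite its lower probability
names(sort(score["t1", ], decreasing = TRUE))
```

In this toy case the raw probabilities rank "the" first in topic t1, while the score moves "whale" to the top because "whale" is far less probable in the other topic.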

Another method is the "relevance" score of Sievert and Shirley: in this case the score is given by

λ log(p(t,v)) + (1 - λ) log(p(t,v) / p(v))

where p(v) is the overall probability of word v in the corpus and λ is a weighting parameter, 0.6 by default, that determines how strongly words common in the whole corpus are penalized: the smaller λ is, the heavier the penalty.
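The effect of λ can be seen in a small base R sketch. The helper relevance() is hypothetical and written only for this illustration, and both the topic-word probabilities and the corpus-wide word probabilities pv are invented; how the package itself derives p(v) is not shown here.

```r
# Toy topic-word probabilities (rows are topics); values are invented.
p <- matrix(c(0.5, 0.4, 0.1,
              0.5, 0.1, 0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("the", "whale", "ship")))

# Overall word probabilities in the corpus; also invented for illustration.
pv <- c(the = 0.5, whale = 0.25, ship = 0.25)

# Hypothetical helper, not part of the package:
# lambda * log p(t, v) + (1 - lambda) * log(p(t, v) / p(v))
relevance <- function(p, pv, lambda = 0.6) {
  lift <- sweep(p, 2, pv, "/")  # divide each column v by p(v)
  lambda * log(p) + (1 - lambda) * log(lift)
}

r1 <- relevance(p, pv, lambda = 1)  # plain log probability
r0 <- relevance(p, pv, lambda = 0)  # log lift only

names(sort(r1["t1", ], decreasing = TRUE))  # ranking by probability
names(sort(r0["t1", ], decreasing = TRUE))  # ranking by lift
```

With λ = 1 the ranking is the same as the ranking by probability, while with λ = 0 the corpus-wide common word "the" drops below "whale" in topic t1. Intermediate values such as the default 0.6 trade off the two.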

Value

a function of one variable, to be applied to the topic-word sparse matrix.

References

D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009. http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf.

C. Sievert and K.E. Shirley. LDAvis: A method for visualizing and interpreting topics. http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf.

Examples

## Not run: top_words(m, n=10, weighting=tw_blei_lafferty(m))
## Not run: tw_smooth_normalize(m)(topic_words(m))


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.