View source: R/semanticCoherence.R
Calculate semantic coherence (Mimno et al. 2011) for an STM model.
semanticCoherence(model, documents, M = 10)
model | the STM object
documents | the STM formatted documents (see
M | the number of top words to consider per topic
Semantic coherence is a metric related to pointwise mutual information that was introduced in a paper by David Mimno, Hanna Wallach and colleagues (see references). The paper details a series of manual evaluations which show that their metric is a reasonable surrogate for human judgment. The core idea is that in semantically coherent models, the words that are most probable under a topic should co-occur within the same document.
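To make the co-occurrence idea concrete, here is a minimal base R sketch of the score as defined by Mimno et al. (2011). For a topic's top words, ordered from most to least probable, it sums log((D(v_m, v_l) + 1) / D(v_l)) over every pair with l < m, where D counts the documents containing the given word or pair of words. The helper name `coherence_sketch` and the toy matrix are assumptions made purely for illustration; this is not the package's internal code, which may differ in details such as the smoothing constant.

```r
# Toy document-term matrix: rows are documents, columns are words.
# All values are made up for illustration.
dtm <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("tax", "budget", "border"))
)

# coherence_sketch() is a hypothetical helper, not part of stm.
# top_words must be ordered from most to least probable under the topic.
coherence_sketch <- function(dtm, top_words) {
  # Binarize: 1 if the word appears in the document at all, 0 otherwise.
  present <- (dtm[, top_words, drop = FALSE] > 0) * 1
  # co_doc[i, j] is the number of documents containing both word i and word j;
  # the diagonal holds each word's document frequency.
  co_doc <- crossprod(present)
  score <- 0
  for (m in seq_along(top_words)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_doc[m, l] + 1) / co_doc[l, l])
    }
  }
  score
}

coherent   <- coherence_sketch(dtm, c("tax", "budget"))     # words share documents
incoherent <- coherence_sketch(dtm, c("budget", "border"))  # words never co-occur
print(c(coherent = coherent, incoherent = incoherent))
```

In this toy corpus the pair that shares documents scores higher than the pair that never appears together, which is the behaviour the metric is meant to reward.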
One of our observations in Roberts et al. (2014) was that semantic coherence alone is relatively easy to achieve by having only a couple of topics, all of which are dominated by the most common words. We therefore suggest that users also consider exclusivity, which provides a natural counterpoint.
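One way to look at the two measures side by side is sketched below. It assumes the stm package and its bundled gadarian data are installed, along with whatever text-processing packages textProcessor depends on. The object names `fit`, `coh`, `exc` and `quality` are illustrative only, and the model is stopped after 5 EM iterations purely to keep the run short.

```r
library(stm)

# Prepare the gadarian survey responses that ship with stm.
processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata = gadarian, verbose = FALSE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

set.seed(2138)
# max.em.its is kept very low so the sketch runs quickly;
# fit real models to convergence.
fit <- stm(out$documents, out$vocab, K = 3,
           prevalence = ~treatment + s(pid_rep), data = out$meta,
           max.em.its = 5, verbose = FALSE)

coh <- semanticCoherence(fit, out$documents)  # one value per topic
exc <- exclusivity(fit)                       # one value per topic

# Collect both measures in one table so topics can be compared on each.
quality <- data.frame(topic = seq_along(coh), coherence = coh, exclusivity = exc)
print(quality)
```

Reading the two columns together guards against favouring a model whose topics look coherent only because they all share the same very common words.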
This function is currently marked with the keyword internal because it does not have much error checking.
a numeric vector containing semantic coherence for each topic
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural topic models for open ended survey responses." American Journal of Political Science, 58(4), 1064-1082.
searchK
plot.searchK
exclusivity
library(stm)
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
# Maximum EM iterations set very low so the example will run quickly.
# Run your models to convergence!
mod.out <- stm(docs, vocab, 3, prevalence = ~treatment + s(pid_rep), data = meta,
               max.em.its = 5)
semanticCoherence(mod.out, docs)
stm v1.3.0 (2017-09-08) successfully loaded. See ?stm for help.
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Stemming...
Creating Output...
Removing 640 of 1102 terms (640 of 3789 tokens) due to frequency
Your corpus now has 341 documents, 462 terms and 3149 tokens.
Beginning Spectral Initialization
Calculating the gram matrix...
Finding anchor words...
...
Recovering initialization...
....
Initialization complete.
.................................................................................................................
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 1 (approx. per word bound = -5.623)
.................................................................................................................
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 2 (approx. per word bound = -5.508, relative change = 2.044e-02)
.................................................................................................................
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 3 (approx. per word bound = -5.468, relative change = 7.376e-03)
.................................................................................................................
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 4 (approx. per word bound = -5.451, relative change = 3.149e-03)
.................................................................................................................
Completed E-Step (0 seconds).
Completed M-Step.
Model Terminated Before Convergence Reached
[1] -108.80456 -115.22948 -99.19009