semanticCoherence: Semantic Coherence

Description

Calculate semantic coherence (Mimno et al. 2011) for an STM model.

Usage

semanticCoherence(model, documents, M = 10)

Arguments

model

the STM object

documents

the STM formatted documents (see stm for format).

M

the number of top words to consider per topic

Details

Semantic coherence is a metric related to pointwise mutual information that was introduced in a paper by David Mimno, Hanna Wallach and colleagues (see References). The paper details a series of manual evaluations which show that their metric is a reasonable surrogate for human judgment. The core idea is that in a semantically coherent model, the words that are most probable under a topic should frequently co-occur within the same document.
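For reference, the metric as defined in Mimno et al. (2011) can be written as below. The notation follows the paper rather than this package: V^{(t)} = (v_1, ..., v_M) is the list of the M most probable words in topic t, D(v) is the number of documents containing word v at least once, and D(v, v') is the number of documents containing both words. This is the published definition; the internals of semanticCoherence may differ in details such as smoothing.

```latex
C\left(t; V^{(t)}\right) = \sum_{m=2}^{M} \sum_{l=1}^{m-1}
  \log \frac{D\left(v_m^{(t)}, v_l^{(t)}\right) + 1}{D\left(v_l^{(t)}\right)}
```

Each term compares how often a lower-ranked top word appears alongside a higher-ranked one with how often the higher-ranked word appears at all, so values closer to zero (or above it) indicate top words that tend to appear together.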

One of our observations in Roberts et al. (2014) was that semantic coherence alone is relatively easy to achieve by having only a few topics, all dominated by the most common words. We therefore suggest that users also consider exclusivity, which provides a natural counterpoint.
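The following toy sketch illustrates why common words score well. It is not the package's implementation: toy_coherence is a hypothetical helper written in base R, the six-word corpus is made up, and the score follows the Mimno et al. (2011) definition (for each pair of top words, the log of the co-document count plus one, divided by the document count of the higher-ranked word).

```r
# Made-up corpus: 4 documents (rows) by 6 words (columns), entries are counts.
dtm <- matrix(c(1, 1, 1, 1, 1, 0,
                1, 2, 1, 2, 0, 1,
                2, 1, 1, 0, 1, 1,
                1, 1, 2, 1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL,
                              c("said", "people", "like", "tax", "budget", "vote")))

# Hypothetical helper (not part of stm): coherence of a single topic given
# its top words, ordered from most to least probable. Assumes every top word
# occurs in at least one document.
toy_coherence <- function(dtm, top_words) {
  present <- dtm[, top_words, drop = FALSE] > 0  # does the word occur in the document?
  co_docs <- crossprod(present * 1)              # off-diagonal: D(v_i, v_j); diagonal: D(v_j)
  score <- 0
  for (i in seq_along(top_words)[-1]) {          # lower-ranked word
    for (j in seq_len(i - 1)) {                  # each higher-ranked word
      score <- score + log((co_docs[i, j] + 1) / co_docs[j, j])
    }
  }
  score
}

# A "topic" made of words that appear in every document...
common_topic <- toy_coherence(dtm, c("said", "people", "like"))
# ...versus one made of more specific words that co-occur less often.
specific_topic <- toy_coherence(dtm, c("tax", "budget", "vote"))

common_topic > specific_topic
```

Because the ubiquitous words co-occur in every document, their topic receives the higher score even though it says little about content, which is the behavior that motivates pairing coherence with exclusivity.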

This function is currently marked with the keyword internal because it does not have much error checking.

Value

a numeric vector containing semantic coherence for each topic

References

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.

Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural topic models for open ended survey responses." American Journal of Political Science, 58(4), 1064-1082. http://goo.gl/0x0tHJ

See Also

searchK plot.searchK exclusivity

Examples

library(stm)

# Preprocess the gadarian survey responses that ship with stm.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

set.seed(02138)
# Maximum EM iterations set very low so the example will run quickly.
# Run your models to convergence!
mod.out <- stm(docs, vocab, 3, prevalence = ~treatment + s(pid_rep), data = meta,
               max.em.its = 5)
# Returns one coherence value per topic, using the default M = 10 top words.
semanticCoherence(mod.out, docs)

Example output

stm v1.3.0 (2017-09-08) successfully loaded. See ?stm for help.
Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Stemming... 
Creating Output... 
Removing 640 of 1102 terms (640 of 3789 tokens) due to frequency 
Your corpus now has 341 documents, 462 terms and 3149 tokens.Beginning Spectral Initialization 
	 Calculating the gram matrix...
	 Finding anchor words...
 	...
	 Recovering initialization...
 	....
Initialization complete.
.................................................................................................................
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 1 (approx. per word bound = -5.623) 
.................................................................................................................
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 2 (approx. per word bound = -5.508, relative change = 2.044e-02) 
.................................................................................................................
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 3 (approx. per word bound = -5.468, relative change = 7.376e-03) 
.................................................................................................................
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 4 (approx. per word bound = -5.451, relative change = 3.149e-03) 
.................................................................................................................
Completed E-Step (0 seconds). 
Completed M-Step. 
Model Terminated Before Convergence Reached 
[1] -108.80456 -115.22948  -99.19009

stm documentation built on Dec. 18, 2019, 1:47 a.m.