Given a topic model with topics represented as ordered term lists, the coherence may be used to assess the quality of individual topics. This function is an implementation of several of the numerous possible metrics for such assessments. Coherence calculation is sensitive to the content of the reference tcm used for evaluation, which may be created with different parameter settings. Please refer to the details section (or reference section) for information on typical combinations of metric and type of tcm. For more general information on measuring coherence, a starting point is given in the reference section.
x 
A character matrix of top terms per topic, one topic per column, e.g., as returned by the get_top_words() method of a topic model (see the example section).
tcm 
The term cooccurrence matrix, e.g., a sparse Matrix as created by create_tcm() or by crossprod() on a dtm (see the example section), serving as the reference for the coherence calculation.
metrics 
Character vector specifying the metrics to be calculated. Currently the following metrics are implemented: "mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2" (see the details section).
smooth 
Numeric smoothing constant added to the cooccurrence counts to avoid taking the logarithm of zero; by default, a small positive constant.
n_doc_tcm 
The number of documents (or virtual documents, e.g., sliding windows) on which the tcm is based; used to turn the counts in tcm into probabilities.
The currently implemented coherence metrics are described below, including, for each metric, the type of tcm content that has shown good performance in combination with it. For details on how to create a tcm, see the example section. For details on the performance of the metrics, see the resources in the reference section, which served as the basis for the standard settings of the individual metrics. Note that, depending on the use case, settings other than these standard settings for creating the tcm may still be reasonable.
Note that for all currently implemented metrics the tcm is reduced to the top-word space on the basis of the terms in x.
Considering the use case of finding the optimal number of topics among several models, one might calculate the mean score over all topics per model and normalize these mean coherence scores across metrics for direct comparison. Each metric usually opts for a different optimal number of topics. Initial experience suggests that logratio, pmi and npmi usually opt for smaller numbers, whereas the other metrics tend to propose higher numbers.
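As a numerical illustration of the normalization mentioned above, the following Python sketch rescales each metric's mean scores to [0, 1] so different metrics can be compared directly. The candidate models and score values are made-up toy numbers, not output of this package:

```python
# Illustrative sketch: normalizing mean coherence scores so that
# different metrics share a common scale. All numbers are made up.

def min_max_normalize(values):
    # rescale a list of numbers linearly to the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical mean coherence per candidate model (e.g. 5, 10, 15 topics)
mean_scores = {
    "mean_logratio": [-4.2, -3.1, -3.8],
    "mean_pmi":      [ 1.9,  2.4,  2.1],
}
normalized = {metric: min_max_normalize(scores)
              for metric, scores in mean_scores.items()}
# on the common [0, 1] scale the per-model values can now be compared
```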
Implemented metrics:
"mean_logratio"
The logarithmic ratio is calculated as
log(smooth + tcm[x,y]) - log(tcm[y,y])
,
where x and y are term index pairs from a "preceding" term index combination.
Given the indices c(1,2,3), combinations are list(c(2,1), c(3,1), c(3,2))
.
The tcm
should represent the boolean term cooccurrence (internally the actual counts are used)
in the original documents; the metric is therefore intrinsic in the standard use case.
This metric is similar to the UMass metric, however, with a smaller smoothing constant by default
and using the mean for aggregation instead of the sum.
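For illustration, the calculation for a single topic can be sketched in Python as follows. The tcm entries are toy counts (diagonal: document frequencies; off-diagonal: cooccurrence counts), not values from the package:

```python
import math
from itertools import combinations

# Illustrative sketch: the "mean_logratio" score for one topic,
# computed from a toy symmetric count matrix.

def mean_logratio(tcm, term_idx, smooth=1e-12):
    # average log(smooth + tcm[x, y]) - log(tcm[y, y]) over all
    # "preceding" pairs: each term x paired with every term y ranked before it
    scores = [math.log(smooth + tcm[x][y]) - math.log(tcm[y][y])
              for y, x in combinations(term_idx, 2)]
    return sum(scores) / len(scores)

tcm = [[10, 4, 2],
       [ 4, 8, 3],
       [ 2, 3, 6]]
score = mean_logratio(tcm, [0, 1, 2])
```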
"mean_pmi"
The pointwise mutual information is calculated as
log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm)
,
where x and y are term index pairs from an arbitrary term index combination
that subsets the lower or upper triangle of tcm
, e.g. "preceding".
The tcm
should represent term cooccurrences within a boolean sliding window of size 10
(internally probabilities are used)
in an external reference corpus; the metric is therefore extrinsic in the standard use case.
This metric is similar to the UCI metric, however, with a smaller smoothing constant by default
and using the mean for aggregation instead of the sum.
"mean_npmi"
Similar (in terms of all parameter settings, etc.) to the "mean_pmi" metric
but using the normalized pmi instead, which is calculated as
(log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm)) / -log2((tcm[x,y]/n_doc_tcm) + smooth)
.
This metric may perform better than the simpler pmi metric.
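A minimal Python sketch of the pmi and npmi of a single term pair, following the formulas above; the counts and n_doc_tcm are made-up toy values:

```python
import math

# Illustrative sketch: pmi and npmi of a single term pair.

def pmi(tcm, x, y, n_doc_tcm, smooth=1e-12):
    joint = tcm[x][y] / n_doc_tcm + smooth
    return (math.log2(joint)
            - math.log2(tcm[x][x] / n_doc_tcm)
            - math.log2(tcm[y][y] / n_doc_tcm))

def npmi(tcm, x, y, n_doc_tcm, smooth=1e-12):
    # normalize by -log2 of the (smoothed) joint probability
    joint = tcm[x][y] / n_doc_tcm + smooth
    return pmi(tcm, x, y, n_doc_tcm, smooth) / -math.log2(joint)

tcm = [[50, 30],
       [30, 40]]
n_doc_tcm = 100  # number of virtual documents behind the counts
p = pmi(tcm, 0, 1, n_doc_tcm)
n = npmi(tcm, 0, 1, n_doc_tcm)  # bounded in [-1, 1]
```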
"mean_difference"
The difference is calculated as
tcm[x,y]/tcm[x,x] - (tcm[y,y]/n_doc_tcm)
,
where x and y are term index pairs from a "preceding" term index combination.
Given the indices c(1,2,3), combinations are list(c(1,2), c(1,3), c(2,3))
.
The tcm
should represent the boolean term cooccurrence (internally probabilities are used)
in the original documents; the metric is therefore intrinsic in the standard use case.
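For illustration, a Python sketch of this metric for one topic: in probability terms the score is the mean of P(y|x) - P(y) over the "preceding" pairs. All counts are made-up toy values:

```python
from itertools import combinations

# Illustrative sketch: the "mean_difference" score for one topic.

def mean_difference(tcm, term_idx, n_doc_tcm):
    # mean of tcm[x, y]/tcm[x, x] - tcm[y, y]/n_doc_tcm,
    # where x precedes y in the top-word ranking
    scores = [tcm[x][y] / tcm[x][x] - tcm[y][y] / n_doc_tcm
              for x, y in combinations(term_idx, 2)]
    return sum(scores) / len(scores)

tcm = [[10, 4, 2],
       [ 4, 8, 3],
       [ 2, 3, 6]]
score = mean_difference(tcm, [0, 1, 2], n_doc_tcm=100)
```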
"mean_npmi_cosim"
First, the npmi of an individual top word with each of the top words is calculated as in "mean_npmi".
This results in a vector of npmi values for each top word.
On this basis, the cosine similarity between each pair of vectors is calculated.
The tcm
should represent term cooccurrences within a boolean sliding window of size 5
(internally probabilities are used)
in an external reference corpus; the metric is therefore extrinsic in the standard use case.
"mean_npmi_cosim2"
First, a vector of npmi values for each top word is calculated as in "mean_npmi_cosim".
On this basis, the cosine similarity between each vector and the sum of all vectors is calculated
(instead of the similarity between each pair).
The tcm
should represent term cooccurrences within a boolean sliding window of size 110
(internally probabilities are used)
in an external reference corpus; the metric is therefore extrinsic in the standard use case.
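The cosine-similarity step that distinguishes this metric can be sketched in Python as follows; the npmi vectors are made-up values standing in for the per-word vectors computed as in "mean_npmi":

```python
import math

# Illustrative sketch: the cosine-similarity step of "mean_npmi_cosim2",
# comparing each word's npmi vector with the sum of all vectors.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = [[1.0, 0.3, 0.2],
           [0.3, 1.0, 0.4],
           [0.2, 0.4, 1.0]]
total = [sum(col) for col in zip(*vectors)]  # sum of all npmi vectors
# similarity of each word's vector with the summed vector, then averaged
score = sum(cosine(v, total) for v in vectors) / len(vectors)
```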
A numeric matrix
with the coherence scores of the specified metrics
per topic.
The paper mentioned below is the main theoretical basis for this code. Currently, only a selection of the metrics described in that paper is included in this R implementation.
Authors: Röder, Michael; Both, Andreas; Hinneburg, Alexander (2015)
Title: Exploring the Space of Topic Coherence Measures.
In: Xueqi Cheng, Hang Li, Evgeniy Gabrilovich and Jie Tang (Eds.):
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). Shanghai, China, 02.02.2015-06.02.2015.
New York, USA: ACM Press, pp. 399-408.
https://dl.acm.org/citation.cfm?id=2685324
This paper has been implemented by the authors listed above as the Java program "Palmetto".
See https://github.com/dice-group/Palmetto or http://aksw.org/Projects/Palmetto.html.
library(data.table)
library(text2vec)
library(Matrix)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
n_topics = 10
lda_model = LDA$new(n_topics)
fitted = lda_model$fit_transform(dtm, n_iter = 20)
tw = lda_model$get_top_words(n = 10, lambda = 1)
# for demonstration purposes create intrinsic TCM from original documents
# scores might not make sense for metrics that are designed for extrinsic TCM
tcm = crossprod(sign(dtm))
# check coherence
logger = lgr::get_logger('text2vec')
logger$set_threshold('debug')
res = coherence(tw, tcm, n_doc_tcm = N)
res
# example how to create TCM for extrinsic measures from an external corpus
external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)
iterator_ext = itoken(tokens_ext, progressbar = FALSE)
v_ext = create_vocabulary(iterator_ext)
# for reasons of efficiency vocabulary may be reduced to the terms matched in the original corpus
v_ext = v_ext[v_ext$term %in% v$term, ]
# external vocabulary may be pruned depending on the use case
v_ext = prune_vocabulary(v_ext, term_count_min = 5, doc_proportion_max = 0.2)
vectorizer_ext = vocab_vectorizer(v_ext)
# for demonstration purposes a boolean cooccurrence within a sliding window is used
# a size of 10 roughly represents sentence cooccurrence; a size of 110 would, e.g., be paragraph cooccurrence
window_size = 5
tcm_ext = create_tcm(iterator_ext, vectorizer_ext,
                     skip_grams_window = window_size,
                     weights = rep(1, window_size),
                     binary_cooccurence = TRUE)
# add marginal counts in the diagonal (by default only the upper triangle of the tcm is created)
diag(tcm_ext) = attributes(tcm_ext)$word_count
# get the number of sliding windows that serve as virtual documents, i.e. the n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
# coherence scores based on the extrinsic tcm
res_ext = coherence(tw, tcm_ext, n_doc_tcm = n_skip_gram_windows)
res_ext
