View source: R/jaccardTopics.R
jaccardTopics | R Documentation |
Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.
jaccardTopics( topics, limit.rel, limit.abs, atLeast, progress = TRUE, pm.backend, ncpus )
topics |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
The modified Jaccard Coefficient for two topics \bm z_{i} and \bm z_{j} is calculated by
J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}
where V is the vocabulary size and n_k^{(v)} is the count of assignments of the v-th word to the k-th topic. The threshold vector \bm c is given, per topic, by the larger of the two thresholds implied by the user-specified lower bounds limit.rel and limit.abs. In addition, at least atLeast words per topic are considered for the calculation: if fewer than atLeast words are considered relevant after applying limit.rel and limit.abs, the atLeast most common words of that topic are used to determine topic similarities.
The procedure of determining relevant words is executed for each topic individually. The values wordslimit and wordsconsidered describe the number of relevant words per topic.
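The thresholding and the coefficient described above can be sketched in base R on a toy count matrix. This is an illustrative sketch, not the package's implementation: the helper names `relevant_words` and `modified_jaccard` are hypothetical, the matrix layout (words in rows, topics in columns) is an assumption, and so is the reading of limit.rel as a share of a topic's total word count, so that c_k = max(limit.abs, limit.rel * sum_v n_k^{(v)}).

```r
# Toy count matrix; rows = words, columns = topics (assumed layout).
counts <- matrix(
  c(10, 0, 5, 1,   # topic t1
     8, 7, 0, 0,   # topic t2
     9, 6, 0, 1),  # topic t3
  nrow = 4,
  dimnames = list(paste0("w", 1:4), paste0("t", 1:3))
)

# Hypothetical helper: flag the relevant words of a single topic.
# Assumption: the relative limit is taken relative to the topic's total count.
relevant_words <- function(n, limit_rel, limit_abs, at_least) {
  threshold <- max(limit_abs, limit_rel * sum(n))
  keep <- n > threshold
  if (sum(keep) < at_least) {
    # Fall back to the at_least most common words (ties at the cutoff are kept).
    cutoff <- sort(n, decreasing = TRUE)[at_least]
    keep <- n >= cutoff
  }
  keep
}

# Relevant words are determined for each topic individually.
keep <- apply(counts, 2, relevant_words,
              limit_rel = 0.1, limit_abs = 2, at_least = 1)

# Hypothetical helper: words relevant in both topics divided by
# words relevant in at least one of them.
modified_jaccard <- function(keep_i, keep_j) {
  sum(keep_i & keep_j) / sum(keep_i | keep_j)
}

modified_jaccard(keep[, "t1"], keep[, "t2"])  # share one of three relevant words
modified_jaccard(keep[, "t2"], keep[, "t3"])  # identical sets of relevant words
```

In this sketch the number of `TRUE` entries per column of `keep` plays the role of wordsconsidered; the count before the `at_least` fallback would correspond to wordslimit.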
[named list] with entries
sims
[lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.
wordslimit
[integer] with counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered
[integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.
param
[named list] with parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
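As a rough illustration of working with a lower triangular named similarity matrix such as sims, the base R sketch below builds a toy matrix by hand, extracts the pairwise values, and converts them to dissimilarities. The values are made up for illustration, and filling the diagonal and upper triangle with NA is an assumption about the layout, not something stated above.

```r
# Toy lower triangular similarity matrix for three topics.
# Assumption: diagonal and upper triangle are NA.
topic_names <- c("t1", "t2", "t3")
sims <- matrix(NA_real_, nrow = 3, ncol = 3,
               dimnames = list(topic_names, topic_names))

# lower.tri() walks column by column: (t2, t1), (t3, t1), (t3, t2).
sims[lower.tri(sims)] <- c(1/3, 1/3, 1)

# All pairwise similarities as a plain vector.
pairwise <- sims[lower.tri(sims)]

# as.dist() reads only the lower triangle, so the NA entries do no harm.
# Dissimilarity is taken here as 1 - similarity.
d <- as.dist(1 - sims)
d
```

A `dist` object of this kind can be passed to base R clustering functions such as `hclust()`.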
Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jsTopics(), rboTopics()
Other workflow functions: LDARep(), SCLOP(), dendTopics(), getPrototype(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
jacc

n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]

sim = getSimilarity(jacc)
dim(sim)

# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)

sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))