jaccardTopics: Pairwise Jaccard Coefficients

View source: R/jaccardTopics.R

jaccardTopics R Documentation

Pairwise Jaccard Coefficients

Description

Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.

Usage

jaccardTopics(
  topics,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus
)

Arguments

topics

[named matrix]
The word counts, with vocabulary words in rows and topics in columns.

limit.rel

[0,1]
A relative lower bound for which words are taken into account. A word is considered relevant for a topic if its count is higher than limit.rel multiplied by the total count of that topic. Default is 1/500.

limit.abs

[integer(1)]
An absolute lower bound for which words are taken into account. A word is considered relevant for a topic if its count is higher than limit.abs. Default is 10.

atLeast

[integer(1)]
The minimum number of words that are considered relevant for each topic. Default is 0.

progress

[logical(1)]
Should a progress bar be shown? Turning it off can make the calculation significantly faster. Default is TRUE. If pm.backend is set, the computation is parallelized and no progress bar is shown.

pm.backend

[character(1)]
One of "multicore", "socket" or "mpi". If pm.backend is set, parallelStart is called before computation is started and parallelStop is called after.

ncpus

[integer(1)]
Number of (physical) CPUs to use. If pm.backend is passed, default is determined by availableCores.
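
How limit.rel and limit.abs combine can be illustrated with a minimal base R sketch. The toy count matrix and the variable names thresholds and relevant are assumptions made for illustration and are not part of the package; the rule itself (each topic uses the larger of the two bounds) follows the Details section.

```r
# Toy count matrix (made-up values): 5 words in rows, 2 topics in columns.
topics <- matrix(
  c(6000, 3000, 900, 85, 15,   # topic t1, total count 10000
      40,   30,  20,  8,  2),  # topic t2, total count 100
  nrow = 5,
  dimnames = list(paste0("w", 1:5), c("t1", "t2"))
)
limit.rel <- 1/500
limit.abs <- 10

# Per-topic threshold: the larger of the absolute bound and the
# relative bound (limit.rel times the topic's total count).
thresholds <- pmax(limit.abs, limit.rel * colSums(topics))

# A word is relevant for a topic if its count exceeds that topic's threshold.
relevant <- sweep(topics, 2, thresholds, ">")
colSums(relevant)
```

In this toy setup the relative bound is the binding one for t1 and the absolute bound for t2, so word w5 with a count of 15 in t1 passes limit.abs but is still not relevant for t1.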

Details

The modified Jaccard Coefficient for two topics \bm z_{i} and \bm z_{j} is calculated by

J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}

where V is the vocabulary size and n_k^{(v)} is the number of assignments of the v-th word to the k-th topic. The threshold vector \bm c is the maximum of the two user-given lower bounds: for each topic, the larger of limit.abs and limit.rel multiplied by the topic's total count. In addition, at least atLeast words per topic are considered for the calculation. That is, if fewer than atLeast words are relevant after applying limit.rel and limit.abs, the atLeast most common words of that topic are used to determine topic similarities.

Relevant words are determined for each topic individually. The values wordslimit and wordsconsidered give the number of relevant words per topic before and after applying atLeast, respectively.
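
The calculation described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the function name jaccard_sketch and the toy matrix are hypothetical, ties at the atLeast cutoff are all kept, and the sketch keeps the diagonal of the similarity matrix, which the package's output may not do.

```r
# Hypothetical helper (not part of ldaPrototype): modified Jaccard
# similarities for a count matrix with words in rows and topics in columns.
jaccard_sketch <- function(topics, limit.rel = 1/500, limit.abs = 10, atLeast = 0) {
  # Threshold c_k per topic: the larger of the absolute and relative bound.
  thresholds <- pmax(limit.abs, limit.rel * colSums(topics))
  relevant <- sweep(topics, 2, thresholds, ">")
  wordslimit <- colSums(relevant)

  # Fallback: a topic with fewer than atLeast relevant words uses its
  # atLeast most frequent words instead (ties at the cutoff are kept).
  for (k in which(wordslimit < atLeast)) {
    cutoff <- sort(topics[, k], decreasing = TRUE)[atLeast]
    relevant[, k] <- topics[, k] >= cutoff
  }

  # Pairwise |intersection| / |union| of the relevant word sets.
  inter <- crossprod(relevant + 0)           # K x K intersection sizes
  sizes <- colSums(relevant)
  union <- outer(sizes, sizes, "+") - inter  # K x K union sizes
  sims <- inter / union
  sims[upper.tri(sims)] <- NA                # blank the strict upper triangle
  list(sims = sims, wordslimit = wordslimit, wordsconsidered = sizes)
}

# Toy data (made-up counts): 6 words, 3 topics.
topics <- matrix(
  c(50, 40, 30,  0, 0, 5,
     0, 35, 25, 45, 0, 3,
     4,  0,  0,  6, 9, 2),
  nrow = 6,
  dimnames = list(paste0("w", 1:6), c("t1", "t2", "t3"))
)
res <- jaccard_sketch(topics, atLeast = 2)
res$sims
```

In this toy example no word of t3 exceeds the threshold of 10, so its two most frequent words (w5 and w4) are used instead. This is why wordsconsidered differs from wordslimit for that topic.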

Value

[named list] with entries

sims

[lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.

wordslimit

[integer] with counts of words determined as relevant based on limit.rel and limit.abs.

wordsconsidered

[integer] with counts of words considered for the similarity calculation. May differ from wordslimit if atLeast is greater than zero.

param

[named list] with parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.

See Also

Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jsTopics(), rboTopics()

Other workflow functions: LDARep(), SCLOP(), dendTopics(), getPrototype(), mergeTopics()

Examples

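library(ldaPrototype)

# Fit 4 LDA replications with 10 topics each on the package's Reuters example data
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# Merge the topics of all replications into one word-by-topic count matrix
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
jacc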

# Topics for which atLeast changed the number of words considered
n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]

sim = getSimilarity(jacc)
dim(sim)

# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)

sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))


JonasRieger/ldaPrototype documentation built on Feb. 5, 2023, 6:45 p.m.