BTM                                                         R Documentation

Description
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window.
BTM models the biterm occurrences in a corpus (unlike LDA, which models the word occurrences in a document).
It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as P(b) = ∑_z P(wi|z) * P(wj|z) * P(z), where the sum runs over the k topics you want to extract.
Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates for P(w|z) = phi and P(z) = theta.
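To make the biterm probability concrete, here is a small sketch with made-up values for phi and theta (toy numbers, not produced by the package), evaluating P(b) for a single biterm:

## Toy illustration of P(b) = sum_z P(wi|z) * P(wj|z) * P(z)
## phi and theta below are made-up values, not estimated by BTM
phi <- matrix(c(0.5, 0.3, 0.2,    # P(w|z) for topic 1
                0.1, 0.2, 0.7),   # P(w|z) for topic 2
              ncol = 2, dimnames = list(c("w1", "w2", "w3"), NULL))
theta <- c(0.6, 0.4)              # P(z) for the k = 2 topics
## probability of the biterm b = (w1, w2): sum over the k topics
sum(phi["w1", ] * phi["w2", ] * theta)
## [1] 0.098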
Usage

BTM(
  data,
  k = 5,
  alpha = 50/k,
  beta = 0.01,
  iter = 1000,
  window = 15,
  background = FALSE,
  trace = FALSE,
  biterms,
  detailed = FALSE
)
Arguments

data: a tokenised data frame containing one row per token, with 2 columns: the document identifier and the token (see the input sketch after this list).
k: integer with the number of topics to identify.
alpha: numeric, indicating the symmetric dirichlet prior probability of a topic P(z). Defaults to 50/k.
beta: numeric, indicating the symmetric dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.
iter: integer with the number of iterations of Gibbs sampling.
window: integer with the window size for biterm extraction. Defaults to 15.
background: logical; if set to TRUE, the first topic is set to a background topic that equals the empirical word distribution. Defaults to FALSE.
trace: logical indicating to print out the evolution of the Gibbs sampling iterations. Defaults to FALSE.
biterms: optionally, your own set of biterms to use for modelling.
detailed: logical indicating to return detailed output containing as well the vocabulary and the biterms used to construct the model. Defaults to FALSE.
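As a sketch of the expected input format, the call below uses a hypothetical two-document toy corpus; the column names doc_id and token are illustrative, with the document identifier in the first column and the token in the second, following the examples further down:

## Minimal input sketch: a hypothetical toy corpus with one row per token
library(BTM)
x <- data.frame(doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
                token  = c("apple", "pear", "fruit", "bike", "wheel", "saddle"),
                stringsAsFactors = FALSE)
set.seed(321)
model <- BTM(x, k = 2, beta = 0.01, iter = 100, trace = 10)
terms(model)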
Value

An object of class BTM which is a list containing the elements below (a short inspection sketch follows the list):

model: a pointer to the C++ BTM model
K: the number of topics
W: the number of tokens in the data
alpha: the symmetric dirichlet prior probability of a topic P(z)
beta: the symmetric dirichlet prior probability of a word given the topic P(w|z)
iter: the number of iterations of Gibbs sampling
background: indicator if the first topic is set to the background topic that equals the empirical word distribution
theta: a vector with the topic probability P(z), which is determined by the overall proportions of biterms in it
phi: a matrix of dimension W x K with one row for each token in the data. This matrix contains the probability of the token given the topic P(w|z). The rownames of the matrix indicate the token w
vocab: a data.frame with columns token and freq indicating the frequency of occurrence of the tokens in data. Only provided in case argument detailed is set to TRUE
biterms: the result of a call to terms with type set to 'biterms', containing all the biterms used in the model. Only provided in case argument detailed is set to TRUE
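A short sketch of how these returned elements can be inspected; it assumes a model fitted with detailed = TRUE as in the examples below (where the vocabulary is accessed as model$vocabulary):

## Sketch: inspecting a fitted BTM object (assumes `model` fitted with detailed = TRUE)
model$theta                                    # topic proportions P(z)
dim(model$phi)                                 # W x K matrix with P(w|z)
head(sort(model$phi[, 1], decreasing = TRUE))  # most probable tokens of the first topic
head(model$vocabulary)                         # token frequencies (detailed = TRUE only)
head(terms(model, "biterms")$biterms)          # biterms used in the model (detailed = TRUE only)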
Note

A biterm is defined as a pair of words co-occurring in the same text window.
If you have, as an example, a document with the sequence of words 'A B C B' and assume the window size is set to 3, there are two text windows which can generate biterms, namely the text window 'A B C' with biterms 'A B', 'B C', 'A C' and the text window 'B C B' with biterms 'B C', 'C B', 'B B'.
A biterm is an unordered word pair where 'B C' = 'C B'. Thus, the document 'A B C B' will have the following biterm frequencies:

'A B': 1
'B C': 3
'A C': 1
'B B': 1

These biterms are used to create the model.
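The counts above can be reproduced with a few lines of base R; the sliding-window enumeration below is an illustrative helper, not a function from the BTM package:

## Sketch: enumerate the biterms of the document 'A B C B' with a window of size 3
tokens <- c("A", "B", "C", "B")
window <- 3
biterms <- character()
for (start in 1:(length(tokens) - window + 1)) {
  win   <- tokens[start:(start + window - 1)]
  pairs <- combn(win, 2)   # all word pairs within this text window
  biterms <- c(biterms, apply(pairs, 2, function(p) paste(sort(p), collapse = " ")))
}
table(biterms)
## biterms
## A B A C B B B C
##   1   1   1   3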
References

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013. https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
See also

predict.BTM, terms.BTM, logLik.BTM
Examples

library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)

## Another small run with first topic the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)

##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2),
                   by = list(doc_id)]
head(biterms)
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE,
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)
bitermset$n
sum(biterms$cooc)

## Not run:
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)

library(LDAvis)
docsize <- table(x$doc_id)
scores  <- predict(model, x)
scores  <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi),
  theta = scores,
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token,
  term.frequency = model$vocabulary$freq)
serVis(json)

## End(Not run)