BTM (R Documentation)
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window.
BTM models the biterm occurrences in a corpus (unlike LDA models which model the word occurrences in a document).
It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b=(wi,wj) is defined as: P(b) = ∑_z{P(wi|z)*P(wj|z)*P(z)}, where the sum runs over the k topics you want to extract.
Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates of P(w|z)=phi and P(z)=theta.
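As a minimal illustrative sketch (not part of the package API; the helper name biterm_probability is made up), this formula can be evaluated directly from the phi and theta elements returned by BTM (described in the value listing below):
## Illustrative sketch, assuming 'model' is a fitted BTM object:
## P(b) = sum over topics z of theta[z] * phi[wi, z] * phi[wj, z]
biterm_probability <- function(model, wi, wj) {
  stopifnot(all(c(wi, wj) %in% rownames(model$phi)))  # both words must be in the model vocabulary
  sum(model$theta * model$phi[wi, ] * model$phi[wj, ])
}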
BTM( data, k = 5, alpha = 50/k, beta = 0.01, iter = 1000, window = 15, background = FALSE, trace = FALSE, biterms, detailed = FALSE )
| data | a tokenised data frame containing one row per token with 2 columns: the first column is a document identifier and the second column contains the token (see the examples, which use the columns doc_id and lemma) | 
| k | integer with the number of topics to identify | 
| alpha | numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k. | 
| beta | numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01. | 
| iter | integer with the number of iterations of Gibbs sampling | 
| window | integer with the window size for biterm extraction. Defaults to 15. | 
| background | logical; if set to TRUE, the first topic is set to a background topic that equals the empirical word distribution. This can be used to filter out common words. Defaults to FALSE. | 
| trace | logical indicating whether to print out the evolution of the Gibbs sampling iterations. Defaults to FALSE. | 
| biterms | optionally, your own set of biterms to use for modelling. | 
| detailed | logical indicating whether to return detailed output which also contains the vocabulary and the biterms used to construct the model. Defaults to FALSE. | 
an object of class BTM which is a list containing
model: a pointer to the C++ BTM model
K: the number of topics
W: the number of tokens in the data
alpha: the symmetric Dirichlet prior probability of a topic P(z)
beta: the symmetric Dirichlet prior probability of a word given the topic P(w|z)
iter: the number of iterations of Gibbs sampling
background: logical indicating whether the first topic is set to the background topic that equals the empirical word distribution.
theta: a vector with the topic probability P(z), which is determined by the overall proportion of biterms assigned to each topic
phi: a matrix of dimension W x K with one row for each token in the data. This matrix contains the probability of the token given the topic P(w|z). The rownames of the matrix indicate the token w.
vocab: a data.frame with columns token and freq indicating the frequency of occurrence of the tokens in data. Only provided in case argument detailed is set to TRUE
biterms: the result of a call to terms with type set to biterms, containing all the biterms used in the model. Only provided in case argument detailed is set to TRUE
A biterm is defined as a pair of words co-occurring in the same text window. 
If you have, for example, a document with the word sequence 'A B C B' and the window size is set to 3,
there are two text windows which can generate biterms, namely
text window 'A B C' with biterms 'A B', 'B C', 'A C' and text window 'B C B' with biterms 'B C', 'C B', 'B B'.
A biterm is an unordered word pair, so 'B C' = 'C B'. Thus, the document 'A B C B' has the following biterm frequencies:
'A B': 1
'B C': 3
'A C': 1
'B B': 1
These biterms are used to create the model.
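For illustration only (the BTM function extracts biterms internally; this helper is not part of the package), a sliding-window biterm count can be sketched in plain R as follows:
## Illustrative sketch of sliding-window biterm extraction:
## enumerate all unordered word pairs within each text window and count them
count_biterms <- function(words, window = 3) {
  biterms <- character()
  for (start in seq_len(max(length(words) - window + 1, 1))) {
    w <- words[start:min(start + window - 1, length(words))]
    pairs <- combn(seq_along(w), 2)
    biterms <- c(biterms, apply(pairs, 2, function(i) paste(sort(w[i]), collapse = " ")))
  }
  table(biterms)
}
count_biterms(c("A", "B", "C", "B"), window = 3)
## reproduces the counts above: 'A B': 1, 'A C': 1, 'B B': 1, 'B C': 3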
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013, https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
predict.BTM, terms.BTM, logLik.BTM
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model  <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)
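## scores is expected to be a matrix of document-topic probabilities,
## with one row per doc_id and one column per topic (see predict.BTM)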
## Another small run with first topic the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)
##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma, 
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2), 
                   by = list(doc_id)]
head(biterms)
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE, 
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)
bitermset$n
sum(biterms$cooc)
## Not run: 
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)
library(LDAvis)
docsize <- table(x$doc_id)
scores  <- predict(model, x)
scores  <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi), 
  theta = scores, 
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token, 
  term.frequency = model$vocabulary$freq)
serVis(json)
## End(Not run)