perplexity: Perplexity of a topic model

Description Usage Arguments Examples

View source: R/perplexity.R

Description

Given a document-term matrix, a topic-word distribution, and a document-topic distribution, computes the perplexity of the topic model.
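For reference, the conventional definition of perplexity for a topic model is sketched below (text2vec's implementation should agree with this up to numerical details, but check the source in R/perplexity.R if exactness matters). Here X is the document-term count matrix, theta (rows of doc_topic_distribution) and phi (rows of topic_word_distribution) are the fitted distributions:

```latex
\mathrm{perplexity}(X) \;=\; \exp\!\left(-\,\frac{\sum_{d,w} X_{dw}\,\log p_{dw}}{\sum_{d,w} X_{dw}}\right),
\qquad
p_{dw} \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kw},
```

i.e. the exponentiated average negative log-likelihood per observed token; lower values indicate a better fit.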

Usage

perplexity(X, topic_word_distribution, doc_topic_distribution)

Arguments

X

sparse document-term matrix containing term counts. Internally Matrix::RsparseMatrix is used. If class(X) != 'RsparseMatrix', the function will try to coerce X to RsparseMatrix via an as() call.

topic_word_distribution

dense matrix for the topic-word distribution. Number of rows = n_topics, number of columns = vocabulary_size. The elements of each row should sum to 1: each row is a distribution over words for one topic.

doc_topic_distribution

dense matrix for the document-topic distribution. Number of rows = n_documents, number of columns = n_topics. The elements of each row should sum to 1: each row is a distribution over topics for one document.

Examples

library(text2vec)
data("movie_review")
n_iter = 10
train_ind = 1:200
ids = movie_review$id[train_ind]
txt = tolower(movie_review[['review']][train_ind])
names(txt) = ids
tokens = word_tokenizer(txt)
it = itoken(tokens, progressbar = FALSE, ids = ids)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5, doc_proportion_min = 0.02)
dtm = create_dtm(it, vectorizer = vocab_vectorizer(vocab))
n_topic = 10
model = LDA$new(n_topic, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr  =
  model$fit_transform(dtm, n_iter = n_iter, n_check_convergence = 1,
                      convergence_tol = -1, progressbar = FALSE)
topic_word_distr_10 = model$topic_word_distribution
perplexity(dtm, topic_word_distr_10, doc_topic_distr)
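The same quantity can be computed by hand. The following is a minimal, self-contained sketch on toy data using only base R and the Matrix package (not text2vec internals); the matrices X, theta, and phi are made-up stand-ins for dtm, doc_topic_distr, and topic_word_distr_10 above:

```r
library(Matrix)
set.seed(1)

# Toy sizes: 3 documents, 4 topics, 5 vocabulary terms
X = Matrix(matrix(rpois(15, 2), 3, 5), sparse = TRUE)  # term counts
theta = matrix(runif(12), 3, 4)
theta = theta / rowSums(theta)  # each row: distribution over topics
phi = matrix(runif(20), 4, 5)
phi = phi / rowSums(phi)        # each row: distribution over words

# Per-document word probabilities p_dw = sum_k theta_dk * phi_kw
p = theta %*% phi

# Log-likelihood of the observed counts, then perplexity
ll = sum(X * log(p))
perp = exp(-ll / sum(X))
perp
```

Since each p_dw lies strictly between 0 and 1, the result is always greater than 1; a perfect model would approach 1.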

Example output

INFO [2018-07-23 01:47:37] iter 1 loglikelihood = -190495.486
INFO [2018-07-23 01:47:37] iter 2 loglikelihood = -185686.296
INFO [2018-07-23 01:47:37] iter 3 loglikelihood = -181815.838
INFO [2018-07-23 01:47:37] iter 4 loglikelihood = -179421.035
INFO [2018-07-23 01:47:37] iter 5 loglikelihood = -177038.897
INFO [2018-07-23 01:47:37] iter 6 loglikelihood = -175294.319
INFO [2018-07-23 01:47:37] iter 7 loglikelihood = -173912.423
INFO [2018-07-23 01:47:37] iter 8 loglikelihood = -172635.478
INFO [2018-07-23 01:47:37] iter 9 loglikelihood = -171579.226
INFO [2018-07-23 01:47:37] iter 10 loglikelihood = -170751.010
[1] 246.5976

text2vec documentation built on Jan. 12, 2018, 1:04 a.m.