knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

par(bg = '#f9f7f1')
reticulate::use_virtualenv("../env", required = TRUE)

Note that there is no universal way to determine the best number of topics (num_topics) for a set of documents; see this post.

Preprocess

As stated in Table 2 of this paper, this corpus essentially has two classes of documents: the first five are about human-computer interaction and the other four are about graphs. Therefore, a procedure for assessing the best number of topics for this corpus should return 2.

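library(gensimr)

# Load the example corpus shipped with gensimr
data("corpus", package = "gensimr")

# Preprocess the raw documents into tokens
texts <- prepare_documents(corpus)
# Build a dictionary mapping each token to an integer id
dictionary <- corpora_dictionary(texts)
# Convert each document to its bag-of-words representation
corpus_bow <- doc2bow(dictionary, texts)

# Fit a TF-IDF model, then apply it to the bag-of-words corpus
tfidf <- model_tfidf(corpus_bow, id2word = dictionary)
corpus_tfidf <- wrap(tfidf, corpus_bow)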

Tune

We can fit several Latent Dirichlet Allocation models, each with a different number of topics, and then assess which is best using the perplexity score.

models <- map_model(
  num_topics = c(2, 4, 8, 10, 12),
  corpus = corpus_tfidf, 
  id2word = dictionary
) 

plot(models)
get_perplexity_data(models)
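
For reference, the usual textbook definition of perplexity is sketched below. The notation is an assumption introduced here for illustration and does not come from gensimr: $D$ is the evaluation corpus, $\mathbf{w}_d$ the words of document $d$, and $N_d$ its token count. Under this definition, a lower value indicates a better fit. The underlying gensim library reports a per-word likelihood bound and derives perplexity from it as $2^{-\text{bound}}$, so it is worth checking which of the two quantities get_perplexity_data returns before comparing models.

```latex
\[
  \operatorname{perplexity}(D)
    = \exp\!\left(
        -\,\frac{\sum_{d \in D} \log p(\mathbf{w}_d)}
                {\sum_{d \in D} N_d}
      \right)
\]
```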


news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.