In news-r/gensimr: Topic Modelling

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

par(bg = '#f9f7f1')
reticulate::use_virtualenv("../env", required = TRUE)

Note that there is no universal way to determine the best number of topics (num_topics) for a given set of documents; see this post.

Preprocess

As stated in Table 2 of this paper, this corpus essentially contains two classes of documents: the first five are about human-computer interaction and the other four are about graphs. A procedure for choosing the best number of topics for this corpus should therefore return 2.

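library(gensimr)

# Load the example corpus shipped with the package
data("corpus", package = "gensimr")

# Preprocess the documents, then build the dictionary and bag-of-words corpus
texts <- prepare_documents(corpus)
dictionary <- corpora_dictionary(texts)
corpus_bow <- doc2bow(dictionary, texts)

# Fit a tf-idf model and apply it to the bag-of-words corpus
tfidf <- model_tfidf(corpus_bow, id2word = dictionary)
corpus_tfidf <- wrap(tfidf, corpus_bow)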

Tune

We can fit several Latent Dirichlet Allocation models, each with a different number of topics, then compare them using the perplexity score (lower is better).

# Fit one model per candidate number of topics
models <- map_model(
  num_topics = c(2, 4, 8, 10, 12),
  corpus = corpus_tfidf,
  id2word = dictionary
)

# Plot the models, then extract the perplexity data
plot(models)
get_perplexity_data(models)
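
For reference, the perplexity used to compare the models above is commonly defined as the exponentiated negative average per-word log-likelihood on a set of documents. The notation here is an assumption introduced for illustration and does not come from the gensimr documentation: M is the number of documents, N_d the number of words in document d, and p(w_d) the probability the fitted model assigns to the words of document d. The exact value gensimr reports may use a different base or a variational bound in place of the true likelihood.

```latex
\[
\operatorname{perplexity}(D)
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
\]
```

A lower perplexity means the model assigns higher probability to the observed words, which is why the candidate with the smallest score is usually preferred.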

news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.
