knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
par(bg = '#f9f7f1')
reticulate::use_virtualenv("../env", required = TRUE)
Note that there is no universal way to assess the best number of topics (num_topics) to fit to a set of documents; see this post.
As stated in table 2 of this paper, this corpus essentially has two classes of documents: the first five are about human-computer interaction and the other four are about graphs. A process for assessing the best number of topics for this corpus should therefore return 2.
library(gensimr)

data("corpus", package = "gensimr")

texts <- prepare_documents(corpus)
dictionary <- corpora_dictionary(texts)
corpus_bow <- doc2bow(dictionary, texts)

tfidf <- model_tfidf(corpus_bow, id2word = dictionary)
corpus_tfidf <- wrap(tfidf, corpus_bow)
We can fit several Latent Dirichlet Allocation models, each with a different number of topics, and then assess which is best using the perplexity score.
models <- map_model(
  num_topics = c(2, 4, 8, 10, 12),
  corpus = corpus_tfidf,
  id2word = dictionary
)

plot(models)
get_perplexity_data(models)
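For reference, here is a sketch of the textbook definition of perplexity for a topic model. The symbols are assumptions introduced for illustration and do not come from gensimr: M is the number of documents, w_d the words of document d, N_d its length in words, and p(w_d) the likelihood the fitted model assigns to that document. How the values reported by get_perplexity_data are scaled (for example, as a log or per-word bound) is not stated here, so check the package documentation before comparing numbers against this formula.

```latex
\mathrm{perplexity}(D) \;=\; \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Under this definition a lower perplexity means the model assigns higher probability per word to the documents, which is usually read as a better fit.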