knitr::read_chunk(here::here("code", "chunk-options.R"))



Selecting a topic model

As with kmeans clustering, the LDA model requires the user to select the number of topics $k$. In this section we test several methods that could be used to automate scoring the models and selecting $k$.

Cross validation on perplexity

The LDA model learns two posterior distributions, the distribution of terms within each topic and the distribution of topics within each document, which are the optimization routine's best guess at the distributions that generated the data. One way to test how well those distributions fit our data is to fit the model on a training set and evaluate it on a holdout set. Perplexity measures how well the fitted model predicts the held-out documents: it is the exponential of the negative average per-word log-likelihood, so lower values indicate a better fit.
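As a rough illustration of the measure, perplexity can be written as $\exp(-\sum_d \log p(w_d) / \sum_d N_d)$, where $N_d$ is the number of tokens in document $d$. The base R sketch below computes it from a count matrix and two probability matrices. The function name and the arguments `counts`, `theta` and `beta` are hypothetical, and the sketch assumes the document-topic proportions for the holdout documents have already been estimated; it is not the routine used in this analysis.

```r
# Perplexity of a document-term count matrix under a fitted topic model.
# counts: documents x terms matrix of token counts
# theta:  documents x topics matrix, each row sums to 1 (assumed already estimated)
# beta:   topics x terms matrix, each row sums to 1
perplexity_from_matrices <- function(counts, theta, beta) {
  word_probs <- theta %*% beta                 # p(term | document) for every pair
  nonzero <- counts > 0                        # only observed terms contribute
  log_lik <- sum(counts[nonzero] * log(word_probs[nonzero]))
  exp(-log_lik / sum(counts))                  # exponentiated negative mean log-likelihood
}

# Sanity check: a model that spreads probability evenly over 4 terms
# has perplexity equal to the vocabulary size.
uniform_theta  <- matrix(1, nrow = 2, ncol = 1)
uniform_beta   <- matrix(0.25, nrow = 1, ncol = 4)
example_counts <- matrix(c(3, 0, 1, 2,
                           1, 1, 0, 4), nrow = 2, byrow = TRUE)
uniform_perplexity <- perplexity_from_matrices(example_counts, uniform_theta, uniform_beta)
```

The uniform case is a useful reference point: a model that is no better than guessing uniformly over the vocabulary scores a perplexity equal to the number of terms.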

We cast the set_word table to a document-term matrix using tidytext's cast_dtm function. This returns a DocumentTermMatrix object from the tm package. The LDA function we use is from the topicmodels package. It offers a variational expectation-maximization (VEM) method and a Gibbs sampling method; I used the former.
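To make the data shape concrete, the sketch below builds a small document-term matrix from tidy counts and splits the documents into folds. It is a base R stand-in for the tidytext cast described above, not the call used in the analysis, and it produces a dense matrix rather than a tm object. The data frame `set_words_demo` and its column names are hypothetical.

```r
# Hypothetical tidy counts: one row per (set, color) pair with a count n.
set_words_demo <- data.frame(
  set_id = c("s1", "s1", "s2", "s2", "s3", "s4"),
  color  = c("red", "blue", "red", "black", "blue", "black"),
  n      = c(4, 2, 1, 3, 5, 2)
)

# Dense document-term matrix: rows are sets (documents), columns are colors (terms).
dtm_demo <- unclass(xtabs(n ~ set_id + color, data = set_words_demo))

# Assign each document to one of two folds for cross-validation.
set.seed(1)
fold_id <- sample(rep(1:2, length.out = nrow(dtm_demo)))

# Fold 1 is held out; the remaining documents form the training set.
train_dtm   <- dtm_demo[fold_id != 1, , drop = FALSE]
holdout_dtm <- dtm_demo[fold_id == 1, , drop = FALSE]
```

In a full cross-validation loop each fold takes a turn as the holdout set, and the perplexity scores are averaged over folds for each value of $k$.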

devtools::load_all()
knitr::read_chunk(here::here("code", "perplexity-cv.R"))



K-topic grid

From timing runs on different numbers of sets, I have seen that this small collection of sets does not support many topics. The grid therefore concentrates on lower values of $k$ but includes some higher values to show the trend more clearly.



ldatuning topic scores

The ldatuning package provides several other metrics of topic model quality. I have modified the main function from the package to return only the scores. (The original package computes the models first and then the scores.)

knitr::read_chunk(here::here("code", "ldatuning-scores.R"))

Topic coherence

There are several versions of topic coherence, which measure the pairwise strength of the relationship between the top terms in a topic model. Given some score where a larger value indicates a stronger relationship between two words $w_i, w_j$, a generic coherence score is the sum over the top terms in a topic model:
$$ \sum_{w_i, w_j \in W_t} \text{Score}(w_i, w_j), $$ with top terms $W_t$ for each topic $t$.
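The generic sum above can be sketched in base R as follows. The sketch assumes the sum runs over unordered pairs of top terms, and it plugs in one possible pairwise score, a smoothed log ratio of co-document frequency to document frequency. Both the function names and the toy matrix `doc_term` are hypothetical, and this is not necessarily the score the SpeedReader function computes.

```r
# Sum a pairwise score over all unordered pairs of a topic's top terms.
topic_coherence_sketch <- function(top_terms, score) {
  pairs <- combn(top_terms, 2)                       # one column per pair
  sum(apply(pairs, 2, function(p) score(p[1], p[2])))
}

# Hypothetical binary document-term matrix (documents x terms).
doc_term <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("red", "blue", "black"))
)

# One possible score: log of (documents containing both terms + 1)
# divided by the number of documents containing the second term.
codoc_score <- function(w_i, w_j) {
  both <- sum(doc_term[, w_i] > 0 & doc_term[, w_j] > 0)
  log((both + 1) / sum(doc_term[, w_j] > 0))
}

coherence_top2 <- topic_coherence_sketch(c("red", "blue"), codoc_score)
coherence_top3 <- topic_coherence_sketch(c("red", "blue", "black"), codoc_score)
```

Passing the score as a function keeps the summation separate from the choice of score, so other coherence variants can be compared by swapping in a different `score` argument.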

The coherence score used in the SpeedReader coherence function just uses the internal coherence of the top terms. I compared the scores for the top 3, 5 and 10 terms.

knitr::read_chunk(here::here("code", "coherence-scores.R"))

Cluster scoring

We can also treat the LDA models as clustering the LEGO sets. We assign each LEGO set to the color topic that has the highest value for that document. This is the topic that is most responsible for generating the document.
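A minimal sketch of this hard assignment, assuming the document-topic probabilities are available as a matrix with one row per set. The matrix `gamma_demo` and its values are made up for illustration.

```r
# Hypothetical gamma matrix: rows are sets, columns are topics, each row sums to 1.
gamma_demo <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.1, 0.8,
    0.3, 0.4, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("set_a", "set_b", "set_c"), paste0("topic_", 1:3))
)

# Cluster label for each set: the index of its highest-probability topic.
top_topic <- apply(gamma_demo, 1, which.max)

# The probability of that topic shows how strongly the set is associated with it.
top_gamma <- apply(gamma_demo, 1, max)
```

Here `set_c` is assigned to topic 2 with a probability of only 0.4, the kind of weak assignment that suggests the topics are spread too evenly over that document.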

The previous plot should indicate whether documents are getting strongly associated with a topic or if topics are too evenly distributed over all documents.

Clustering sets with kmeans

In this next section, I cluster documents using both kmeans and LDA topics. Kmeans is intended as a simple baseline clustering method: sets are clustered based on their term vectors weighted by TF-IDF scores.
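The baseline can be sketched in base R as below. The count matrix is hypothetical, and the TF-IDF weighting shown (term frequency times the natural log of the inverse document frequency) is one common definition, assumed here rather than taken from the analysis code.

```r
# Hypothetical count matrix: rows are sets, columns are colors.
counts <- matrix(
  c(5, 1, 0,
    4, 2, 0,
    0, 1, 6,
    0, 0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("set_", 1:4), c("red", "blue", "black"))
)

# Term frequency: share of each color within a set.
tf <- counts / rowSums(counts)

# Inverse document frequency: log(number of sets / sets containing the color).
idf <- log(nrow(counts) / colSums(counts > 0))

# Weight every column of tf by the matching idf value.
tfidf <- sweep(tf, 2, idf, `*`)

# Cluster the weighted term vectors; nstart reruns kmeans from several random starts.
set.seed(42)
km <- kmeans(tfidf, centers = 2, nstart = 10)
cluster_labels <- km$cluster
```

The resulting `cluster_labels` vector plays the same role as the top-topic assignment from an LDA model, so both can be scored against the same reference labels.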

Cluster analysis

The cluster scores include the Rand, adjusted Rand, Fowlkes-Mallows and Jaccard scores. All try to score a clustering on how well the discovered labels match the assigned labels -- here the root_id of the set. The Rand index assigns a score based on the number of pairwise agreements between the cluster labels and the original labels. The other measures are somewhat similar in approach.
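As an illustration of the pairwise idea, here is a small base R sketch of the unadjusted Rand index. The function name is hypothetical and this is not the scoring code used in the analysis. A pair of items counts as an agreement when both labelings put the pair together or both put it apart.

```r
# Rand index: fraction of item pairs on which two labelings agree.
rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  same_a <- outer(labels_a, labels_a, `==`)   # TRUE where a pair shares a label in A
  same_b <- outer(labels_b, labels_b, `==`)   # TRUE where a pair shares a label in B
  pair <- upper.tri(same_a)                   # count each unordered pair once
  mean(same_a[pair] == same_b[pair])
}

# Four items: the labelings disagree only on whether items 3 and 4 belong together.
example_rand <- rand_index(c(1, 1, 2, 2), c("a", "a", "b", "c"))
```

With four items there are six pairs and the labelings agree on five of them, so the index is 5/6. The label values themselves do not matter, only which items are grouped together.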

devtools::load_all()
knitr::read_chunk(here::here("code", "compare-cluster-scores.R"))

Topic Distribution

Another way to evaluate the quality of the topic models is to see how well documents are sorted into topics. This example follows a section from the tidy text mining book.

The topic model's gamma matrix has the distribution of topics over documents:
$$ \text{gamma} = p(t|d) $$ for topic $t$ and document $d$.

The plot below visualizes, for each topic, how the document probabilities are distributed over probability bins. If too many topics have sets or documents in the low probability bins then the number of topics may be too high, since few documents are being strongly associated with any topic.
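The binning behind such a plot can be sketched in base R. The long-format table `gamma_long`, its values, and the bin width of 0.25 are all hypothetical choices for illustration.

```r
# Hypothetical long-format gamma table: one row per (document, topic) pair.
gamma_long <- data.frame(
  document = rep(c("set_a", "set_b", "set_c"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.95, 0.05, 0.10, 0.90, 0.55, 0.45)
)

# Place each probability in one of four equal-width bins on [0, 1].
gamma_long$bin <- cut(gamma_long$gamma,
                      breaks = seq(0, 1, by = 0.25),
                      include.lowest = TRUE)

# Count documents per bin for each topic (rows are topics, columns are bins).
bin_counts <- table(topic = gamma_long$topic, bin = gamma_long$bin)
```

A topic whose counts pile up in the lowest bin with few in the highest bin is weakly associated with most documents, which is the pattern the paragraph above warns about.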

knitr::read_chunk(here::here("code", "gamma-distribution.R"))

Following the tidytext book, look at the distribution over topics.





nateaff/legolda documentation built on May 18, 2019, 10:15 a.m.