knitr::read_chunk(here::here("code", "chunk-options.R"))
devtools::load_all()



Color distributions by topic

The last thing we'll look at before presenting plots for the final model is the color distribution over each topic. This gives us a picture of what our color themes actually are!

library(dplyr)
library(purrr)
library(ggplot2)
load_data(sample_data = FALSE)
knitr::read_chunk(here::here("code", "compare-models.R"))

Color distributions over topics

For these plots the distribution is represented by a weighted relevance score (the same one that is used in the [`LDAvis` package](http://www.kennyshirley.com/LDAvis/#topic=0&lambda=0.61&term=)).

The $\beta$ matrix gives the posterior distribution of words given a topic, $p(w|t)$. Relevance is computed as

\[ \text{relevance}(w|t) = \lambda \cdot p(w|t) + (1-\lambda)\cdot \frac{p(w|t)}{p(w)}, \]

where $p(w)$ is the overall probability of word $w$ and $\lambda$ is a weight between 0 and 1.
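As a rough illustration of the relevance formula as written above, here is a minimal sketch in R. The function name `relevance()`, the toy `beta` matrix, and the choice to estimate $p(w)$ by averaging over equally weighted topics are all assumptions made for this example; they are not taken from the package code.

```r
# Hypothetical helper: relevance of each word (color) within each topic.
# beta:      K x V matrix, one row per topic, each row holding p(w|t).
# word_prob: length-V vector of overall word probabilities p(w).
# lambda:    weight in [0, 1] between p(w|t) and the ratio p(w|t) / p(w).
relevance <- function(beta, word_prob, lambda = 0.6) {
  stopifnot(ncol(beta) == length(word_prob), lambda >= 0, lambda <= 1)
  # sweep() divides each column of beta by the matching p(w)
  lift <- sweep(beta, 2, word_prob, "/")
  lambda * beta + (1 - lambda) * lift
}

# Toy example: 2 topics over 3 colors (made-up numbers).
beta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), c("red", "blue", "black"))
)

# Assumption: both topics are equally likely, so p(w) is the column mean.
word_prob <- colMeans(beta)

rel <- relevance(beta, word_prob, lambda = 0.6)

# Colors ordered from most to least relevant within each topic.
ranked <- apply(rel, 1, function(x) names(sort(x, decreasing = TRUE)))
print(ranked)
```

With $\lambda = 1$ the ranking depends only on $p(w|t)$, while with $\lambda = 0$ it depends only on the ratio $p(w|t)/p(w)$, which favors colors that are rare overall but concentrated in one topic.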


How many themes?

Even though our model scores might have leaned towards a model with fewer topics, we can see specific cases where adding more topics separates themes that appear to be quite different. The first two examples show topic 2 from the 30-topic model (the first plot), which seems more coherent in the 40-topic model (the second plot).

Topic 2 from the 30 topic model



Topic 2 from the 40 topic model



One final plot, from topic 32, which I thought looked questionable but which seems to have grouped some related (if small) sets.





nateaff/legolda documentation built on May 18, 2019, 10:15 a.m.