library(knitr) opts_chunk$set(fig.path = "figures_lda/")
Topic modelling techniques such as Latent Dirichlet Allocation (LDA) can be a usefull tool for social scientists to analyze large amounts of natural language data. Algorithms for LDA are available in R, for instance in the topicmodels
package. In this howto we demonstrate several function in the corpustools
package that facilitate the use of LDA using the topicmodels
package.
As a starting point we use a Document Term Matrix (dtm) in the DocumentTermMatrix
format offered in the tm
package. Note that we also offer a howto for creating the dtm.
library(corpustools) data(sotu) # state of the union speeches by Barack Obama and George H. Bush. head(sotu.tokens) sotu.tokens = sotu.tokens[sotu.tokens$pos1 %in% c('N','M','A'),] # only select nouns, proper nouns and adjectives. dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma) dtm
Not all terms are equally informative of the underlying semantic structures of texts, and some terms are rather useless for this purpose. For interpretation and computational purposes it is worthwhile to delete some of the less usefull words from the dtm before fitting the LDA model. We offer the term.statistics
function to get some basic information on the vocabulary (i.e. the total set of terms) of the corpus.
termstats = term.statistics(dtm) head(termstats)
We can now filter out words based on this information. In our example, we filter on terms that occur at least in five documents and that do not contain numbers. We also select only the 3000 terms with the highest tf-idf score (3000 is not a common standard. For large corpora it makes sense to include more terms).
termstats = termstats[termstats$docfreq >= 5 & termstats$number==F,] filtered_dtm = dtm[,termstats$term] # select only the terms we want to keep
Now we are ready to fit the model! We made a wrapper called lda.fit
for the LDA
function in the topicmodels
package. This wrapper doesn't do anything interesting, except for deleting empty columns/rows from the dtm, which can occur after filtering out words.
The main input for topmod.lda.fit
is:
- the document term matrix
- K: the number of topics (this has to be defined a priori)
- Optionally, it can be usefull to increase the number of iterations. This takes more time, but increases performance.
m = lda.fit(filtered_dtm, K=20, num.iterations=1000) terms(m, 10)[,1:5] # show first 5 topics, with ten top words per topic
We now have a fitted lda model. The terms
function shows the most prominent words for each topic (we only selected the first 5 topics for convenience).
One of the thing we can do with the LDA topics, is analyze how much attention they get over time, and how much they are used by different sources (e.g., people, newspapers, organizations). To do so, we need to match this article metadata. We can order the metadata to the documents in the LDA model by matching it to the documents slot.
head(sotu.meta) meta = sotu.meta[match(m@documents, sotu.meta$id),]
We can now do some plotting. First, we can make a wordcloud for a more fancy (and actually quite informative and intuitive) representation of the top words of a topic.
lda.plot.wordcloud(m, topic_nr=1) lda.plot.wordcloud(m, topic_nr=2)
With lda.plot.time
and lda.plot.category
, we can plot the salience of the topic over time and for a given categorical variable.
lda.plot.time(m, 1, meta$date, date_interval='year') lda.plot.category(m, 1, meta$headline)
It can be usefull to print all this information together. That is what the following function does.
lda.plot.topic(m, 1, meta$date, meta$headline, date_interval='year') lda.plot.topic(m, 2, meta$date, meta$headline, date_interval='year')
With the topics.plot.alltopics
function all topics can be visualized and saved as images. This function words the same as topics.plot.topic
, with an additional argument to specify the folder in which the images should be saved.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.