This vignette describes the most basic usage of the sentopics
package by estimating an LDA model and analysis it's output. Two other vignettes, describing time series and topic models with sentiment are also available.
knitr::opts_chunk$set( collapse = TRUE, comment = "#", fig.width = 7, fig.height = 4, fig.align = "center" )
The package is shipped with a sample of press conferences from the European Central bank. For ease of use, the press conferences have been pre-processed into a tokens
object from the quanteda
package. (See quanteda's introduction for details on these objects). The press conferences also contains meta-data which can be accessed using docvars()
.
The press conferences were obtained from ECB's website. The package also provides an helper function to replicate the creation of the dataset: get_ECB_press_conferences()
library("sentopics") data("ECB_press_conferences_tokens") print(ECB_press_conferences_tokens, 3) head(docvars(ECB_press_conferences_tokens))
sentopics
implements three types of topic model. The simplest, Latent Dirichlet Allocation (LDA), assumes that textual documents are issued from a generative process involving $K$ topics.
A given document $d$ is constituted of a list of words $d = (w_1, \dots, w_N)$, with $N$ being the document's length. Each word $w_i$ originates from a vocabulary consisting of $V$ distinct terms. Then, documents are generated from the following random process:
In sentopics
the LDA model is estimated through Gibbs sampling, that iteratively sample the topic assignment $z_i$ of every word of the corpus until reaching a convergence. The topic assignments are sampled from the following distribution: $$ p(z_i = k|w,z^{-i}) \propto
\frac{n_{k,v,.}^{-i} + \beta}{n_{k,.,.}^{-i} + V\beta}
\frac{n_{k,.,d}^{-i} + \alpha}{n_{.,.,d}^{-i} + K\alpha},$$ where $n_{k,v,d}$ is the count of words at index $v$ of the vocabulary, assigned to topic $k$ and part of document $d$. The replacement of one of the indices ${k,v,d}$ by a dot indicates instead the count for all topics, all vocabulary indices or all documents. The superscript $-i$ indicates that the current word position $i$ is left out from the count variables.
sentopics
The estimation of an LDA model is easily replicated using the LDA()
and fit()
function. The first function prepares the R
object and initialize the assignment of the latent topics. The second function estimates the model using Gibbs sampling for a given number of iterations. Note that fit()
may be used to iterate the model multiple times without resetting the estimation.
set.seed(123) lda <- LDA(ECB_press_conferences_tokens) lda lda <- fit(lda, iterations = 100) lda
Internally, the lda
object is stored as a list and contains the model's parameters and outputs.
str(lda, max.level = 1, give.attr = FALSE)
tokens
is the initial tokens object used to create the model. vocabulary
is a data.frame indexing the set of words. K
is the number of topics. alpha
is the hyperparameter of the document-topic mixtures. beta
is the hyperparameter of the topic-word mixtures. it
is the number of iterations of the model. za
contains the topic assignments of each word of the corpus. theta
are the estimated document-topic mixtures. phi
are the estimated topic-word mixtures. logLikelihood
is the log-likelihood of the model at each iteration.
Estimated mixtures are easily accessible through the $
operator. But the package also includes the topWords()
function to extract the most probable words of each topic. topWords()
includes three types of outputs: long data.table
/data-frame
, matrix
or ggplot
object (also accessible through the alias plot_topWords()
).
head(lda$theta) topWords(lda, output = "matrix")
In addition, document-level is facilitated through the use of the melt()
method, that joins estimated topical proportions to document metadata present in the tokens
input. This result in a long data.table
/data.frame
that can be used for plotting or easily reshaped to a wide format (for example using data.table::dcast
).
melt(lda, include_docvars = TRUE)
To ease the result analysis, we can rename the default topic labels using the sentopics_labels()
function. As a result, all outputs of the model will now display the custom labels.
sentopics_labels(lda) <- list( topic = c("Inflation", "Fiscal policy", "Governing council", "Financial sector", "Uncertainty") ) head(lda$theta) plot_topWords(lda) + ggplot2::theme_grey(base_size = 9)
Besides modifying topic labels, it is also possible to merge topics into a greater thematic. This is often useful when estimating a large number of topics (e.g, K > 15). The mergeTopics()
does this job and re-label topics accordingly.
merged <- mergeTopics(lda, list( `Big big thematic` = c(1, 3:5), `Fical policy` = 2 )) merged
Note that merging topics is only useful for presentation purpose. Using again fit
on a model with merged topics will drastically change the results as the current state of the model does not results from a standard estimation with the merged set of parameters.
Provided that the plotly
package is installed, one can also directly use plot()
on the estimated topic model to enjoy a dynamic view of topic proportions and their most probable words (presented as a screenshot hereafter to limit this vignette's size).
plot(lda)
suppressWarnings({ plotly::save_image(plot(lda), file = "plotly1.svg") })
knitr::include_graphics("plotly1.svg")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.