This vignette describes the most basic usage of the `sentopics` package by estimating an LDA model and analyzing its output. Two other vignettes, describing time series and topic models with sentiment, are also available.

knitr::opts_chunk$set( collapse = TRUE, comment = "#", fig.width = 7, fig.height = 4, fig.align = "center" )

The package is shipped with a sample of press conferences from the European Central Bank. For ease of use, the press conferences have been pre-processed into a `tokens` object from the `quanteda` package (see quanteda's introduction for details on these objects). The press conferences also contain metadata, which can be accessed using `docvars()`.

The press conferences were obtained from the ECB's website. The package also provides a helper function to replicate the creation of the dataset: `get_ECB_press_conferences()`.

library("sentopics")
data("ECB_press_conferences_tokens")
print(ECB_press_conferences_tokens, 3)
head(docvars(ECB_press_conferences_tokens))

`sentopics` implements three types of topic model. The simplest, Latent Dirichlet Allocation (LDA), assumes that textual documents are produced by a generative process involving $K$ topics.

A given document $d$ consists of a sequence of words $d = (w_1, \dots, w_N)$, with $N$ being the document's length. Each word $w_i$ originates from a vocabulary consisting of $V$ distinct terms. Documents are then generated by the following random process:

- For each topic $k \in \{1, \dots, K\}$, a distribution $\phi_k$ over the vocabulary is drawn. This distribution represents the probability of a word appearing given that it belongs to the topic, and is drawn from a Dirichlet distribution with hyperparameter $\beta$. $$\phi_k \sim Dirichlet(\beta)$$
- For each document $d$, a mixture of the $K$ topics, $\theta_d$, gives the probability of a word in document $d$ being generated from topic $k$. This mixture is also drawn from a Dirichlet distribution, with hyperparameter $\alpha$. $$\theta_d \sim Dirichlet(\alpha)$$
- For each word position $i$ of document $d$, the following sequence of draws is executed:
  - A latent topic assignment $z_i$ is drawn from the document mixture. $z_i \sim Multinomial(\theta_d)$
  - A word $w_i$ is drawn from the topic's vocabulary distribution. $w_i \sim Multinomial(\phi_{z_i})$
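The generative process described above can be sketched in a few lines of base R. This is only an illustrative simulation under assumed toy sizes and hyperparameter values (`K`, `V`, `D`, `N`, `alpha`, `beta` and the helper `rdirichlet1()` are all hypothetical names chosen here); it does not use or reproduce any `sentopics` function.

```r
# Minimal base-R simulation of the LDA generative process (toy sizes, illustrative only).
set.seed(1)

K <- 3; V <- 8; D <- 4; N <- 20   # topics, vocabulary size, documents, words per document
alpha <- 1; beta <- 0.5           # symmetric Dirichlet hyperparameters (arbitrary values)

# Hypothetical helper: one draw from a symmetric Dirichlet, obtained by
# normalising independent gamma variates.
rdirichlet1 <- function(n, conc) {
  g <- rgamma(n, shape = conc)
  g / sum(g)
}

# Step 1: one vocabulary distribution phi_k per topic (K x V matrix, rows sum to 1).
phi <- t(replicate(K, rdirichlet1(V, beta)))

# Step 2: one topic mixture theta_d per document (D x K matrix, rows sum to 1).
theta <- t(replicate(D, rdirichlet1(K, alpha)))

# Step 3: for each word position, draw a topic z_i, then a word w_i from phi[z_i, ].
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, N, replace = TRUE, prob = theta[d, ])
  w <- vapply(z, function(k) sample.int(V, 1, prob = phi[k, ]), integer(1))
  data.frame(position = seq_len(N), topic = z, word = w)
})

head(docs[[1]])
```

Each element of `docs` holds the latent topic and the vocabulary index drawn at every word position. In a real application only the words are observed, and the topics have to be inferred.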

In `sentopics`, the LDA model is estimated through Gibbs sampling, which iteratively samples the topic assignment $z_i$ of every word of the corpus until convergence. The topic assignments are sampled from the following distribution: $$ p(z_i = k \mid w, z^{-i}) \propto
\frac{n_{k,v,.}^{-i} + \beta}{n_{k,.,.}^{-i} + V\beta}
\frac{n_{k,.,d}^{-i} + \alpha}{n_{.,.,d}^{-i} + K\alpha},$$ where $n_{k,v,d}$ is the count of words at index $v$ of the vocabulary, assigned to topic $k$ and part of document $d$. Here, $v$ is the vocabulary index of $w_i$ and $d$ is the document containing position $i$. Replacing one of the indices $\{k,v,d\}$ by a dot indicates that the count is taken over all topics, all vocabulary indices or all documents, respectively. The superscript $-i$ indicates that the current word position $i$ is left out of the count variables.
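To make the sampling distribution above concrete, here is a minimal collapsed Gibbs sampler written in base R. It is a sketch under assumptions: the tiny corpus, the hyperparameter values and the variable names (`n_kv`, `n_kd`) are made up for illustration, and the code is not the package's implementation. The final lines compute the usual smoothed point estimates of the mixtures from the last state of the counts.

```r
# Toy collapsed Gibbs sampler for LDA in base R (illustrative, not the package's code).
set.seed(42)

K <- 2; V <- 6; alpha <- 1; beta <- 0.1
# Hypothetical corpus: each document is a vector of vocabulary indices.
docs <- list(c(1, 2, 1, 3, 2), c(4, 5, 6, 5, 4), c(1, 2, 3, 6, 5))

# Random initial topic assignments, plus the count tables n_{k,v,.} and n_{k,.,d}.
z <- lapply(docs, function(w) sample.int(K, length(w), replace = TRUE))
n_kv <- matrix(0, K, V)
n_kd <- matrix(0, K, length(docs))
for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  n_kv[z[[d]][i], docs[[d]][i]] <- n_kv[z[[d]][i], docs[[d]][i]] + 1
  n_kd[z[[d]][i], d] <- n_kd[z[[d]][i], d] + 1
}

for (iter in 1:50) {
  for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
    v <- docs[[d]][i]; k_old <- z[[d]][i]
    # Remove position i from the counts (the "-i" superscript).
    n_kv[k_old, v] <- n_kv[k_old, v] - 1
    n_kd[k_old, d] <- n_kd[k_old, d] - 1
    # Unnormalised conditional probability of each topic for this position.
    p <- (n_kv[, v] + beta) / (rowSums(n_kv) + V * beta) *
      (n_kd[, d] + alpha) / (sum(n_kd[, d]) + K * alpha)
    k_new <- sample.int(K, 1, prob = p)
    # Add the position back under its newly sampled topic.
    z[[d]][i] <- k_new
    n_kv[k_new, v] <- n_kv[k_new, v] + 1
    n_kd[k_new, d] <- n_kd[k_new, d] + 1
  }
}

# Smoothed point estimates of the mixtures from the final counts.
phi   <- (n_kv + beta) / (rowSums(n_kv) + V * beta)                      # K x V
theta <- t((n_kd + alpha) / rep(colSums(n_kd) + K * alpha, each = K))    # D x K
```

The two fractions in `p` mirror the two factors of the formula: the first favours topics in which word $v$ is already frequent, the second favours topics that are already frequent in document $d$.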

The estimation of an LDA model is easily replicated using the `LDA()` and `grow()` functions. The first function prepares the `R` object and initializes the assignment of the latent topics. The second function estimates the model using Gibbs sampling for a given number of iterations. Note that `grow()` may be called repeatedly to continue iterating the model without resetting the estimation.

set.seed(123)
lda <- LDA(ECB_press_conferences_tokens)
lda
lda <- grow(lda, iterations = 100)
lda

Internally, the `lda` object is stored as a list and contains the model's parameters and outputs.

str(lda, max.level = 1, give.attr = FALSE)

- `tokens` is the initial tokens object used to create the model.
- `vocabulary` is a data.frame indexing the set of words.
- `K` is the number of topics.
- `alpha` is the hyperparameter of the document-topic mixtures.
- `beta` is the hyperparameter of the topic-word mixtures.
- `it` is the number of iterations of the model.
- `za` contains the topic assignments of each word of the corpus.
- `theta` are the estimated document-topic mixtures.
- `phi` are the estimated topic-word mixtures.
- `logLikelihood` is the log-likelihood of the model at each iteration.

Estimated mixtures are easily accessible through the `$` operator. The package also includes the `topWords()` function to extract the most probable words of each topic. `topWords()` supports three types of output: a *long* `data.table`/`data.frame`, a `matrix`, or a `ggplot` object (the latter is also accessible through the alias `plot_topWords()`).

head(lda$theta)
topWords(lda, output = "matrix")
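Conceptually, extracting the most probable words of a topic amounts to sorting its topic-word distribution. The base-R sketch below illustrates this on a made-up matrix. The values, the orientation (topics in rows) and the helper `top_n_words()` are assumptions for illustration only, and may differ from how the package stores `phi` or implements `topWords()`.

```r
# Toy topic-word matrix (hypothetical values): rows are topics, columns are words.
phi <- rbind(
  topic1 = c(inflation = 0.50, price = 0.30, bank = 0.15, growth = 0.05),
  topic2 = c(inflation = 0.05, price = 0.10, bank = 0.25, growth = 0.60)
)

# Hypothetical helper: for each topic, sort words by decreasing probability
# and keep the names of the first n.
top_n_words <- function(phi, n = 2) {
  apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

top_n_words(phi, n = 2)
```

The result is a character matrix with one column per topic and one row per rank.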

In addition, document-level analysis is facilitated through the `melt()` method, which joins the estimated topical proportions to the document metadata present in the `tokens` input. This results in a *long* `data.table`/`data.frame` that can be used for plotting or easily reshaped to a wide format (for example using `data.table::dcast`).

melt(lda, include_docvars = TRUE)
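As a rough illustration of the long-to-wide reshaping mentioned above, the following sketch applies `data.table::dcast()` to a small hand-made table. The column names (`doc`, `topic`, `prob`) and the values are assumptions chosen for the example, so the actual columns returned by `melt()` should be checked before adapting the formula.

```r
library(data.table)

# Hypothetical long-format table; the column names are assumptions for illustration.
long <- data.table(
  doc   = rep(c("doc1", "doc2"), each = 3),
  topic = rep(c("topic1", "topic2", "topic3"), times = 2),
  prob  = c(0.2, 0.5, 0.3, 0.6, 0.1, 0.3)
)

# One row per document, one column per topic.
wide <- dcast(long, doc ~ topic, value.var = "prob")
wide
```

The left-hand side of the formula lists the columns that identify a row in the wide table, so document metadata columns would be added there to keep them in the result.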

To ease the analysis of the results, we can rename the default topic labels using the `sentopics_labels()` function. As a result, all outputs of the model will now display the custom labels.

sentopics_labels(lda) <- list(
  topic = c("Inflation", "Fiscal policy", "Governing council", "Financial sector", "Uncertainty")
)
head(lda$theta)
plot_topWords(lda) + ggplot2::theme_grey(base_size = 9)

Besides modifying topic labels, it is also possible to merge topics into a broader theme. This is often useful when estimating a large number of topics (e.g., K > 15). The `mergeTopics()` function does this job and relabels the topics accordingly.

merged <- mergeTopics(lda, list(
  `Big big thematic` = c(1, 3:5),
  `Fiscal policy` = 2
))
merged

Note that merging topics is only useful for presentation purposes. Using `grow()` again on a model with merged topics will drastically change the results, as the current state of the model does not result from a standard estimation with the merged set of parameters.

Provided that the `plotly` package is installed, one can also directly use `plot()` on the estimated topic model to obtain an interactive view of topic proportions and their most probable words (presented as a screenshot hereafter to limit this vignette's size).

```
plot(lda)
```

suppressWarnings({ plotly::save_image(plot(lda), file = "plotly1.svg") })

knitr::include_graphics("plotly1.svg")
