In MansMeg/tidytopics: Tidy Topic Modeling Toolbox for R

The tidytopics package is a toolkit for topic modeling and analysis of topic models in R. The main focus is to facilitate pre-processing, post-processing and analysis of topic models for both practictioners and researchers. It does not contain any samplers for topic models (yet). For the topic model notation used throughout this package, see the notation section below.

This is a short introduction to the package, the (tidy) data structures and some basic functionalities implemented so far. This package is built upon the ideas of data processing proposed in the dplyr package and the tidytext R package for text analysis.

For the dplyr package a good introduction can be found here and a god cheetsheet can be found here. The package tidytext has a nice introduction here and a nice book on tidy text mining can be found here.

Vignettes

This package contains the following vignettes on topic modeling analysis. Use vignette() to access the vignettes from R.

Bayesian Checking of Topic Models: vignette("bayesian_checking")

Included datasets

The package contain datasets that can be accessed with data([dataset name]). Use ?[dataset name] to get more information on each dataset and data(package="tidytopics") to see a list of the datasets.

Basic data structures

The datastructures in tidytopics is based on the same tidy data structure proposed by Hadley Wickham in tidyr. So the following datastructures are tibble data.frames but with some restrictions on the content. There are three main data structures in tidytopics:

| Symbol| Description | |-------|-----------------------------------------| | tidy_topic_state | A topic model token level state (of topic indicators $z$) | | tidy_topic_array | A tidy array (aggregation) of topic indicators. Must have the count variable n and the integer variable topic | | tidy_topic_matrix | Same as tidy_topic_array but can only have one other variable than n and topic |

The purpose is to have a way of handling the often very large corpuses in a simple and effective way and make use of the sparsity for topic model analysis.

Example

The basic structure is the tidy_topic_state. As an example a topic model with 50 topics (one Gibbs draw) has been included for the State of the Union Addresses. Each document is a paragraph of the State of The Union Addresses.

suppressPackageStartupMessages(library(tidytopics))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(dplyr))

library(tidytopics)
library(tidytext)
library(dplyr)

data("sotu50")
sotu50
is.tidy_topic_state(sotu50)

Based on this topic model state we can easily do post-processing analyses, such as the top terms per topic.

top_terms(sotu50, "type_probability")

See ?top_terms for other ways of choosing terms to represent a topic.

We can also easily compute the type topic matrix $n_w$ or the document topic matrix $n_d$ in a tidy format.

nw <- topic_type_matrix(sotu50)
nw
nd <- doc_topic_matrix(sotu50)
nd

These are tidy_topic_matrixes and also tidy_topic_arrays. But if we add an extra variable (such as President) the resulting tibble is not longer a tidy_topic_matrix but a tidy_topic_array.

is.tidy_topic_matrix(nw)
is.tidy_topic_array(nw)

data("sotu")
nd <- left_join(nd, sotu[,c("doc" )], by = "doc")

is.tidy_topic_matrix(nd)
is.tidy_topic_array(nd)

It is also possible to convert to and from a sparse matrix object from the Matrix package.

nw <- as.sparseMatrix(nw)
nw[1:10, 1:5]

For more analytical features see the other vignettes.

Import and export topic models

Currently import and export is only possible from a mallet state file.

Mallet state files

Currently the package handles importing of state files from Mallet. To fit a model with mallet see the hompage or use the R package mallet to fit a model. An introduction to fit a topic model with R can be found here.

The following mallet state file is not a part of this package (due to the size of the file ~8 Mb). But to run this code you can download the file here.

data("sotu50")
my_sotu <- sotu50
alpha <- c(0.02136692 ,0.01317826 ,0.03399012 ,0.04877238 ,0.05565182 ,0.02634909 ,0.07547259 ,0.1090051 ,0.0343327 ,0.1987157 ,0.01995965 ,0.1032003 ,0.0476333 ,0.02253602 ,0.1115289 ,0.01482319 ,0.01683189 ,0.04114055 ,0.0393674 ,0.01632598 ,0.02016105 ,0.03419158 ,0.2109707 ,0.02525466 ,0.06465434 ,0.02042515 ,0.04062194 ,0.04762343 ,0.04169344 ,0.05274674 ,0.02347339 ,0.02688192 ,0.01971497 ,0.08863154 ,0.02162647 ,0.08386104 ,0.01468114 ,0.01798629 ,0.04029905 ,0.05446011 ,0.02651762 ,0.03284185 ,0.02257509 ,0.06026494 ,0.07507902 ,0.07178521 ,0.03777182 ,0.0702841 ,0.02130745 ,0.0345557)
beta <- 0.01228582

The statefile contain the topic indicator states and the hyperparameters $\alpha$ and $\beta$.

my_sotu <- read_mallet_statefile("sotu50.txt.gz")
alpha <- read_mallet_statefile("sotu50.txt.gz", type = "alpha")
beta <- read_mallet_statefile("sotu50.txt.gz", type = "beta")

my_sotu
alpha[1:5]
beta

It is also possible to write out the current state file to a mallet state format.

write_mallet_statefile(state = my_sotu, 
                       alpha = alpha,
                       beta = beta,
                       state_file = "my_mallet_state.txt.gz")

Topic model notation used in this package

The general notation in this package is as follows:

| Symbol| Description | |-------|-----------------------------------------| | $V$ | Vocabulary size. Number of unique terms/types. | | $D$ | The number of documents. | | $K$ | The number of topics. | | $N$ | The number of tokens. | | $N_d$ | The number of tokens in document $d$. | | $w_{i,d}$ | Token $i$ in document $d$ | | $v_{i,d}$ | Type of token $i$ in document $d$ | | $z_{i,d}$ | Topic indicator for token $i$ in document $d$ | | $\Phi$ | Dirichlet distribution over types by topic of size $(K \times V)$ | | $\phi_k$ | Dirichlet distribution over types for topic $k$. | | $\Theta$ | Dirichlet distribution over topics by documents of size $(D \times K)$ | | $\theta_d$ | Dirichlet distribution over topics for document $d$. | | $n_w$ | Topic indicator counts by topics and types of size $(K \times V)$. | | $n_d$ | Topic indicator counts by documents and topic of size $(D \times K)$. | | $beta_{k,d}$ | Prior for $\Phi_{k,d}$. | | $alpha_{k,d}$ | Prior for $\Theta_{k,d}$. |