The tidytopics package is a toolkit for topic modeling and analysis of topic models in R. The main focus is to facilitate pre-processing, post-processing and analysis of topic models for both practictioners and researchers. It does not contain any samplers for topic models (yet). For the topic model notation used throughout this package, see the notation section below.
This is a short introduction to the package, the (tidy) data structures and some basic functionalities implemented so far. This package is built upon the ideas of data processing proposed in the dplyr
package and the tidytext
R package for text analysis.
For the dplyr
package a good introduction can be found here and a god cheetsheet can be found here. The package tidytext
has a nice introduction here and a nice book on tidy text mining can be found here.
This package contains the following vignettes on topic modeling analysis. Use vignette()
to access the vignettes from R.
vignette("bayesian_checking")
The package contain datasets that can be accessed with data([dataset name])
. Use ?[dataset name]
to get more information on each dataset and data(package="tidytopics")
to see a list of the datasets.
The datastructures in tidytopics is based on the same tidy data structure proposed by Hadley Wickham in tidyr
. So the following datastructures are tibble
data.frames
but with some restrictions on the content. There are three main data structures in tidytopics:
| Symbol| Description |
|-------|-----------------------------------------|
| tidy_topic_state
| A topic model token level state (of topic indicators $z$) |
| tidy_topic_array
| A tidy array (aggregation) of topic indicators. Must have the count variable n
and the integer variable topic
|
| tidy_topic_matrix
| Same as tidy_topic_array
but can only have one other variable than n
and topic
|
The purpose is to have a way of handling the often very large corpuses in a simple and effective way and make use of the sparsity for topic model analysis.
The basic structure is the tidy_topic_state
. As an example a topic model with 50 topics (one Gibbs draw) has been included for the State of the Union Addresses. Each document is a paragraph of the State of The Union Addresses.
suppressPackageStartupMessages(library(tidytopics)) suppressPackageStartupMessages(library(tidytext)) suppressPackageStartupMessages(library(dplyr))
library(tidytopics) library(tidytext) library(dplyr) data("sotu50") sotu50 is.tidy_topic_state(sotu50)
Based on this topic model state we can easily do post-processing analyses, such as the top terms per topic.
top_terms(sotu50, "type_probability")
See ?top_terms
for other ways of choosing terms to represent a topic.
We can also easily compute the type topic matrix $n_w$ or the document topic matrix $n_d$ in a tidy format.
nw <- topic_type_matrix(sotu50) nw nd <- doc_topic_matrix(sotu50) nd
These are tidy_topic_matrix
es and also tidy_topic_array
s. But if we add an extra variable (such as President) the resulting tibble is not longer a tidy_topic_matrix
but a tidy_topic_array
.
is.tidy_topic_matrix(nw) is.tidy_topic_array(nw) data("sotu") nd <- left_join(nd, sotu[,c("doc" )], by = "doc") is.tidy_topic_matrix(nd) is.tidy_topic_array(nd)
It is also possible to convert to and from a sparse matrix object from the Matrix
package.
nw <- as.sparseMatrix(nw) nw[1:10, 1:5]
For more analytical features see the other vignettes.
Currently import and export is only possible from a mallet state file.
Currently the package handles importing of state files from Mallet. To fit a model with mallet see the hompage or use the R package mallet to fit a model. An introduction to fit a topic model with R can be found here.
The following mallet state file is not a part of this package (due to the size of the file ~8 Mb). But to run this code you can download the file here.
data("sotu50") my_sotu <- sotu50 alpha <- c(0.02136692 ,0.01317826 ,0.03399012 ,0.04877238 ,0.05565182 ,0.02634909 ,0.07547259 ,0.1090051 ,0.0343327 ,0.1987157 ,0.01995965 ,0.1032003 ,0.0476333 ,0.02253602 ,0.1115289 ,0.01482319 ,0.01683189 ,0.04114055 ,0.0393674 ,0.01632598 ,0.02016105 ,0.03419158 ,0.2109707 ,0.02525466 ,0.06465434 ,0.02042515 ,0.04062194 ,0.04762343 ,0.04169344 ,0.05274674 ,0.02347339 ,0.02688192 ,0.01971497 ,0.08863154 ,0.02162647 ,0.08386104 ,0.01468114 ,0.01798629 ,0.04029905 ,0.05446011 ,0.02651762 ,0.03284185 ,0.02257509 ,0.06026494 ,0.07507902 ,0.07178521 ,0.03777182 ,0.0702841 ,0.02130745 ,0.0345557) beta <- 0.01228582
The statefile contain the topic indicator states and the hyperparameters $\alpha$ and $\beta$.
my_sotu <- read_mallet_statefile("sotu50.txt.gz") alpha <- read_mallet_statefile("sotu50.txt.gz", type = "alpha") beta <- read_mallet_statefile("sotu50.txt.gz", type = "beta")
my_sotu
alpha[1:5]
beta
It is also possible to write out the current state file to a mallet state format.
write_mallet_statefile(state = my_sotu, alpha = alpha, beta = beta, state_file = "my_mallet_state.txt.gz")
The general notation in this package is as follows:
| Symbol| Description | |-------|-----------------------------------------| | $V$ | Vocabulary size. Number of unique terms/types. | | $D$ | The number of documents. | | $K$ | The number of topics. | | $N$ | The number of tokens. | | $N_d$ | The number of tokens in document $d$. | | $w_{i,d}$ | Token $i$ in document $d$ | | $v_{i,d}$ | Type of token $i$ in document $d$ | | $z_{i,d}$ | Topic indicator for token $i$ in document $d$ | | $\Phi$ | Dirichlet distribution over types by topic of size $(K \times V)$ | | $\phi_k$ | Dirichlet distribution over types for topic $k$. | | $\Theta$ | Dirichlet distribution over topics by documents of size $(D \times K)$ | | $\theta_d$ | Dirichlet distribution over topics for document $d$. | | $n_w$ | Topic indicator counts by topics and types of size $(K \times V)$. | | $n_d$ | Topic indicator counts by documents and topic of size $(D \times K)$. | | $beta_{k,d}$ | Prior for $\Phi_{k,d}$. | | $alpha_{k,d}$ | Prior for $\Theta_{k,d}$. |
This vignette was created with:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.