---
title: "Topic Modelling"
author: "Josh, Callen, Alana"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Topic Modelling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
In machine learning and natural language processing, topic models are generative models that provide a probabilistic framework for the term frequency occurrences in a given collection of documents (e.g., news articles, social media texts). Latent Dirichlet Allocation (LDA) is a Bayesian mixture model for discrete data in which topics are assumed to be uncorrelated. It is the most widely used type of topic model for analyzing textual data, and it is well suited to grouping documents by the themes they share.
The {topicmodels} package implements LDA together with its estimation and inference algorithms. LDA topic modeling reveals a series of topics, each comprising keywords with a statistically high probability of co-occurrence, and the proportion of use of each topic within the corpus. The {TopicModelling} package provides supporting functions for LDA modeling. It includes a way of organizing data as a Document Term Matrix (DTM), a matrix that describes the frequency of terms that occur in a collection of bodies of text ("documents"). In a DTM, rows represent specific bodies of text, and columns represent keywords. In addition, the package provides a function for extracting the top words in each topic based on beta scores, which allows for easy visualization of the results.
The rest of this document walks through an example of topic modeling with the functions in the {TopicModelling} package, generating themes from a dataset of tweets about ballot harvesting, Trump's taxes, and SCOTUS, scraped from Twitter in 2020.
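To make the shape of a document-term matrix concrete, here is a minimal base R sketch on three made-up sentences. The toy documents and variable names are illustrative assumptions; they are not part of the tweet dataset or of {TopicModelling}.

```r
# Three toy "documents" (hypothetical data, for illustration only)
docs <- c(d1 = "tax returns tax", d2 = "court ruling", d3 = "court tax ruling ruling")

# Split each document into lowercase words, keeping the document names
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Long format: one row per (document, word) occurrence
long <- data.frame(
  document = rep(names(tokens), lengths(tokens)),
  word     = unlist(tokens, use.names = FALSE)
)

# Cross-tabulate: rows are documents, columns are terms, cells are counts
dtm <- unclass(table(long$document, long$word))
dtm
```

Each row sums to the number of words in that document, and most cells are zero, which is why real document-term matrices are usually stored in a sparse format.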
library(tidyverse)
library(tidytext)
library(topicmodels)
library(TopicModelling)

tweet_data <- tweets.2020 %>%
  select(user_id, status_id, screen_name, text, retweet_count, verified)
The example data set contains a lot of irrelevant information that adds noise, which we need to remove before starting any analysis.
First, the tweets contain hyperlinks, which are of no interest in this analysis. This step removes the URLs using the str_replace_all() function from {stringr}, which is part of {tidyverse}.
tweet_data$text <- str_replace_all(tweet_data$text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", "")
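It can help to check what the pattern above actually matches. The sketch below applies it to a made-up tweet with base R's gsub(). The sample text and the stricter alternative pattern are assumptions for illustration, not part of the original analysis.

```r
# A made-up tweet containing a shortened link (hypothetical example text)
tweet <- "Big news https://t.co/illvi9k4pg today"

# The pattern used above: the match ends at the last ".<letters>" it can find,
# so the part of the link after the domain is left behind
original_pattern <- " ?(f|ht)tp(s?)://(.*)[.][a-z]+"
after_original <- gsub(original_pattern, "", tweet, perl = TRUE)

# A stricter alternative (an assumption, not the authors' choice):
# remove everything from the scheme up to the next whitespace
strict_pattern <- " ?(f|ht)tps?://\\S+"
after_strict <- gsub(strict_pattern, "", tweet, perl = TRUE)

after_original  # "Big news/illvi9k4pg today"
after_strict    # "Big news today"
```

Leftover fragments such as a link path are later tokenized as ordinary words, which may be one reason to add such terms to a custom stop word list.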
Next, stop words such as "the", "would", and so forth add no value to our inferred topics. We remove them by building a custom dictionary of stop words and binding it to the existing list of stop words in {tidytext} (stop_words).
final_stop <- data.frame(
  word = c("said", "u", "0001f6a8", "illvi9k4pg", "https", "t.co", "video"),
  lexicon = "custom"
) %>%
  rbind(stop_words)
Finally, for LDA modeling, we need the data to be in a document-term matrix. To get there, we tokenize our text into individual words with unnest_tokens(), remove stop words (mostly prepositions and articles) using the anti_join() function, and count the number of times each word is used per block of text (status_id) using the count() function. We then use the cast_dtm() function from our {TopicModelling} package to create a document-term matrix with these specifications.
tweet_dtm <- tweet_data %>%
  unnest_tokens(word, text) %>%                 # tokenize the corpus
  anti_join(final_stop, by = "word") %>%        # remove the stop words
  count(status_id, word) %>%                    # count how often each word is used per status_id
  TopicModelling::cast_dtm(status_id, word, n)  # create a document-term matrix

tweet_dtm
Now we use the LDA() function from the {topicmodels} package to detect the underlying themes and build an LDA topic model. To demonstrate the model, we subjectively categorize our data into 8 topics (k = 8). We then need to choose the estimation method, either "VEM" or "Gibbs". Gibbs sampling is a Markov chain Monte Carlo algorithm that draws approximate samples from the posterior when direct sampling is not possible. VEM (variational expectation-maximization), by contrast, is not a sampling method: it is an iterative algorithm that maximizes a lower bound on the likelihood, which amounts to minimizing the Kullback-Leibler divergence between a simpler approximating distribution and the true posterior. The two methods may produce slightly different results; for this example we use the VEM method.
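To be able to replicate the topic model results in the future, it is important to fix the seed. To demonstrate, we set the seed to 2021 instead of letting it be generated at run time. For the VEM method, LDA() in {topicmodels} seeds its external estimation code through the seed element of the control argument rather than through R's global random number generator, so we pass the seed there in addition to calling set.seed().
The LDA() function may take a while to run. After getting the results, you can use save.image(file = "") to save all the objects in the workspace, including the topic model results, and load them next time.
set.seed(2021)
tweet_lda <- LDA(tweet_dtm, k = 8, method = "VEM", control = list(seed = 2021))

tweet_lda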
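For comparison, the sketch below fits a model with the Gibbs method on a tiny made-up count matrix and checks that a fixed seed gives the same result twice. The toy data, the choice of k = 2, and the burn-in and iteration values are arbitrary assumptions for illustration. The example also relies on LDA() accepting a plain integer count matrix in place of a document-term matrix.

```r
library(topicmodels)

# Toy count matrix: 6 documents x 6 terms (hypothetical data, illustration only)
counts <- matrix(
  c(4L, 3L, 2L, 0L, 0L, 0L,
    3L, 4L, 1L, 0L, 1L, 0L,
    5L, 2L, 3L, 1L, 0L, 0L,
    0L, 0L, 1L, 4L, 3L, 2L,
    0L, 1L, 0L, 3L, 4L, 3L,
    1L, 0L, 0L, 2L, 5L, 4L),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6),
                  c("tax", "return", "irs", "court", "justice", "senate"))
)

# Gibbs sampling: the seed, burn-in, and iteration count go in `control`
gibbs_control <- list(seed = 2021, burnin = 100, iter = 500)
fit_a <- LDA(counts, k = 2, method = "Gibbs", control = gibbs_control)
fit_b <- LDA(counts, k = 2, method = "Gibbs", control = gibbs_control)

# Same seed and same data: compare the top 3 terms per topic across the two fits
same_top_terms <- identical(terms(fit_a, 3), terms(fit_b, 3))
same_top_terms
```

On a real corpus, Gibbs sampling typically needs far more iterations than this, so the values above should be treated as placeholders.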
The results we get from topic modeling contain a wealth of information. The two most important quantities in this model are beta and gamma.
In a topic model, each topic is a distribution over words. Each word has a beta value for each topic, estimated by the LDA algorithm: the probability of that word being generated from that topic. The higher a word's beta, the more that word matters to the topic. In our example, a tweet that contains such a word is more likely to be assigned to the corresponding topic.
To extract the beta values, we use the tidy() function.
topics <- tidy(tweet_lda, matrix="beta")
Then, to demonstrate, we subjectively decide to focus on the top 10 words in each of the 8 topics. We use the split-apply-combine pattern: we group the words by topic (“split”), keep the top 10 words in each group (“apply”), and then use ungroup() to remove the grouping variable (“combine”). Here we use the top_n_terms() function from our package {TopicModelling} to take the 10 highest beta scores from each topic group.
top_terms <- topics %>%
  group_by(topic) %>%        # group by topic
  top_n_terms(10, beta) %>%  # keep the words with the top 10 beta scores
  ungroup()                  # remove the grouping
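Since top_n_terms() is specific to {TopicModelling}, the sketch below shows the same split-apply-combine idea with plain {dplyr} on a toy beta table. It assumes that top_n_terms() behaves like a per-group "keep the n largest" filter. That equivalence and the toy values are assumptions, not something taken from the package documentation.

```r
library(dplyr)

# Toy tidy beta table: two topics, four terms each (made-up values)
toy_topics <- data.frame(
  topic = rep(1:2, each = 4),
  term  = c("tax", "return", "irs", "court", "court", "justice", "senate", "tax"),
  beta  = c(0.40, 0.30, 0.20, 0.10, 0.45, 0.25, 0.20, 0.10)
)

toy_top_terms <- toy_topics %>%
  group_by(topic) %>%           # split: one group per topic
  slice_max(beta, n = 2) %>%    # apply: keep the 2 largest betas in each group
  ungroup()                     # combine: drop the grouping

toy_top_terms
```

The result has two rows per topic, ordered from the largest beta down within each topic.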
To visualize, we use {ggplot2} and plot the words in each topic against their associated beta values.
As shown below, the topics differ slightly from one another. The words, ranked by beta, give interpreters a sense of what each topic is about. However, when topics share many of the same words, this might suggest that they can be combined as we interpret them.
This visualization shows us the most meaningful words in each topic.
top_terms %>%
  ggplot(aes(x = reorder_within(term, beta, topic, sep = "_"), y = beta, fill = factor(topic))) +
  geom_bar(stat = 'identity', show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_discrete(labels = function(x) gsub("_.+$", "", x)) +
  coord_flip()
After identifying possible topics, we move on to classifying the documents. For this, we need to look at the gamma matrix.
As with beta, we extract gamma values using the tidy() function.
In this data frame, each row has the document ID (here, the tweet's status_id), the topic, and the gamma value. Gamma is the estimated proportion of a document's words that are generated from a given topic, so a higher gamma suggests that the document's content is predominantly about that topic rather than another.
topics_doc <- tidy(tweet_lda, matrix="gamma")
Then, to assign a topic to each document, we choose the topic with the highest gamma for each document.
To do so, we use the slice_max() function from {dplyr}, which keeps the rows with the largest gamma (per document).
toptopics <- topics_doc %>%
  group_by(document) %>%
  slice_max(gamma)

colnames(toptopics)[1] <- "status_id"
colnames(toptopics)[2] <- "topics"
toptopics$status_id <- as.numeric(toptopics$status_id)
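One detail worth knowing about slice_max() is how it treats ties. The sketch below uses a made-up gamma table in which one document is split evenly between two topics. The data are hypothetical, and exact ties are likely to be rare in a real model, but they would produce duplicate rows for a document.

```r
library(dplyr)

# Toy gamma table: three documents, two topics (made-up values)
toy_gamma <- data.frame(
  document = rep(c("101", "102", "103"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.9, 0.1, 0.3, 0.7, 0.5, 0.5)
)

# Default behaviour keeps ties, so document "103" appears twice
with_ties <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma) %>%
  ungroup()

# with_ties = FALSE keeps exactly one row per document
one_each <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)  # 4
nrow(one_each)   # 3
```

Whether to break ties arbitrarily or to keep both topics is a modeling choice; the point here is only that the default can return more than one row per document.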
Next, we join this data with the full tweet data using full_join(). For the join to work, the document ID must match status_id in the tweet data frame, in both value and type.
In the model output, this ID is stored as a character string, so we convert it back to numeric, which is how status_id is stored in tweet_data.
tweet_data2 <- full_join(tweet_data, toptopics, by = "status_id")
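One caveat when converting long IDs to numeric: R stores numerics as double-precision values, which represent integers exactly only up to 2^53. The sketch below uses two made-up 19-digit IDs to show how distinct IDs can collapse to the same number. Keeping the key as character on both sides of the join is one way to avoid this. The IDs and the suggestion are illustrative assumptions, since the example data stores status_id as numeric.

```r
# Two made-up 19-digit IDs that differ only in the last digit
id_a <- "1311111111111111111"
id_b <- "1311111111111111112"

# As character strings they are distinct
same_as_text <- id_a == id_b

# As doubles they collapse to the same value, because integers this large
# exceed the 2^53 limit for exact representation
same_as_number <- as.numeric(id_a) == as.numeric(id_b)

same_as_text    # FALSE
same_as_number  # TRUE
```

If the IDs in a dataset are available as character strings, joining on those strings sidesteps the issue entirely.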
Finally, we plot this information to see which topics have been assigned to the most tweets.
table(tweet_data2$topics) %>%
  as.data.frame() %>%
  ggplot(aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity")