To fit topic models with keyATM, users need to prepare texts for keyATM_read(). keyATM_read() accepts texts in three formats. If you want to check an example with a built-in dataset, please proceed to the next section.
keyATM can read a dfm object created by the quanteda package. Since keyATM does not provide preprocessing functions, we recommend that users preprocess documents with quanteda. Please visit quanteda Quick Start: Creating a Corpus to see how to create a corpus object.
By making a tokens object from a corpus object, quanteda can apply various preprocessing steps (quanteda Quick Start: Tokenizing texts).
Lastly, you can make a dfm object with the dfm() function in quanteda.
Here is an example of preprocessing steps with quanteda. In this example, we read texts with the readtext package.
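library(quanteda)
library(readtext)
library(magrittr)  # provides %>% (needed if your quanteda version does not export the pipe)

# Read text files; replace PATH_TO_THE_FOLDER with your own path
raw_docs <- readtext("PATH_TO_THE_FOLDER/*.txt", encoding = "UTF-8")

# Build a corpus, then tokenize and clean the texts
key_corpus <- corpus(raw_docs, text_field = "text")
key_tokens <- tokens(key_corpus,
                     remove_numbers = TRUE,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_separators = TRUE,
                     remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_select(min_nchar = 3)

# Create a dfm object and drop rare terms
key_dfm <- dfm(key_tokens) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 2)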
You can also pass c(stopwords("english"), "your", "stopwords") to tokens_remove() if you want to add corpus-specific stopwords.
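As a minimal sketch of adding corpus-specific stopwords, the example below runs the same quanteda steps on two toy documents. The documents and the extra stopwords ("congress", "committee") are made up for illustration; they are not taken from any real corpus.

```r
library(quanteda)

# Toy documents standing in for a real corpus (hypothetical data)
toy_docs <- c(
  d1 = "The committee of the congress passed the education bill",
  d2 = "The congress debated a health bill for the committee"
)

# Built-in English stopwords plus corpus-specific ones (hypothetical choices)
custom_stopwords <- c(stopwords("english"), "congress", "committee")

toy_tokens <- tokens(toy_docs, remove_punct = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)
toy_tokens <- tokens_remove(toy_tokens, custom_stopwords)

# The remaining features no longer include the custom stopwords
toy_dfm <- dfm(toy_tokens)
featnames(toy_dfm)
```

Inspecting featnames() after each change is a quick way to confirm that the terms you meant to drop are actually gone.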
keyATM_read() can read a data.frame or a tibble if you preprocess texts without quanteda. Please store texts in a column named text.
> head(docs) # `docs` stores preprocessed texts
# A tibble: 6 x 1
  text
  <chr>
1 h.r h.r one hundred first congress congress congress united u...
2 first congress one congress congress united united state stae...
3 one one one one one one one one one one one one one one one o...
4 h.r h.r one one one hundred hundred first first congress cong...
5 congress congress one united united united united united unit...
6 h.r h.r one one one one one hundred hundred first congress co...
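A minimal sketch of the data.frame route, assuming texts that are already lowercased and space-separated. The three documents are hypothetical, and the call to keyATM_read() is guarded so that the snippet still runs when keyATM is not installed.

```r
# Hypothetical preprocessed texts, one document per row
docs <- data.frame(
  text = c(
    "congress united state bill education",
    "court law attorney state court",
    "health program public drug treatment"
  ),
  stringsAsFactors = FALSE
)

# The texts must live in a character column named `text`
stopifnot("text" %in% names(docs), is.character(docs$text))

# Read the documents into keyATM (assumes the keyATM package is installed)
if (requireNamespace("keyATM", quietly = TRUE)) {
  keyATM_docs <- keyATM::keyATM_read(docs)
}
```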
If you have preprocessed text files, you can pass a vector of file paths to keyATM_read().
textfiles <- list.files(PATH_TO_THE_FOLDER, pattern = "*.txt", full.names = TRUE)
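The sketch below shows the whole file-based route on two toy files written to a temporary folder. The folder name and file contents are hypothetical. Note that list.files() treats pattern as a regular expression rather than a shell glob, so this sketch uses "\\.txt$" to match file names ending in ".txt".

```r
# Write two toy documents to a temporary folder (hypothetical data)
doc_dir <- file.path(tempdir(), "keyatm_docs")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("congress united state bill education", file.path(doc_dir, "doc1.txt"))
writeLines("court law attorney state court", file.path(doc_dir, "doc2.txt"))

# Collect the full paths of all .txt files in the folder
textfiles <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)
basename(textfiles)

# Pass the paths to keyATM (assumes the keyATM package is installed)
if (requireNamespace("keyATM", quietly = TRUE)) {
  keyATM_docs <- keyATM::keyATM_read(textfiles)
}
```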
Before fitting the model, we need to read the texts into keyATM and prepare keywords. Here we use a quanteda dfm.
The keyATM package includes a tidy dataset for illustration purposes. In this example, we use ***. You can load this data with data(keyATM_data_bills).
library(quanteda)
library(keyATM)
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm # quanteda object
keyATM_docs <- keyATM_read(bills_dfm)
Keywords should be stored in a list. Each element in the list is a character vector, which corresponds to a single keyword-topic. The number of keywords can vary between topics. For example, we prepare four keyword-topics.
bills_keywords <- list(
  Education = c("education", "child", "student"),
  Law       = c("court", "law", "attorney"),
  Health    = c("public", "health", "program"),
  Drug      = c("drug", "treatment")
)
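Before passing the list above to keyATM, it can help to confirm that it has the expected shape. The checks below are a base R sketch of that idea, not something keyATM requires you to run.

```r
bills_keywords <- list(
  Education = c("education", "child", "student"),
  Law       = c("court", "law", "attorney"),
  Health    = c("public", "health", "program"),
  Drug      = c("drug", "treatment")
)

# Every element should be a character vector, and every topic should have a name
stopifnot(
  is.list(bills_keywords),
  !is.null(names(bills_keywords)),
  all(nzchar(names(bills_keywords))),
  all(vapply(bills_keywords, is.character, logical(1)))
)

# Number of keywords per topic; the counts may differ between topics
lengths(bills_keywords)
```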
visualize_keywords(keyATM_docs, bills_keywords)
The figure helps you check the frequency of keywords. Including low-frequency keywords does not help the model in general. keyATM automatically prunes keywords that do not appear in the documents.
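To make the idea of keyword frequency and pruning concrete, here is a base R sketch on toy documents. It is not keyATM's internal implementation; the documents, the keyword list, and the names keyword_share and kept are all hypothetical.

```r
# Toy preprocessed documents (hypothetical data)
toy_docs <- c(
  "education student school education",
  "court law court",
  "health program drug"
)
toy_tokens <- unlist(strsplit(toy_docs, " ", fixed = TRUE))

toy_keywords <- list(
  Education = c("education", "child", "student"),
  Drug      = c("drug", "treatment")
)

# Share of all tokens accounted for by each keyword
keyword_share <- lapply(toy_keywords, function(words) {
  vapply(words, function(w) mean(toy_tokens == w), numeric(1))
})
keyword_share

# Keep only keywords that occur at least once in the documents
kept <- lapply(keyword_share, function(share) names(share)[share > 0])
kept
```

Here "child" and "treatment" never occur in the toy documents, so they are dropped from kept, which mirrors the kind of pruning described above.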
Now you have texts and keywords! The next step is to fit a model with keyATM(). keyATM has three models:
You can find details in the FAQ.