```r
knitr::opts_chunk$set(
  echo = FALSE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  cache = FALSE,
  comment = "#>"
)
library(topicsplorrr)
library(dplyr)
library(lubridate)
library(extrafont)

loadfonts()

# use quanteda's inaugural presidential speeches as sample data
sample_docs <- quanteda::convert(quanteda::data_corpus_inaugural,
                                 to = "data.frame") %>%
  mutate(Year = lubridate::as_date(paste(Year, "-01-20", sep = "")))
```

"Naive" topic analysis

The package primarily provides functions to analyze and visualize the results of an stm topic model, but it also includes functions to explore simple term trends in a text corpus. The latter can be thought of as a first "naive" topical analysis and is a useful initial exploration of a text collection.
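
As background, here is a hedged sketch of how such an stm topic model could be fitted for the sample data used in this vignette; the `stm` calls are standard, but the preprocessing defaults and `K = 10` are illustrative assumptions, not this package's settings:

```r
library(stm)

# prepare the sample documents with stm's default preprocessing
# (stop word removal, number removal, stemming)
processed <- textProcessor(documents = sample_docs$text,
                           metadata = sample_docs)
prepped <- prepDocuments(processed$documents, processed$vocab,
                         processed$meta)

# fit a small topic model; results like this are what the package's
# stm-related functions are designed to explore
sample_stm <- stm(documents = prepped$documents,
                  vocab = prepped$vocab,
                  data = prepped$meta,
                  K = 10, verbose = FALSE)
```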

All examples in the following sections are based on a sample dataset from the quanteda package; the corresponding topic model data has been added as package datasets.

For the following analyses, common English stop words and numbers were removed from the source texts, and words were "stemmed"^[I.e. all words are reduced to their word stem; words like "model", "models", "modelling" and "modeling" are transformed to the word stem "model"].
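
For illustration, roughly equivalent preprocessing expressed with quanteda (an assumption about the steps involved, not the package's actual internals):

```r
library(quanteda)

sample_tokens <- tokens(data_corpus_inaugural,
                        remove_punct = TRUE,
                        remove_numbers = TRUE) %>%  # drop numbers
  tokens_remove(stopwords("en")) %>%  # drop common English stop words
  tokens_wordstem()                   # reduce words to their stems
```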

The following table utilizes the package's `term_counts()` function to list the most frequent terms in the sample documents.

```r
library(topicsplorrr)
library(dplyr)

# use quanteda's inaugural presidential speeches as sample data
sample_docs <- quanteda::convert(quanteda::data_corpus_inaugural,
                                 to = "data.frame") %>%
  mutate(Year = lubridate::as_date(paste(Year, "-01-20", sep = "")))

# extract unigrams
processed_terms <- unigrams_by_date(textData = sample_docs, 
                                    textColumn = "text", 
                                    dateColumn = "Year")

# and compute term shares (as percentages) for the ten most frequent terms
top_terms <- term_counts(processed_terms) %>%
  slice(1:10) %>%
  mutate(term_share = term_share * 100)

top_terms %>%
  kableExtra::kbl(format = "html",
                  caption = "Most frequent terms in the sample documents",
                  col.names = c("Term", "N", "%"),
                  digits = c(0, 0, 2))
```

## Term trends

The variation of term shares over time can be the next step in exploring the basic "topical" trends in a text collection. The following code extracts the bigram term counts in the sample documents and then plots the temporal frequencies for the 25 most frequent bigrams.

The code below utilizes the package's `plot_term_frequencies()` function. A linear trend line is added as a simple visual aid to distinguish terms with increasing, decreasing or stable trends. The example shows yearly shares of bigrams and demonstrates some additional options, such as specifying additional stop words (here two of the most frequent terms in the sample documents).

```r
library(topicsplorrr)

terms_by_date(textData = sample_docs,
              textColumn = "text",
              dateColumn = "Year",
              customStopwords = c("america", "government"),
              tokenType = "bigram") %>%
  plot_term_frequencies(timeBinUnit = "year", topN = 25, nCols = 5,
                        minTermTimeBins = 0)
```
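
For comparison, the same plot can be produced for unigrams with the `unigrams_by_date()` helper shown earlier (a sketch under the assumption that both token types share the same downstream API):

```r
# plot yearly shares of the 25 most frequent unigrams instead of bigrams
unigrams_by_date(textData = sample_docs,
                 textColumn = "text",
                 dateColumn = "Year") %>%
  plot_term_frequencies(timeBinUnit = "year", topN = 25, nCols = 5,
                        minTermTimeBins = 0)
```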

