stm_tidiers: Tidiers for Structural Topic Models from the stm package
In tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

stm_tidiers

R Documentation

Tidiers for Structural Topic Models from the stm package

Description

Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers().

Usage

## S3 method for class 'STM'
tidy(
  x,
  matrix = c("beta", "gamma", "theta", "frex", "lift"),
  log = FALSE,
  document_names = NULL,
  ...
)

## S3 method for class 'estimateEffect'
tidy(x, ...)

## S3 method for class 'estimateEffect'
glance(x, ...)

## S3 method for class 'STM'
augment(x, data, ...)

## S3 method for class 'STM'
glance(x, ...)

Arguments

`x`	An STM fitted model object from either `stm::stm()` or `stm::estimateEffect()`
`matrix`	Which matrix to tidy: the beta matrix (per-term-per-topic, default) the gamma/theta matrix (per-document-per-topic); the stm package calls this the theta matrix, but other topic modeling packages call this gamma the FREX matrix, for words with high frequency and exclusivity the lift matrix, for words with high lift
`log`	Whether beta/gamma/theta should be on a log scale, default FALSE
`document_names`	Optional vector of document names for use with per-document-per-topic tidying
`...`	Extra arguments for tidying, such as `w` as used in `stm::calcfrex()`
`data`	For `augment`, the data given to the stm function, either as a `dfm` from quanteda or as a tidied table with "document" and "term" columns

Value

tidy returns a tidied version of either the beta, gamma, FREX, or lift matrix if called on an object from stm::stm(), or a tidied version of the estimated regressions if called on an object from stm::estimateEffect().

glance returns a tibble with exactly one row of model summaries.

augment must be provided a data argument, either a dfm from quanteda or a table containing one row per original document-term pair, such as is returned by tdm_tidiers, containing columns document and term. It returns that same data with an additional column .topic with the topic assignment for that document-term combination.

Examples


library(dplyr)
library(ggplot2)
library(stm)
library(janeaustenr)

austen_sparse <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(book, word) %>%
    cast_sparse(book, word, n)
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# Examine the topics
td_beta %>%
    group_by(topic) %>%
    slice_max(beta, n = 10) %>%
    ungroup() %>%
    ggplot(aes(beta, term)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free")

# high FREX words per topic
tidy(topic_model, matrix = "frex")

# high lift words per topic
tidy(topic_model, matrix = "lift")

# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = rownames(austen_sparse))
td_gamma

# using stm's gardarianFit, we can tidy the result of a model
# estimated with covariates
effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
glance(effects)
td_estimate <- tidy(effects)
td_estimate

tidytext documentation built on May 29, 2024, 5:42 a.m.