summarize_topics: Summarize a topic model consistently across methods/functions

View source: R/utils.R

summarize_topics    R Documentation

Summarize a topic model consistently across methods/functions

Description

Summarizes the topics in a model. Called by tidylda and refit.tidylda; its output is used to augment print.tidylda.

Usage

summarize_topics(theta, beta, dtm)

Arguments

theta

numeric matrix whose rows represent P(topic|document)

beta

numeric matrix whose rows represent P(token|topic)

dtm

a document term matrix or term co-occurrence matrix of class dgCMatrix.

Value

Returns a tibble with the following columns:

topic is the integer row number of beta.

prevalence is the frequency of each topic throughout the corpus the model was trained on, normalized so that the values sum to 100.

coherence is the result of a call to calc_prob_coherence using its default of the 5 most-probable terms in each topic.

top_terms displays the 5 most-probable terms in each topic.
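
As a rough illustration of how a top_terms column of this kind can be derived from beta, the sketch below ranks the tokens of each topic by probability and keeps the first 5. It is a minimal base R sketch, not the package's own code: the toy beta, the token names, and the helper names n_top and summary_tbl are made up for the example, and a plain data.frame stands in for the tibble.

```r
# Toy beta: 2 topics (rows) x 6 tokens (columns); each row sums to 1.
# The values and token names are hypothetical.
beta <- matrix(
  c(0.40, 0.25, 0.15, 0.10, 0.06, 0.04,
    0.04, 0.06, 0.10, 0.20, 0.25, 0.35),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("cat", "dog", "fish", "tax", "law", "vote"))
)

n_top <- 5  # mirrors the 5 most-probable terms described above

# For each topic (row), sort tokens by descending probability,
# keep the first n_top names, and collapse them into one string.
top_terms <- apply(beta, 1, function(p) {
  paste(names(sort(p, decreasing = TRUE))[seq_len(n_top)], collapse = ", ")
})

# A data.frame stands in for the tibble returned by the real function.
summary_tbl <- data.frame(
  topic = seq_len(nrow(beta)),
  top_terms = top_terms,
  stringsAsFactors = FALSE
)
print(summary_tbl)
```

Each row of summary_tbl corresponds to one row of beta, so the topic column is simply the row index.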

Note

prevalence should be proportional to P(topic). It is calculated by weighting each document's topic distribution by the document's length, so topics prevalent in longer documents get more weight than topics prevalent in shorter documents. The calculation is

prevalence <- (rowSums(dtm) * theta) %>% colSums()

prevalence <- prevalence / sum(prevalence)

prevalence <- (prevalence * 100) %>% round(3)

An alternative calculation (not implemented here) might have been

prevalence <- (colSums(dtm) * t(beta)) %>% colSums()

prevalence <- prevalence / sum(prevalence)

prevalence <- (prevalence * 100) %>% round(3)
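
To make the effect of document-length weighting concrete, here is a small worked sketch in base R. It assumes a toy corpus of 3 documents, 4 tokens and 2 topics; all values are invented for illustration. It uses a base matrix in place of a dgCMatrix and nested calls in place of %>%, so it is not the package's own code. Because %>% binds more tightly than *, the product is computed first and colSums() is applied to the result.

```r
# Toy dtm: 3 documents x 4 tokens. Document 1 is ten times longer than the others.
# A base matrix is used here instead of a dgCMatrix (an assumption for this sketch).
dtm <- matrix(
  c(40, 30, 20, 10,   # document 1: 100 tokens
     1,  1,  4,  4,   # document 2:  10 tokens
     0,  2,  3,  5),  # document 3:  10 tokens
  nrow = 3, byrow = TRUE
)

# theta: P(topic|document), 3 documents x 2 topics; rows sum to 1.
theta <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.1, 0.9),
  nrow = 3, byrow = TRUE
)

# beta: P(token|topic), 2 topics x 4 tokens; rows sum to 1.
beta <- matrix(
  c(0.4, 0.3, 0.2, 0.1,
    0.1, 0.2, 0.3, 0.4),
  nrow = 2, byrow = TRUE
)

# Length-weighted prevalence: scale each document's row of theta by the
# document's token count, sum over documents, then normalize to 100.
prevalence <- colSums(rowSums(dtm) * theta)
prevalence <- round(prevalence / sum(prevalence) * 100, 3)

# Unweighted average of theta, for comparison only.
unweighted <- round(colMeans(theta) * 100, 3)

# The alternative calculation: scale each token's row of t(beta) by the
# token's corpus count, sum over tokens, then normalize to 100.
alternative <- colSums(colSums(dtm) * t(beta))
alternative <- round(alternative / sum(alternative) * 100, 3)

print(rbind(prevalence, unweighted, alternative))
```

In this toy corpus topic 1 dominates only the long document, so the length-weighted prevalence puts it at 77.5 while the unweighted average of theta puts it at 40.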


tidylda documentation built on July 26, 2023, 5:34 p.m.