summarize_topics: Summarize a topic model consistently across methods/functions
In tidylda: Latent Dirichlet Allocation Using 'tidyverse' Conventions

summarize_topics

R Documentation

Summarize a topic model consistently across methods/functions

Description

Summarizes topics in a model. Called by tidylda and refit.tidylda and used to augment print.tidylda.

Usage

summarize_topics(theta, beta, dtm)

Arguments

`theta`	numeric matrix whose rows represent P(topic|document)
`beta`	numeric matrix whose rows represent P(token|topic)
`dtm`	a document term matrix or term co-occurrence matrix of class `dgCMatrix`.

Value

Returns a tibble with the following columns:

`topic`	the integer row number of `beta`
`prevalence`	the frequency of each topic throughout the corpus the model was trained on, normalized so that it sums to 100
`coherence`	the result of a call to calc_prob_coherence using the default 5 most-probable terms in each topic
`top_terms`	the top 5 most-probable terms in each topic
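As a rough illustration of what the top_terms column holds, the sketch below picks the 5 most-probable tokens from each row of a toy `beta`. The matrix values, the `n_top` name, and the comma-separated formatting are assumptions made for this example; the exact formatting used by summarize_topics may differ.

```r
# Toy beta: 2 topics x 6 tokens, each row sums to 1 (hypothetical values)
beta <- matrix(
  c(0.30, 0.25, 0.20, 0.12, 0.08, 0.05,
    0.05, 0.10, 0.15, 0.20, 0.22, 0.28),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t1", "t2"), paste0("w", 1:6))
)

n_top <- 5  # number of terms to keep per topic (assumed name)

# For each topic (row), sort tokens by probability and keep the first n_top names
top_terms <- apply(beta, 1, function(p) {
  paste(names(sort(p, decreasing = TRUE))[seq_len(n_top)], collapse = ", ")
})

print(top_terms)
```

Each element of `top_terms` is a single string per topic, which keeps the result one row per topic when placed in a tibble column.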

Note

prevalence should be proportional to P(topic). It is calculated by weighting each document's topic distribution by that document's length, so topics prevalent in longer documents get more weight than topics prevalent in shorter documents. The calculation is

prevalence <- (rowSums(dtm) * theta) %>% colSums()

prevalence <- (prevalence / sum(prevalence) * 100) %>% round(3)
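The document-length weighting can be checked by hand on a small case. The following is a minimal sketch, not the package's internal code: the toy `dtm` and `theta` values are made up, and base R calls are used in place of the pipe so that operator precedence is explicit.

```r
library(Matrix)

# Toy document term matrix: 3 documents x 4 tokens (hypothetical counts)
dtm <- Matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 1,
    1, 1, 1, 1),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("d", 1:3), paste0("w", 1:4))
)

# Toy theta: 3 documents x 2 topics, each row sums to 1 (hypothetical values)
theta <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.7, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("d", 1:3), c("t1", "t2"))
)

# Document lengths: 3, 4, 4 tokens
doc_lengths <- rowSums(dtm)

# Recycling scales row i of theta by doc_lengths[i]; colSums then totals per topic
prevalence <- colSums(doc_lengths * theta)

# Normalize to sum to 100 and round
prevalence <- round(prevalence / sum(prevalence) * 100, 3)

print(prevalence)                  # t1 = 57.273, t2 = 42.727
print(colMeans(theta) * 100)       # unweighted comparison: t1 = 60, t2 = 40
```

The unweighted column means give 60 and 40, while the length-weighted version gives 57.273 and 42.727, because the shortest document (d1) is the one most dominated by topic t1.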

An alternative calculation (not implemented here) might have been

prevalence <- (colSums(dtm) * t(beta)) %>% colSums()

prevalence <- (prevalence / sum(prevalence) * 100) %>% round(3)
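For comparison, the token-frequency weighting described as the alternative can be sketched the same way. This is only an illustration of the formula in this note with made-up `dtm` and `beta` values; it is not something summarize_topics computes.

```r
library(Matrix)

# Toy document term matrix: 3 documents x 4 tokens (hypothetical counts)
dtm <- Matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 1,
    1, 1, 1, 1),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("d", 1:3), paste0("w", 1:4))
)

# Toy beta: 2 topics x 4 tokens, each row sums to 1 (hypothetical values)
beta <- matrix(
  c(0.4, 0.1, 0.4, 0.1,
    0.1, 0.5, 0.1, 0.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t1", "t2"), paste0("w", 1:4))
)

# Corpus-wide token counts: 3, 4, 2, 2
token_counts <- colSums(dtm)

# t(beta) is tokens x topics; recycling scales row v by token_counts[v]
alt_prevalence <- colSums(token_counts * t(beta))

# Normalize to sum to 100 and round
alt_prevalence <- round(alt_prevalence / sum(alt_prevalence) * 100, 3)

print(alt_prevalence)  # t1 = 45.614, t2 = 54.386
```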

tidylda documentation built on May 29, 2024, 11:03 a.m.