initialize_topic_counts: Initialize topic counts for Gibbs sampling

View source: R/utils.R

initialize_topic_counts    R Documentation

Initialize topic counts for Gibbs sampling

Description

Seeded (or guided) LDA models and transfer learning cannot initialize topics from a uniform-random start. This function prepares the data and then calls a C++ function, create_lexicon, which runs a single Gibbs iteration to populate the topic counts (and other objects) used during the main Gibbs sampling run of fit_lda_c. If you are not using seeding or transfer learning, the function makes a random initialization by sampling from Dirichlet distributions parameterized by the priors alpha and eta.
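The random initialization mentioned above can be sketched in base R. This is an illustration under stated assumptions, not tidylda's internal code: base R has no Dirichlet sampler, so the hypothetical helper rdirichlet_rows normalizes gamma draws, and the names Nd, Nv, theta, and beta, as well as the prior values, are invented for the example.

```r
# Illustrative sketch only; not tidylda's internal implementation.
# rdirichlet_rows() is a hypothetical helper: each row of `conc` holds the
# concentration parameters for one Dirichlet draw.
rdirichlet_rows <- function(conc) {
  g <- matrix(
    rgamma(length(conc), shape = as.vector(conc), rate = 1),
    nrow = nrow(conc),
    ncol = ncol(conc)
  )
  g / rowSums(g) # normalizing independent gamma draws gives a Dirichlet draw
}

set.seed(42)
k <- 3   # number of topics
Nd <- 4  # number of documents (assumed for the example)
Nv <- 5  # vocabulary size (assumed for the example)

alpha <- rep(0.5, k)                     # prior for topics over documents
eta <- matrix(0.1, nrow = k, ncol = Nv)  # prior for tokens over topics

# theta: one distribution over topics per document (Nd x k)
theta <- rdirichlet_rows(matrix(alpha, nrow = Nd, ncol = k, byrow = TRUE))
# beta: one distribution over tokens per topic (k x Nv)
beta <- rdirichlet_rows(eta)
```

Each row of theta and beta sums to one, which is the property that matrices passed as theta_initial and beta_initial would be expected to share.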

Usage

initialize_topic_counts(
  dtm,
  k,
  alpha,
  eta,
  beta_initial = NULL,
  theta_initial = NULL,
  freeze_topics = FALSE,
  threads = 1,
  ...
)

Arguments

dtm

a document-term matrix or term co-occurrence matrix of class dgCMatrix.

k

the number of topics

alpha

the numeric vector prior for topics over documents as formatted by format_alpha

eta

the numeric matrix prior for tokens over topics as formatted by format_eta

beta_initial

if specified, a numeric matrix for the probability of tokens in topics. Must be specified for predictions or updates as called by predict.tidylda or refit.tidylda respectively.

theta_initial

if specified, a numeric matrix for the probability of topics in documents. Must be specified for updates as called by refit.tidylda

freeze_topics

if TRUE, counts of tokens in topics are not updated. This is set to TRUE for predictions.

threads

number of parallel threads, currently unused

...

Additional arguments, currently unused
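As a rough sketch of what the main inputs might look like, the snippet below builds a toy dgCMatrix with the Matrix package and matching priors. The corpus values are invented, and the shapes used for alpha (length k) and eta (k rows by ncol(dtm) columns) are assumptions about what format_alpha and format_eta produce, not something this page states.

```r
library(Matrix)

# Toy corpus: 3 documents, 4 vocabulary terms (all values invented).
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),  # document (row) indices, 1-based
  j = c(1, 3, 2, 3, 4),  # term (column) indices, 1-based
  x = c(2, 1, 1, 3, 2),  # token counts
  dims = c(3, 4),
  dimnames = list(paste0("doc", 1:3), c("apple", "bank", "cat", "dog"))
)

k <- 2
alpha <- rep(0.1, k)  # assumed shape: numeric vector of length k
eta <- matrix(        # assumed shape: k x ncol(dtm)
  0.05,
  nrow = k,
  ncol = ncol(dtm),
  dimnames = list(NULL, colnames(dtm))
)

class(dtm)
```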

Value

Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are used by fit_lda_c.

docs is a list with one element per document. Each element is a vector of integers of length sum(dtm[j,]) for the j-th document. The integer entries are the zero-indexed column indices of dtm.

Zd is a list with the same format as docs, except that the integer values are zero-indexed topic indices.

Cd is a matrix of integers denoting how many times each topic has been sampled in each document.

Cv is similar to Cd, but counts how many times each topic has been sampled for each token.

Ck is an integer vector denoting how many times each topic has been sampled overall.
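To make the layout of docs and Zd concrete, here is a minimal base-R sketch. It uses a dense toy matrix rather than a dgCMatrix, and it draws topic assignments uniformly at random purely as a stand-in; the package obtains them from create_lexicon.

```r
# Toy counts as a dense base-R matrix (the real function expects a dgCMatrix).
dtm <- matrix(
  c(2, 0, 1, 0,
    0, 1, 3, 0,
    0, 0, 0, 2),
  nrow = 3, byrow = TRUE
)
k <- 2

# docs: for each document, repeat each zero-based column index by its count.
docs <- lapply(seq_len(nrow(dtm)), function(j) {
  rep(seq_len(ncol(dtm)) - 1L, times = dtm[j, ])
})

# Zd: one zero-based topic index per token. Uniform draws are a stand-in here,
# not how the package assigns topics.
set.seed(1)
Zd <- lapply(docs, function(d) sample.int(k, length(d), replace = TRUE) - 1L)

docs[[1]] # 0 0 2
```

The first document has two tokens in column 1 and one in column 3, so its entry in docs holds the zero-based indices 0, 0, 2.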

Note

Cd, Cv, and Ck should all be derivable by tabulating Zd: per document for Cd, per token (using docs) for Cv, and over the whole corpus for Ck.
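The relationship in this note can be checked with a small base-R sketch. The docs and Zd values below are hand-written toy data, and the matrix orientations (documents in rows for Cd, topics in rows for Cv) are assumptions made for the example rather than a statement about what the function returns.

```r
k <- 2   # number of topics
Nv <- 4  # vocabulary size

# Hand-written toy values, zero-indexed as in the returned list.
docs <- list(c(0L, 0L, 2L), c(1L, 2L, 2L, 2L), c(3L, 3L))
Zd   <- list(c(0L, 1L, 0L), c(1L, 1L, 0L, 1L), c(0L, 0L))

# Cd: topic counts per document (documents in rows; orientation assumed).
Cd <- t(vapply(Zd, function(z) tabulate(z + 1L, nbins = k), integer(k)))

# Cv: topic counts per token (topics in rows; orientation assumed).
Cv <- matrix(0L, nrow = k, ncol = Nv)
for (d in seq_along(docs)) {
  for (n in seq_along(docs[[d]])) {
    z <- Zd[[d]][n] + 1L   # shift to R's 1-based indexing
    v <- docs[[d]][n] + 1L
    Cv[z, v] <- Cv[z, v] + 1L
  }
}

# Ck: overall topic counts.
Ck <- tabulate(unlist(Zd) + 1L, nbins = k)

# Consistency checks: every view of the counts agrees on the topic totals.
stopifnot(all(colSums(Cd) == Ck), all(rowSums(Cv) == Ck))
```

With these toy values Ck is 5 and 4, and both the column sums of Cd and the row sums of Cv reproduce it.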


tidylda documentation built on July 26, 2023, 5:34 p.m.