initialize_topic_counts: Initialize topic counts for Gibbs sampling

View source: R/utils.R



Implementing seeded (or guided) LDA models and transfer learning means that topics can't be initialized with a uniform-random start. This function prepares the data and then calls a C++ function, create_lexicon, which runs a single Gibbs iteration to populate the topic counts (and other objects) used during the main Gibbs sampling run of fit_lda_c. If you aren't using seeding or transfer learning, it produces a random initialization by sampling from Dirichlet distributions parameterized by the priors alpha and eta.
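The random initialization case can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: rdirichlet_sketch is a hypothetical helper, the dimensions and prior values are toy values, and the names theta_initial and beta_initial are borrowed from the arguments only to show the shapes involved.

```r
# Hypothetical helper (not part of the package): draw n samples from a
# Dirichlet distribution by normalizing independent gamma draws.
rdirichlet_sketch <- function(n, conc) {
  g <- matrix(
    rgamma(n * length(conc), shape = rep(conc, each = n)),
    nrow = n
  )
  g / rowSums(g)  # each row now sums to 1
}

set.seed(42)

# Toy sizes, assumed for illustration
k <- 3         # number of topics
n_docs <- 4    # number of documents
n_tokens <- 5  # vocabulary size

alpha <- rep(0.1, k)                             # prior for topics over documents
eta <- matrix(0.05, nrow = k, ncol = n_tokens)   # prior for tokens over topics

# One distribution over topics per document (n_docs x k)
theta_initial <- rdirichlet_sketch(n_docs, alpha)

# One distribution over tokens per topic (k x n_tokens)
beta_initial <- t(apply(eta, 1, function(e) rdirichlet_sketch(1, e)[1, ]))
```

Each row of theta_initial and beta_initial is a probability vector, which is the kind of starting point a single Gibbs iteration could then turn into topic counts.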


initialize_topic_counts(
  dtm,
  k,
  alpha,
  eta,
  beta_initial = NULL,
  theta_initial = NULL,
  freeze_topics = FALSE,
  threads = 1,
  ...
)



dtm: a document term matrix or term co-occurrence matrix of class dgCMatrix.


k: the number of topics.


alpha: the numeric vector prior for topics over documents, as formatted by format_alpha.


eta: the numeric matrix prior for tokens over topics, as formatted by format_eta.


beta_initial: if specified, a numeric matrix for the probability of tokens in topics. Must be specified for predictions or updates, as called by predict.tidylda or refit.tidylda, respectively.


theta_initial: if specified, a numeric matrix for the probability of topics in documents. Must be specified for updates, as called by refit.tidylda.


freeze_topics: if TRUE, the counts of tokens in topics are not updated. This is TRUE for predictions.


threads: number of parallel threads, currently unused.


...: additional arguments, currently unused.


Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are used by fit_lda_c.

docs is a list with one element per document. Each element is an integer vector of length sum(dtm[j,]) for the j-th document. Each integer is the zero-indexed column of dtm for one token occurrence.

Zd is a list in the same format as docs, except that its integer values are zero-indexed topics.

Cd is a matrix of integers denoting how many times each topic has been sampled in each document.

Cv is similar to Cd, but it counts how many times each topic has been sampled for each token.

Ck is an integer vector denoting how many times each topic has been sampled overall.
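The format of docs and Zd can be illustrated with a toy example. This is only a sketch under assumptions: the small dtm is made-up data, and the topic assignments in Zd are drawn uniformly at random purely to show the format, which is not how the function itself assigns topics.

```r
library(Matrix)

# Toy document term matrix (assumed data): 2 documents, 3 tokens.
# Document 1 uses token 1 twice and token 3 once;
# document 2 uses tokens 2 and 3 once each.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(1, 3, 2, 3),
  x = c(2, 1, 1, 1),
  dims = c(2, 3)
)

# docs: one integer vector per document, holding the zero-indexed
# column of dtm for every token occurrence.
docs <- lapply(seq_len(nrow(dtm)), function(j) {
  counts <- dtm[j, ]  # dense numeric vector of token counts
  rep(seq_along(counts) - 1L, times = counts)
})

# Zd: same shape as docs, but holding zero-indexed topics.
# Random assignment here is for illustration only.
set.seed(1)
k <- 2L
Zd <- lapply(docs, function(d) {
  sample.int(k, length(d), replace = TRUE) - 1L
})

docs[[1]]  # 0 0 2
docs[[2]]  # 1 2
```

Each element of docs has length sum(dtm[j,]), and the corresponding element of Zd has the same length, with one topic per token occurrence.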


Cd, Cv, and Ck should all be derivable by summing over Zd: by document for Cd, by token (using docs) for Cv, and overall for Ck.
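The relationship between Zd and the count objects can be sketched in base R. The inputs below are made-up toy values, and the matrix orientations (documents by topics for Cd, topics by tokens for Cv) are assumptions made for this illustration; the source does not state them.

```r
# Toy inputs (assumed): zero-indexed tokens and topics for 2 documents
docs <- list(c(0L, 0L, 2L), c(1L, 2L))
Zd   <- list(c(1L, 0L, 1L), c(0L, 0L))
k <- 2L         # number of topics
n_tokens <- 3L  # vocabulary size

# Cd: topic counts per document (assumed orientation: documents x topics)
Cd <- t(vapply(Zd, function(z) tabulate(z + 1L, nbins = k), integer(k)))

# Cv: topic counts per token (assumed orientation: topics x tokens),
# built by pairing each topic assignment with its token.
z_all <- unlist(Zd)
w_all <- unlist(docs)
Cv <- matrix(0L, nrow = k, ncol = n_tokens)
for (i in seq_along(z_all)) {
  Cv[z_all[i] + 1L, w_all[i] + 1L] <- Cv[z_all[i] + 1L, w_all[i] + 1L] + 1L
}

# Ck: overall topic counts
Ck <- tabulate(z_all + 1L, nbins = k)

# The three objects agree with one another
stopifnot(all(colSums(Cd) == Ck), all(rowSums(Cv) == Ck))
```

In this toy case Ck is c(3, 2): topic 0 was assigned three times and topic 1 twice across both documents.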

