initialize_topic_counts: Initialize topic counts for Gibbs sampling
In tidylda: Latent Dirichlet Allocation Using 'tidyverse' Conventions

initialize_topic_counts

R Documentation

Initialize topic counts for Gibbs sampling

Description

Seeded (or guided) LDA models and transfer learning cannot initialize topics with a uniform-random start. This function prepares data and then calls a C++ function, create_lexicon, that runs a single Gibbs iteration to populate the topic counts (and other objects) used during the main Gibbs sampling run of fit_lda_c. If you are not using seeding or transfer learning, it makes a random initialization by sampling from Dirichlet distributions parameterized by the priors alpha and eta.
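
To make the random-initialization case concrete, here is a minimal base R sketch of drawing starting distributions from Dirichlet priors. It is not the package's code: the helper rdirichlet_one, the toy dimensions, the prior values, and the assumed shape of eta (topics in rows, tokens in columns) are all illustrative assumptions.

```r
# Hypothetical helper: one Dirichlet draw, obtained by normalizing gamma draws.
rdirichlet_one <- function(conc) {
  g <- rgamma(length(conc), shape = conc, rate = 1)
  g / sum(g)
}

set.seed(42)

k <- 3         # number of topics
n_docs <- 4    # number of documents (toy value)
n_tokens <- 5  # vocabulary size (toy value)

# Assumed prior shapes: alpha has one entry per topic,
# eta has one row per topic and one column per token.
alpha <- rep(0.1, k)
eta <- matrix(0.05, nrow = k, ncol = n_tokens)

# theta: one row per document, each row a distribution over topics.
theta <- t(replicate(n_docs, rdirichlet_one(alpha)))

# beta: one row per topic, each row a distribution over tokens.
beta <- t(apply(eta, 1, rdirichlet_one))
```

Each row of theta and beta sums to 1, so either matrix could serve as a starting point in the role of theta_initial or beta_initial.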

Usage

initialize_topic_counts(
  dtm,
  k,
  alpha,
  eta,
  beta_initial = NULL,
  theta_initial = NULL,
  freeze_topics = FALSE,
  threads = 1,
  ...
)

Arguments

`dtm`	a document term matrix or term co-occurrence matrix of class `dgCMatrix`.
`k`	the number of topics
`alpha`	the numeric vector prior for topics over documents as formatted by `format_alpha`
`eta`	the numeric matrix prior for tokens over topics as formatted by `format_eta`
`beta_initial`	if specified, a numeric matrix for the probability of tokens in topics. Must be specified for predictions or updates as called by `predict.tidylda` or `refit.tidylda` respectively.
`theta_initial`	if specified, a numeric matrix for the probability of topics in documents. Must be specified for updates as called by `refit.tidylda`
`freeze_topics`	if `TRUE`, counts of tokens in topics are not updated. This is `TRUE` for predictions.
`threads`	number of parallel threads, currently unused
`...`	Additional arguments, currently unused

Value

Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are used by fit_lda_c.

docs is a list with one element per document. Each element is an integer vector of length sum(dtm[j,]) for the j-th document. Each integer entry is the zero-based column index of a token in dtm.
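
The docs format can be illustrated with a small base R sketch. It uses a dense toy matrix rather than a `dgCMatrix` to stay self-contained, and the ordering of tokens within each document is an assumption; the package may order them differently.

```r
# Toy document term matrix: 2 documents (rows) by 3 tokens (columns).
dtm <- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("a", "b", "c"))
)

# For each document, repeat each zero-based column index by its count.
docs <- lapply(seq_len(nrow(dtm)), function(j) {
  rep(seq_len(ncol(dtm)) - 1L, times = dtm[j, ])
})

docs[[1]]  # 0 0 2: token "a" twice, token "c" once
docs[[2]]  # 1 1 1: token "b" three times
```

The length of each element equals the corresponding row sum of the matrix, matching the description above.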

Zd is a list in the same format as docs. The difference is that its integer values are zero-based topic indices.

Cd is a matrix of integers denoting how many times each topic has been sampled in each document.

Cv is similar to Cd but it counts how many times each topic has been sampled for each token.

Ck is an integer vector denoting how many times each topic has been sampled overall.

Note

All of Cd, Cv, and Ck should be derivable by summing over Zd in various ways.
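
As a sketch of what that summing could look like, the base R code below rebuilds the three count objects from hypothetical toy docs and Zd lists. The matrix orientation (topics in rows) is an assumption made for illustration; the objects returned by the package may be laid out differently.

```r
k <- 2         # number of topics (toy value)
n_tokens <- 3  # vocabulary size (toy value)

# Hypothetical toy inputs using zero-based indices.
docs <- list(c(0L, 0L, 2L), c(1L, 1L, 1L))  # token index of each word instance
Zd   <- list(c(0L, 1L, 1L), c(0L, 0L, 1L))  # topic index of each word instance

# Cd: count topic assignments within each document (topics x documents).
Cd <- sapply(Zd, function(z) tabulate(z + 1L, nbins = k))

# Cv: cross-tabulate topic and token indices over all word instances
# (topics x tokens).
z_all <- factor(unlist(Zd), levels = 0:(k - 1))
v_all <- factor(unlist(docs), levels = 0:(n_tokens - 1))
Cv <- matrix(as.integer(table(z_all, v_all)), nrow = k)

# Ck: count topic assignments over the whole corpus.
Ck <- tabulate(unlist(Zd) + 1L, nbins = k)
```

Under this layout, rowSums(Cd) and rowSums(Cv) both equal Ck, and sum(Ck) equals the total number of word instances, which gives a quick consistency check on the returned list.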

tidylda documentation built on May 29, 2024, 11:03 a.m.
