fit_lda_c: Main C++ Gibbs sampler for Latent Dirichlet Allocation

View source: R/RcppExports.R

fit_lda_c    R Documentation

Main C++ Gibbs sampler for Latent Dirichlet Allocation

Description

This is the C++ Gibbs sampler for LDA. "Abandon all hope, ye who enter here."

Usage

fit_lda_c(
  Docs,
  Zd_in,
  Cd_in,
  Cv_in,
  Ck_in,
  alpha_in,
  eta_in,
  iterations,
  burnin,
  optimize_alpha,
  calc_likelihood,
  Beta_in,
  freeze_topics,
  threads = 1L,
  verbose = TRUE
)

Arguments

Docs

List with one element for each document and one entry for each token as formatted by initialize_topic_counts

Zd_in

List with one element for each document and one entry for each token, as formatted by initialize_topic_counts; each entry holds the topic currently assigned to that token

Cd_in

IntegerMatrix denoting counts of topics in documents

Cv_in

IntegerMatrix denoting counts of tokens in topics

Ck_in

IntegerVector denoting counts of topics across all tokens

alpha_in

NumericVector prior for topics over documents

eta_in

NumericMatrix for prior of tokens over topics

iterations

int; total number of Gibbs iterations to run

burnin

int; number of burn-in iterations

optimize_alpha

bool; whether to optimize alpha at each iteration

calc_likelihood

bool; whether to calculate the log likelihood at each iteration

Beta_in

NumericMatrix denoting probability of tokens in topics

freeze_topics

bool; set to TRUE when making predictions

threads

unsigned integer; number of parallel threads. For now, nothing is actually run in parallel

verbose

bool; whether to print a progress bar
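
As a rough illustration of how the count arguments relate to the token-level lists, the sketch below tallies Cd, Cv, and Ck from Docs and Zd. It is not the tidylda implementation. It assumes that Docs holds zero-based vocabulary indices, that Zd holds zero-based topic assignments, that Cd is documents by topics, and that Cv is topics by vocabulary. The names TopicCounts and count_topics are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using IntMatrix = std::vector<std::vector<int>>;

// Hypothetical container for the three count structures; not part of tidylda.
struct TopicCounts {
  IntMatrix Cd;         // documents x topics (assumed orientation)
  IntMatrix Cv;         // topics x vocabulary (assumed orientation)
  std::vector<int> Ck;  // number of tokens assigned to each topic
};

// Tally counts from token indices (docs) and topic assignments (zd).
// Both are assumed to be zero-based and to have matching shapes.
TopicCounts count_topics(const IntMatrix& docs, const IntMatrix& zd,
                         std::size_t n_topics, std::size_t n_vocab) {
  assert(docs.size() == zd.size());
  TopicCounts out;
  out.Cd.assign(docs.size(), std::vector<int>(n_topics, 0));
  out.Cv.assign(n_topics, std::vector<int>(n_vocab, 0));
  out.Ck.assign(n_topics, 0);
  for (std::size_t d = 0; d < docs.size(); ++d) {
    assert(docs[d].size() == zd[d].size());
    for (std::size_t n = 0; n < docs[d].size(); ++n) {
      const int v = docs[d][n];  // vocabulary index of this token
      const int k = zd[d][n];    // topic assigned to this token
      ++out.Cd[d][k];
      ++out.Cv[k][v];
      ++out.Ck[k];
    }
  }
  return out;
}
```

Under these assumptions, each entry of Ck equals the sum of the matching Cv row, which makes a cheap consistency check before the counts are passed to the sampler.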

Details

Arguments ending in _in are copied, and this function modifies the copies. For eta_in and Beta_in, the only modification is conversion from a matrix to a nested std::vector for speed, reliability, and thread safety. All other _in arguments may be explicitly modified during training.
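
The matrix-to-nested-vector conversion described above might look roughly like the following. This is a minimal sketch in plain C++ rather than the package's code. It assumes the matrix arrives as a flat, column-major buffer (the layout R uses for matrices), and the name to_nested is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a flat column-major buffer into a row-indexed nested vector,
// so that out[i][j] is the element in row i, column j.
std::vector<std::vector<double>> to_nested(const std::vector<double>& flat,
                                           std::size_t n_rows,
                                           std::size_t n_cols) {
  assert(flat.size() == n_rows * n_cols);
  std::vector<std::vector<double>> out(n_rows, std::vector<double>(n_cols));
  for (std::size_t j = 0; j < n_cols; ++j) {
    for (std::size_t i = 0; i < n_rows; ++i) {
      out[i][j] = flat[j * n_rows + i];  // column-major offset
    }
  }
  return out;
}
```

Because the result is an independent copy, later writes to it cannot reach the caller's matrix, which fits the copy semantics described for the _in arguments.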

Value

Returns a list with the following entries.

Cd is a matrix counting the number of times each topic is sampled per document.

Cv is a matrix counting the number of times each topic is sampled per token.

Cd_mean is the same as Cd, but with values averaged across the iterations after burnin.

Cv_mean is the same as Cv, but with values averaged across the iterations after burnin.

Cd_sum is the same as Cd, but with values summed across the iterations after burnin.

Cv_sum is the same as Cv, but with values summed across the iterations after burnin.

log_likelihood is a matrix with one row indexing iterations and one row giving the log likelihood at each iteration.

alpha is a vector of the document-topic prior.

eta is a matrix of the topic-token prior.
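
To make the relationship between the _sum and _mean entries concrete, the sketch below accumulates a count matrix over the iterations after burn-in and divides by the number of iterations kept. It is an illustration only, not the sampler's code. It assumes one snapshot of the count matrix per iteration and treats the snapshots at zero-based positions burnin and later as the post-burn-in ones. The names PostBurnin and summarize_post_burnin are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type holding the summed and averaged counts.
struct PostBurnin {
  std::vector<std::vector<double>> sum;
  std::vector<std::vector<double>> mean;
};

// snapshots[t] is the count matrix after iteration t (zero-based).
// Snapshots with t >= burnin are treated as post-burn-in (an assumption).
PostBurnin summarize_post_burnin(
    const std::vector<std::vector<std::vector<int>>>& snapshots,
    std::size_t burnin) {
  assert(burnin < snapshots.size());
  const std::size_t n_rows = snapshots[0].size();
  const std::size_t n_cols = snapshots[0][0].size();
  const double n_kept = static_cast<double>(snapshots.size() - burnin);

  PostBurnin out;
  out.sum.assign(n_rows, std::vector<double>(n_cols, 0.0));
  out.mean.assign(n_rows, std::vector<double>(n_cols, 0.0));

  // Sum the retained snapshots element by element.
  for (std::size_t t = burnin; t < snapshots.size(); ++t) {
    for (std::size_t i = 0; i < n_rows; ++i) {
      for (std::size_t j = 0; j < n_cols; ++j) {
        out.sum[i][j] += snapshots[t][i][j];
      }
    }
  }
  // The mean is the sum divided by the number of retained iterations.
  for (std::size_t i = 0; i < n_rows; ++i) {
    for (std::size_t j = 0; j < n_cols; ++j) {
      out.mean[i][j] = out.sum[i][j] / n_kept;
    }
  }
  return out;
}
```

A real sampler would more likely keep a running sum during sampling instead of storing every snapshot; the stored-snapshot form is used here only to keep the arithmetic easy to follow.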


tidylda documentation built on May 29, 2024, 11:03 a.m.