Description Usage Arguments Details Value Note See Also
This implements a Markov chain on (z, π) via collapsed Gibbs sampling with auxiliary-variable updates for the compound latent Dirichlet allocation (cLDA) model.
clda_ags_sample_alpha(num_topics, vocab_size, docs_cid, docs_tf, alpha_h,
  gamma_h, eta_h, max_iter, burn_in, spacing, save_pi, save_theta, save_beta,
  save_lp, verbose, init_pi, test_doc_share = 0, test_word_share = 0,
  burn_in_pi = 10L, sample_alpha_h = FALSE, gamma_shape = 1,
  gamma_rate = 1)
num_topics: Number of topics in the corpus
vocab_size: Vocabulary size
docs_cid: Collection ID for each document in the corpus (indices start at 0)
docs_tf: Corpus documents in the Blei corpus format
alpha_h: Hyperparameter for π
gamma_h: Hyperparameter for θ
eta_h: Hyperparameter for β
max_iter: Maximum number of Gibbs iterations to perform
burn_in: Burn-in period for the Gibbs sampler
spacing: Spacing between stored samples (to reduce autocorrelation)
save_pi: If 0, the function does not save π samples
save_theta: If 0, the function does not save θ samples
save_beta: If 0, the function does not save β samples
save_lp: If 0, the function does not save the computed log posterior at each iteration
verbose: Verbosity level: 0, 1, or 2
init_pi: Initial configuration of the collection-level topic mixtures, i.e., π
test_doc_share: Proportion of test documents in the corpus; must be in [0, 1)
test_word_share: Proportion of test words in each test document; must be in [0, 1)
burn_in_pi: Number of burn-in iterations before π sampling begins
sample_alpha_h: Whether to sample the hyperparameter α (TRUE) or not (FALSE)
gamma_shape: Shape hyperparameter of the Gamma prior used when sampling α
gamma_rate: Rate hyperparameter of the Gamma prior used when sampling α
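The interaction of max_iter, burn_in, and spacing determines which iterations contribute stored samples. The package does not document its exact indexing convention, so the sketch below shows a common thinning scheme as an assumption (the helper name and the "keep every spacing-th post-burn-in iteration" rule are illustrative, not the package's verified behavior):

```python
def kept_iterations(max_iter, burn_in, spacing):
    # A common MCMC thinning convention (an assumption; the package may
    # index iterations differently): discard the first `burn_in` iterations,
    # then keep every `spacing`-th iteration to reduce autocorrelation
    # between stored samples.
    return [it for it in range(1, max_iter + 1)
            if it > burn_in and (it - burn_in) % spacing == 0]

# With max_iter = 10, burn_in = 4, spacing = 2, iterations 6, 8, and 10
# would be stored under this convention.
print(kept_iterations(10, 4, 2))
```

Under this convention, roughly (max_iter - burn_in) / spacing samples are stored per saved quantity.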
To compute perplexity, we first partition the words in a corpus into two sets: (a) a test (held-out) set, selected from the words in the test (held-out) documents (identified via test_doc_share and test_word_share), and (b) a training set, i.e., the remaining words in the corpus. We then run the Gibbs sampler on the training set. Finally, we compute per-word perplexity on the held-out set.
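The per-word perplexity mentioned above is the standard exponentiated negative average log-likelihood of the held-out words. A minimal sketch (the function name is illustrative; the package computes the held-out log-likelihoods internally from the trained model):

```python
import math

def per_word_perplexity(log_probs):
    # log_probs: log-likelihood of each held-out word under the trained model.
    # Per-word perplexity = exp(-average log-likelihood); lower is better.
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check: if every held-out word has probability 1/4 under the model,
# the perplexity equals 4.
print(per_word_perplexity([math.log(0.25)] * 4))
```

A model that assigned uniform probability 1/V to every word would score a perplexity of V, so values well below the vocabulary size indicate the model has learned useful structure.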
A list of:
corpus_topic_counts: corpus-level topic counts from the last iteration of the Markov chain
pi_counts: collection-level topic counts from the last iteration of the Markov chain
theta_counts: document-level topic counts from the last iteration of the Markov chain
beta_counts: topic word counts from the last iteration of the Markov chain
pi_samples: π samples after the burn-in period, if save_pi is set
theta_samples: θ samples after the burn-in period, if save_theta is set
beta_samples: β samples after the burn-in period, if save_beta is set
log_posterior: the log posterior (up to an additive constant) of the hidden variables ψ = (β, π, θ, z) in the cLDA model, if save_lp is set
log_posterior_pi_z: the log posterior (up to an additive constant) of the hidden variables (π, z) in the cLDA model, if save_lp is set
perplexity: perplexity of the set of held-out words
Updated on: December 17, 2017 – Added hyperparameter alpha sampling
Updated on: June 02, 2016
Created on: May 18, 2016
Created by: Clint P. George
Other MCMC: clda_ags_em, clda_ags, clda_mgs, lda_cgs