Description
This implements a Markov chain on (z, π) via the Metropolis adjusted Langevin algorithm within Gibbs sampler (MGS) for the compound latent Dirichlet allocation (cLDA) model.
Usage

clda_mgs(num_topics, vocab_size, docs_cid, docs_tf, alpha_h, gamma_h, eta_h,
         step_size, max_iter, burn_in, spacing, save_pi, save_theta, save_beta,
         save_lp, verbose, init_pi, test_doc_share = 0, test_word_share = 0,
         burn_in_pi = 10L)
Arguments

num_topics: Number of topics in the corpus

vocab_size: Vocabulary size

docs_cid: Document collection IDs (ID indices start at 0)

docs_tf: A list of corpus documents (term frequencies) read from the Blei corpus format

alpha_h: Hyperparameter for the collection-level topic mixtures π

gamma_h: Hyperparameter for the document-level topic mixtures θ

eta_h: Smoothing parameter for the β matrix

step_size: Step size for the Langevin update of π

max_iter: Maximum number of Gibbs iterations to perform

burn_in: Burn-in period of the Gibbs sampler

spacing: Spacing between stored samples (to reduce autocorrelation)

save_pi: If 0, the function does not save π samples

save_theta: If 0, the function does not save θ samples

save_beta: If 0, the function does not save β samples

save_lp: If 0, the function does not save the computed log posterior for each iteration

verbose: Verbosity level; one of 0, 1, or 2

init_pi: The initial configuration of the collection-level topic mixtures, i.e., the π samples
test_doc_share: Proportion of test (held-out) documents in the corpus; must be in [0, 1)

test_word_share: Proportion of test (held-out) words in each test document; must be in [0, 1)

burn_in_pi: Burn-in period for the Langevin updates of π (default: 10)
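As an illustration, a minimal call might look like the sketch below. The toy corpus, the assumed docs_tf layout (one Blei/LDA-C style term-frequency matrix per document, with 0-based word indices), the assumed K x num-collections shape of init_pi, and all hyperparameter values are placeholders, not recommendations; check the package's corpus readers for the exact structures it expects.

library(clda)  # assumed package name exporting clda_mgs

K <- 4   # number of topics
V <- 50  # vocabulary size

# Toy corpus: 6 documents split across 2 collections (0-based IDs).
docs_cid <- c(0, 0, 0, 1, 1, 1)

# Assumed document layout: one 2-row matrix per document,
# row 1 = 0-based word index, row 2 = term frequency.
set.seed(1)
docs_tf <- lapply(seq_len(6), function(d) {
  words <- sample(0:(V - 1), 8)
  rbind(words, sample(1:3, 8, replace = TRUE))
})

# Uniform initial collection-level topic mixtures
# (assumed layout: K rows, one column per collection).
init_pi <- matrix(1 / K, nrow = K, ncol = 2)

mc <- clda_mgs(
  num_topics = K, vocab_size = V,
  docs_cid = docs_cid, docs_tf = docs_tf,
  alpha_h = 2, gamma_h = 0.5, eta_h = 0.1,  # illustrative hyperparameters
  step_size = 1e-3,                         # MALA step size for the pi updates
  max_iter = 2000, burn_in = 1000, spacing = 5,
  save_pi = 1, save_theta = 0, save_beta = 0, save_lp = 1,
  verbose = 1, init_pi = init_pi
)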
Details

To compute perplexity, we first partition the words in the corpus into two
sets: (a) a test (held-out) set, selected from the words of the test
(held-out) documents identified via test_doc_share and test_word_share, and
(b) a training set, i.e., the remaining words in the corpus. We then run the
sampler on the training set and finally compute per-word perplexity on the
held-out set.
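For example, the following sketch (reusing the toy objects from the Arguments example above) holds out 20% of the documents and 50% of the words within each held-out document; all values are illustrative.

mc_heldout <- clda_mgs(
  num_topics = K, vocab_size = V,
  docs_cid = docs_cid, docs_tf = docs_tf,
  alpha_h = 2, gamma_h = 0.5, eta_h = 0.1,
  step_size = 1e-3,
  max_iter = 2000, burn_in = 1000, spacing = 5,
  save_pi = 0, save_theta = 0, save_beta = 0, save_lp = 0,
  verbose = 1, init_pi = init_pi,
  test_doc_share = 0.2, test_word_share = 0.5
)
mc_heldout$perplexity  # per-word perplexity on the held-out words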
Value

The Markov chain output, as a list of:

corpus_topic_counts: Corpus-level topic counts from the last iteration of the Markov chain

pi_counts: Collection-level topic counts from the last iteration of the Markov chain

theta_counts: Document-level topic counts from the last iteration of the Markov chain

beta_counts: Topic word counts from the last iteration of the Markov chain

pi_samples: π samples after the burn-in period, if save_pi is nonzero

theta_samples: θ samples after the burn-in period, if save_theta is nonzero

beta_samples: β samples after the burn-in period, if save_beta is nonzero

log_posterior: The log posterior (up to an additive constant) of the hidden variables ψ = (β, π, θ, z) in the cLDA model, if save_lp is nonzero

log_posterior_pi_z: The log posterior (up to an additive constant) of the hidden variables (π, z) in the cLDA model, if save_lp is nonzero

perplexity: Perplexity of the held-out word set
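A quick way to use this output is to inspect the log-posterior trace as a convergence diagnostic; the sketch below assumes log_posterior is a numeric vector with one entry per saved iteration (it is only present when save_lp is nonzero in the call).

# Trace plot of the unnormalized log posterior from the earlier run.
plot(mc$log_posterior, type = "l",
     xlab = "saved iteration", ylab = "log posterior (up to a constant)")

# The last-iteration count matrices are always returned; inspect their shapes.
str(mc$corpus_topic_counts)
str(mc$beta_counts)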
Note

A Leapling

Created on: February 29, 2016
Modified on: May 18, 2016
Created by: Clint P. George
See Also

Other MCMC: clda_ags_em, clda_ags_sample_alpha, clda_ags, lda_cgs