lda_cgs: LDA: Collapsed Gibbs Sampler with Perplexity Computation

Description

This implements the collapsed Gibbs sampler for the LDA model, i.e., a Markov chain on the topic assignments z.

Usage

lda_cgs(num_topics, vocab_size, docs_tf, alpha_h, eta_h, max_iter, burn_in,
  spacing, save_theta, save_beta, save_lp, verbose, test_doc_share = 0,
  test_word_share = 0)

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_tf

A list of corpus documents read from a Blei-format corpus using read_docs (term indices start at 0)

alpha_h

Hyperparameter for θ sampling

eta_h

Smoothing parameter for the β matrix

max_iter

Maximum number of Gibbs iterations to be performed

burn_in

Burn-in period for the Gibbs sampler

spacing

Spacing between the stored samples (to reduce correlation)

save_theta

If 0, the function does not save θ samples

save_beta

If 0, the function does not save β samples

save_lp

If 0, the function does not save the computed log posterior for each iteration

verbose

Verbosity level: 0, 1, or 2

test_doc_share

Proportion of test documents in the corpus. Must be in [0, 1)

test_word_share

Proportion of test words in each test document. Must be in [0, 1)

Details

To compute perplexity, we first partition the words in a corpus into two sets: (a) a test (held-out) set, selected from the words in the test (held-out) documents identified via test_doc_share and test_word_share, and (b) a training set, i.e., the remaining words in the corpus. We then run the collapsed Gibbs sampler on the training set. Finally, we compute per-word perplexity on the held-out set.
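
The following is a minimal sketch of such a perplexity run. It assumes the clda package is attached, that "corpus.ldac" is a placeholder path to a Blei-format corpus, that read_docs accepts that path (its exact interface may differ), and that all hyperparameter and iteration values are purely illustrative.

library(clda)

## Read a Blei-format corpus; the file path and the read_docs
## interface shown here are assumptions, not part of this page.
docs_tf <- read_docs("corpus.ldac")

## Hold out 20% of documents; within each held-out document,
## hold out 50% of its words for the perplexity computation.
mc <- lda_cgs(num_topics = 20,
              vocab_size = 5000,      # must match the vocabulary of docs_tf
              docs_tf = docs_tf,
              alpha_h = 0.1,          # hyperparameter for theta sampling
              eta_h = 0.01,           # smoothing parameter for the beta matrix
              max_iter = 2000,
              burn_in = 1000,
              spacing = 10,
              save_theta = 1,
              save_beta = 0,
              save_lp = 1,
              verbose = 1,
              test_doc_share = 0.2,
              test_word_share = 0.5)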

Value

The Markov chain output as a list of

corpus_topic_counts

corpus-level topic counts from the last iteration of the Markov chain

theta_counts

document-level topic counts from the last iteration of the Markov chain

beta_counts

topic-word counts from the last iteration of the Markov chain

theta_samples

θ samples after the burn-in period, if save_theta is set

beta_samples

β samples after the burn-in period, if save_beta is set

log_posterior

the log posterior (up to a constant) of the hidden variables ψ = (β, θ, z) in the LDA model, if save_lp is set

perplexity

perplexity of the held-out word set
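
Assuming mc holds the list returned by the call sketched in Details, its components can be inspected as follows (a minimal sketch; the exact shapes of the components are not documented here).

## Held-out per-word perplexity (meaningful only when the test shares are > 0)
mc$perplexity

## Count matrices from the last iteration of the chain
dim(mc$beta_counts)       # topic-word counts
dim(mc$theta_counts)      # document-level topic counts

## Saved theta samples (present because save_theta was set in the call above)
str(mc$theta_samples)

## Trace of the log posterior over the saved iterations
plot(mc$log_posterior, type = "l",
     xlab = "saved iteration", ylab = "log posterior")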

See Also

Other MCMC: clda_ags_em, clda_ags_sample_alpha, clda_ags, clda_mgs

