clda_ags: cLDA: Auxiliary Variable Update within Collapsed Gibbs Sampling


Description

This implements a Markov chain on (z, π) via the collapsed Gibbs sampling with auxiliary variable updates for the compound latent Dirichlet allocation (cLDA) model.

Usage

clda_ags(num_topics, vocab_size, docs_cid, docs_tf, alpha_h, gamma_h, eta_h,
  max_iter, burn_in, spacing, save_pi, save_theta, save_beta, save_lp, verbose,
  init_pi, test_doc_share = 0, test_word_share = 0, burn_in_pi = 10L)
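A minimal end-to-end sketch (not taken from the package itself): the corpus file name, the collection split, and all hyperparameter values below are illustrative assumptions.

# Hypothetical call; read_docs is this package's reader, but every value
# here is an assumption chosen for illustration.
docs_tf  <- read_docs("corpus.ldac")        # documents in the Blei (LDA-C) format
docs_cid <- rep(c(0L, 1L), length.out = length(docs_tf))  # toy 0-based collection IDs
K <- 20
init_pi <- matrix(1 / K, nrow = 2, ncol = K)  # uniform collection-level mixtures (shape assumed)
fit <- clda_ags(num_topics = K, vocab_size = 5000L, docs_cid = docs_cid,
  docs_tf = docs_tf, alpha_h = 1, gamma_h = 0.5, eta_h = 0.1,
  max_iter = 2000L, burn_in = 1000L, spacing = 10L,
  save_pi = 1L, save_theta = 0L, save_beta = 0L, save_lp = 1L,
  verbose = 1L, init_pi = init_pi)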

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_cid

Collection ID for each document in the corpus (indices start at 0)

docs_tf

Corpus documents read from the Blei (LDA-C) corpus format, e.g., via read_docs (word indices start at 0); see the format example after this list

alpha_h

Hyperparameter for π

gamma_h

Hyperparameter for θ

eta_h

Hyperparameter for β

max_iter

Maximum number of Gibbs iterations to be performed

burn_in

Burn-in period for the Gibbs sampler

spacing

Spacing between the stored samples (to reduce correlation)

save_pi

If 0, the function does not save π samples

save_theta

If 0, the function does not save θ samples

save_beta

If 0, the function does not save β samples

save_lp

If 0, the function does not save the computed log posterior for each iteration

verbose

Verbosity level; one of 0, 1, or 2

init_pi

The initial configuration of the collection-level topic mixtures, i.e., the π samples

test_doc_share

Proportion of test documents in the corpus; must be in [0, 1)

test_word_share

Proportion of test words in each test document; must be in [0, 1)
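For reference, the Blei (LDA-C) corpus format expected by read_docs stores one document per line as [number of unique terms] [term]:[count] ..., with 0-based term ids. A tiny corpus can be written and loaded as follows; the exact structure that read_docs returns is assumed here, not documented.

# Two documents in the Blei (LDA-C) format; term ids are 0-based.
writeLines(c("3 0:2 5:1 12:4",   # doc 1: term 0 twice, term 5 once, term 12 four times
             "2 1:1 5:3"),       # doc 2: term 1 once, term 5 three times
           "toy.ldac")
docs_tf <- read_docs("toy.ldac")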

Details

To compute perplexity, we first partition the words in a corpus into two sets: (a) a test (held-out) set, selected from the words in the test (held-out) documents identified via test_doc_share and test_word_share, and (b) a training set, i.e., the remaining words in the corpus. We then run the Markov chain on the training set. Finally, we compute per-word perplexity based on the held-out set, as sketched below.
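The last step follows the standard definition: per-word perplexity is the exponentiated negative average held-out log likelihood. The variable names and values below are hypothetical, not fields produced by this function.

# Per-word perplexity from a total held-out log likelihood `heldout_llik`
# over `n_heldout` held-out words (both values are illustrative):
heldout_llik <- -3521.7
n_heldout    <- 500
exp(-heldout_llik / n_heldout)   # ~ 1145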

Value

A list of

corpus_topic_counts

corpus-level topic counts from the last iteration of the Markov chain

pi_counts

collection-level topic counts from the last iteration of the Markov chain

theta_counts

document-level topic counts from the last iteration of the Markov chain

beta_counts

topic-word counts from the last iteration of the Markov chain

pi_samples

π samples after the burn-in period, if save_pi is set

theta_samples

θ samples after the burn-in period, if save_theta is set

beta_samples

β samples after the burn-in period, if save_beta is set

log_posterior

the log posterior (up to an additive constant) of the hidden variables ψ = (β, π, θ, z) in the cLDA model, if save_lp is set

log_posterior_pi_z

the log posterior (up to an additive constant) of the hidden variables (π, z) in the cLDA model, if save_lp is set

perplexity

perplexity of the held-out word set
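A short sketch of inspecting the returned list, continuing the hypothetical fit object from the Usage example; whether unsaved entries come back as NULL is an assumption.

# `fit` is the hypothetical result from the Usage sketch above.
fit$perplexity                 # per-word held-out perplexity
str(fit$corpus_topic_counts)   # topic counts from the last iteration
if (!is.null(fit$log_posterior))  # saved only when save_lp is set (assumed NULL otherwise)
  plot(fit$log_posterior, type = "l",
       xlab = "saved sample", ylab = "log posterior")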

Note

Modified on: June 02, 2016

Created on: May 18, 2016

Created by: Clint P. George

See Also

Other MCMC: clda_ags_em, clda_ags_sample_alpha, clda_mgs, lda_cgs

