clda_ags_em: cLDA: Auxiliary Variable Update within Collapsed Gibbs...


Description

This implements a Markov chain on (z, π) via the collapsed Gibbs sampling with auxiliary variable updates for the compound latent Dirichlet allocation (cLDA) model.

Usage

clda_ags_em(num_topics, vocab_size, docs_cid, docs_tf, alpha_h, gamma_h, eta_h,
  em_max_iter, gibbs_max_iter, burn_in, spacing, save_pi, save_theta, save_beta,
  save_lp, verbose, init_pi, test_doc_share = 0, test_word_share = 0,
  burn_in_pi = 10L, sample_alpha_h = FALSE, gamma_shape = 1,
  gamma_rate = 1)
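
For orientation, a minimal invocation might look like the sketch below. This is not part of the package documentation: the corpus file path, the collection assignment, the vocabulary size, and the shape of init_pi are illustrative assumptions; consult the help page for read_docs and the package source for the exact input formats.

# Read a corpus in Blei's LDA-C format; the path is hypothetical
docs_tf <- read_docs("corpus.ldac")
num_docs <- length(docs_tf)  # assumes read_docs returns one element per document

# Assign each document to one of two collections; collection IDs start at 0
docs_cid <- c(rep(0L, floor(num_docs / 2)), rep(1L, ceiling(num_docs / 2)))

K <- 20  # number of topics
J <- 2   # number of collections

# Run the auxiliary-variable collapsed Gibbs sampler with EM hyperparameter updates
fit <- clda_ags_em(
  num_topics = K, vocab_size = 5000,  # vocab_size must match the corpus vocabulary
  docs_cid = docs_cid, docs_tf = docs_tf,
  alpha_h = 0.1, gamma_h = 0.5, eta_h = 0.01,
  em_max_iter = 10, gibbs_max_iter = 2000, burn_in = 1000, spacing = 5,
  save_pi = 1, save_theta = 0, save_beta = 0, save_lp = 1,
  verbose = 1,
  init_pi = matrix(1 / K, nrow = J, ncol = K)  # shape assumed: collections x topics
)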

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_cid

Collection ID for each document in the corpus (indices start at 0)

docs_tf

Corpus documents in the Blei corpus format, read e.g. via read_docs (indices start at 0)

alpha_h

Hyperparameter for π. When sample_alpha_h is TRUE, this value is used to initialize the hyperparameter α

gamma_h

Hyperparameter for θ

eta_h

Hyperparameter for β

em_max_iter

Maximum number of EM iterations to be performed

gibbs_max_iter

Maximum number of Gibbs iterations to be performed

burn_in

Burn-in period for the Gibbs sampler

spacing

Spacing between the stored samples (to reduce correlation)

save_pi

If 0, the function does not save π samples

save_theta

If 0, the function does not save θ samples

save_beta

If 0, the function does not save β samples

save_lp

If 0, the function does not save the computed log posterior for each iteration

verbose

Verbosity level: one of 0, 1, or 2

init_pi

The initial configuration of the collection-level topic mixtures π

test_doc_share

Proportion of test (held-out) documents in the corpus. Must be in [0, 1)

test_word_share

Proportion of held-out words in each test document. Must be in [0, 1)

burn_in_pi

Number of burn-in iterations before π sampling begins

sample_alpha_h

Whether to sample the hyperparameter α (TRUE) or keep it fixed (FALSE)

gamma_shape

Shape parameter of the Gamma prior on α. Default is 1.

gamma_rate

Rate parameter of the Gamma prior on α. Default is 1.

Details

To compute perplexity, we first partition the words in the corpus into two sets: (a) a test (held-out) set, selected from the words in the test documents (identified via test_doc_share and test_word_share), and (b) a training set, i.e., the remaining words in the corpus. We then run the EM algorithm, with auxiliary-variable collapsed Gibbs sampling, on the training set. Finally, we compute per-word perplexity on the held-out set.
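
As a hedged illustration, holding out words for perplexity might look like the sketch below; the shares chosen are arbitrary, and the remaining arguments follow the sketch under Usage (docs_tf, docs_cid, K, and J are assumed to be defined there).

# Hold out 20% of documents as test documents, and within each test document
# hold out 50% of the words; the held-out words are excluded from training
fit <- clda_ags_em(
  num_topics = K, vocab_size = 5000, docs_cid = docs_cid, docs_tf = docs_tf,
  alpha_h = 0.1, gamma_h = 0.5, eta_h = 0.01,
  em_max_iter = 10, gibbs_max_iter = 2000, burn_in = 1000, spacing = 5,
  save_pi = 0, save_theta = 0, save_beta = 0, save_lp = 0,
  verbose = 1, init_pi = matrix(1 / K, nrow = J, ncol = K),
  test_doc_share = 0.2, test_word_share = 0.5
)

# Per-word perplexity on the held-out words
fit$perplexity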

Value

A list of

corpus_topic_counts

corpus-level topic counts from the last iteration of the Markov chain

pi_counts

collection-level topic counts from the last iteration of the Markov chain

theta_counts

document-level topic counts from the last iteration of the Markov chain

beta_counts

topic-word counts from the last iteration of the Markov chain

pi_samples

π samples after the burn-in period, if save_pi is set

theta_samples

θ samples after the burn-in period, if save_theta is set

beta_samples

β samples after the burn-in period, if save_beta is set

log_posterior

the log posterior (up to a normalizing constant) of the hidden variables ψ = (β, π, θ, z) in the cLDA model, if save_lp is set

log_posterior_pi_z

the log posterior (up to a normalizing constant) of the hidden variables (π, z) in the cLDA model, if save_lp is set

perplexity

perplexity of the set of held-out words

alpha_h_samples

α samples, if sample_alpha_h is TRUE

gamma_h_estimates

γ estimates from each EM iteration

eta_h_estimates

η estimates from each EM iteration
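
A brief sketch of inspecting the returned list, assuming a fit object such as the one from the Usage sketch (run with save_pi = 1 and save_lp = 1); str() is used because the exact dimensions of the saved sample arrays are not documented here, and log_posterior is assumed to be a numeric vector over the saved iterations.

# Inspect the structure of the saved π samples (requires save_pi = 1)
str(fit$pi_samples)

# Corpus-level topic counts from the final state of the chain
fit$corpus_topic_counts

# Trace of the log posterior over saved iterations (requires save_lp = 1)
plot(fit$log_posterior, type = "l",
     xlab = "Saved iteration", ylab = "Log posterior (unnormalized)")

# Hyperparameter estimates from each EM iteration
fit$gamma_h_estimates
fit$eta_h_estimates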

Note

Updated on: December 18, 2017 – Added hyperparameter alpha sampling and AGS EM updates

Updated on: June 02, 2016

Created on: May 18, 2016

Created by: Clint P. George

See Also

Other MCMC: clda_ags_sample_alpha, clda_ags, clda_mgs, lda_cgs

