lda_cgs_em_perplexity: LDA: Gibbs-EM with Perplexity Computation


View source: R/RcppExports.R

Description

This function implements the Gibbs-EM algorithm for LDA mentioned in Wallach (2006), Topic Modeling: Beyond Bag-of-Words.

Usage

lda_cgs_em_perplexity(num_topics, vocab_size, docs_tf, alpha_h, eta_h,
  em_max_iter, gibbs_max_iter, burn_in, spacing, save_theta, save_beta, save_lp,
  verbose, test_set_share)

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_tf

A list of corpus documents read from a Blei-format corpus using read_docs (term indices start at 0)

alpha_h

Hyperparameter for θ sampling

eta_h

Smoothing parameter for the β matrix

em_max_iter

Maximum number of EM iterations to be performed

gibbs_max_iter

Maximum number of Gibbs iterations to be performed

burn_in

Burn-in period for the Gibbs sampler

spacing

Spacing between the stored samples (to reduce correlation)

save_theta

If 0, the function does not save θ samples

save_beta

If 0, the function does not save β samples

save_lp

If 0, the function does not save the computed log posterior for each iteration

verbose

Verbosity level; one of 0, 1, or 2

test_set_share

Proportion of test words in each document; must be between 0 and 1
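
A minimal sketch of a call is shown below for orientation. The corpus path, vocabulary size, and hyperparameter values are illustrative assumptions rather than package defaults, and the exact arguments of read_docs may differ from what is shown here.

library(ldamcmc)

# Illustrative settings only; adjust to your corpus.
docs_tf <- read_docs("corpus.ldac")  # Blei-format corpus; term indices start at 0

model <- lda_cgs_em_perplexity(
  num_topics     = 20,
  vocab_size     = 5000,
  docs_tf        = docs_tf,
  alpha_h        = 0.1,    # hyperparameter for theta sampling
  eta_h          = 0.01,   # smoothing parameter for the beta matrix
  em_max_iter    = 10,
  gibbs_max_iter = 1000,
  burn_in        = 500,
  spacing        = 10,
  save_theta     = 0,
  save_beta      = 0,
  save_lp        = 1,
  verbose        = 1,
  test_set_share = 0.2     # hold out 20% of the words in each document
)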

Details

It uses the LDA collapsed Gibbs sampler, a Markov chain on z, for the E-step, and the fixed-point iterations of Minka (2003) to optimize h = (η, α) in the M-step. To compute perplexity, it first partitions each document in the corpus into two sets of words: (a) a test (held-out) set and (b) a training set, according to the user-defined test_set_share. It then runs the Markov chain on the training set and computes perplexity on the held-out set.
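
As a rough sketch of this perplexity computation (not the package's internal code), the predictive probability of each held-out word can be formed from estimates of θ and β obtained from the training-set Markov chain, and perplexity is the exponentiated negative average held-out log-likelihood. All object names below (theta_hat, beta_hat, heldout) are hypothetical:

# Sketch only: theta_hat (D x K) and beta_hat (K x V) are hypothetical estimates
# from the training-set chain; heldout is a hypothetical list with one vector of
# held-out term indices (1-based here) per document.
loglik <- 0
n_words <- 0
for (d in seq_along(heldout)) {
  p_w <- as.vector(theta_hat[d, ] %*% beta_hat)  # predictive word distribution for document d
  loglik <- loglik + sum(log(p_w[heldout[[d]]]))
  n_words <- n_words + length(heldout[[d]])
}
perplexity <- exp(-loglik / n_words)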

Value

The Markov chain output as a list of

corpus_topic_counts

corpus-level topic counts from the last iteration of the Markov chain

theta_counts

document-level topic counts from the last iteration of the Markov chain

beta_counts

topic-word counts from the last iteration of the Markov chain

theta_samples

θ samples after the burn-in period, if save_theta is set

beta_samples

β samples after the burn-in period, if save_beta is set

log_posterior

the log posterior (up to a constant) of the hidden variables ψ = (β, θ, z) in the LDA model, if save_lp is set

perplexity

perplexity of the held-out word set
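
Assuming the returned list is stored in a variable named model (an illustrative name), its components can be inspected as follows; components controlled by the save_* flags are present only when those flags are nonzero.

model$perplexity            # perplexity of the held-out word set
model$corpus_topic_counts   # corpus-level topic counts from the last iteration
str(model$beta_counts)      # topic-word counts from the last iteration
# Present only if the corresponding save_* flags are set:
# model$theta_samples, model$beta_samples, model$log_posterior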

See Also

Other MCMC: lda_acgs_st, lda_cgs_em, lda_cgs_perplexity, lda_fgs_BF_perplexity, lda_fgs_perplexity, lda_fgs_ppc, lda_fgs_st_perplexity

