lda_acgs_st: LDA: Serial Tempering with Perplexity Computation


View source: R/RcppExports.R

Description

Implements the LDA serial tempering algorithm. Sampling of the z_{di}'s is adapted from the collapsed Gibbs sampler of Griffiths and Steyvers (2004). To compute perplexity, the function first partitions each document in the corpus into two sets of words, according to a user-defined test_set_share: (a) a test (held-out) set and (b) a training set. It then runs the Markov chain on the training set and computes the perplexity of the held-out set.
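The per-document split described above can be sketched in base R as follows. This is illustrative only: the actual partitioning happens inside the C++ sampler, and `split_document` is a hypothetical helper, not part of the package.

```r
# Sketch of the per-document train/held-out split, assuming a uniformly
# random choice of held-out positions (an assumption; the package's C++
# code may partition differently).
split_document <- function(tokens, test_set_share) {
  n <- length(tokens)
  n_test <- floor(n * test_set_share)                  # held-out word count
  test_idx <- sample.int(n, n_test)                    # random held-out slots
  list(train = tokens[setdiff(seq_len(n), test_idx)],  # Markov chain runs here
       test  = tokens[test_idx])                       # perplexity computed here
}

set.seed(1)
doc <- c(0L, 4L, 4L, 7L, 2L, 9L, 9L, 1L)  # term indices start at 0
parts <- split_document(doc, test_set_share = 0.25)
```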

Usage

lda_acgs_st(num_topics, vocab_size, docs_tf, h_grid, st_grid, st_grid_nbrs,
  init_st_grid_index, zetas, tuning_iter, max_iter_tuning, max_iter_final,
  burn_in, spacing, test_set_share, save_beta, save_theta, save_lp,
  save_hat_ratios, save_tilde_ratios, verbose)

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_tf

A list of corpus documents read from a Blei-format corpus using read_docs (term indices start at 0)

h_grid

A two-dimensional grid of hyperparameters h = (η, α), stored as a 2 x G matrix, where G is the number of grid points; the first row holds the α values and the second row holds the η values

st_grid

A two-dimensional grid of hyperparameters h = (η, α), stored as a 2 x G matrix, where G is the number of grid points; the first row holds the α values and the second row holds the η values. This is a subgrid of h_grid that is used for serial tempering

st_grid_nbrs

The neighbor indices (0-based, in [0, G-1]) of each helper grid point

init_st_grid_index

Index into the helper h grid (1-based, in [1, G]) of the initial hyperparameter h = (η, α)

zetas

Initial guess for normalization constants

tuning_iter

Number of tuning iterations

max_iter_tuning

Maximum number of Gibbs iterations to be performed for the tuning iterations

max_iter_final

Maximum number of Gibbs iterations to be performed for the final run

burn_in

Burn-in-period for the Gibbs sampler

spacing

Spacing between the stored samples (to reduce correlation)

test_set_share

Proportion of test words in each document; must be between 0 and 1

save_beta

If 0, the function does not save β samples

save_theta

If 0, the function does not save θ samples

save_lp

If 0, the function does not save the computed log posterior for each iteration

save_hat_ratios

If 0, the function does not save hat ratios for each iteration

save_tilde_ratios

If 0, the function does not save tilde ratios for each iteration

verbose

Verbosity level; one of 0, 1, or 2
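A hedged sketch of how the grid arguments might be assembled. All grid values, the neighbor structure, the corpus file name, and the call's iteration counts below are hypothetical illustrations, not values recommended by the package; the final call assumes the ldamcmc package and a Blei-format corpus file are available, so it is left not run.

```r
# Illustrative construction of h_grid and st_grid (first row alpha,
# second row eta); the specific values are arbitrary.
alpha_vals <- seq(0.05, 0.50, by = 0.05)   # 10 alpha values
eta_vals   <- seq(0.10, 1.00, by = 0.10)   # 10 eta values
grid_df <- expand.grid(alpha = alpha_vals, eta = eta_vals)
h_grid  <- rbind(alpha = grid_df$alpha, eta = grid_df$eta)  # 2 x 100 matrix

# A coarser subgrid of h_grid for serial tempering
st_df   <- expand.grid(alpha = alpha_vals[c(1, 5, 10)],
                       eta   = eta_vals[c(1, 5, 10)])
st_grid <- rbind(alpha = st_df$alpha, eta = st_df$eta)      # 2 x 9 matrix

# Hypothetical neighbor lists, as 0-based indices into the subgrid:
# here simply the adjacent points along the flattened index (the real
# neighborhood structure on the 2-D grid may differ).
G <- ncol(st_grid)
st_grid_nbrs <- lapply(seq_len(G) - 1L, function(g)
  intersect(c(g - 1L, g + 1L), 0:(G - 1L)))

## Not run (requires the ldamcmc package and a Blei-format corpus):
# docs_tf <- read_docs("corpus.ldac")   # hypothetical file name
# fit <- lda_acgs_st(num_topics = 20, vocab_size = 5000, docs_tf = docs_tf,
#   h_grid = h_grid, st_grid = st_grid, st_grid_nbrs = st_grid_nbrs,
#   init_st_grid_index = 1, zetas = rep(1, ncol(st_grid)),
#   tuning_iter = 5, max_iter_tuning = 100, max_iter_final = 1000,
#   burn_in = 100, spacing = 5, test_set_share = 0.2,
#   save_beta = 0, save_theta = 0, save_lp = 1,
#   save_hat_ratios = 0, save_tilde_ratios = 0, verbose = 1)
```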

Value

A list containing:

corpus_topic_counts

Corpus-level topic counts from the last iteration of the Markov chain

theta_counts

Document-level topic counts from the last iteration of the Markov chain

beta_counts

Topic word counts from the last iteration of the Markov chain

theta_samples

θ samples after the burn-in period, if save_theta is set

beta_samples

β samples after the burn-in period, if save_beta is set

log_posterior

The log posterior (up to an additive constant) of the hidden variables ψ = (β, θ, z) in the LDA model, if save_lp is set

perplexity

Perplexity of the held-out word set
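The held-out perplexity is presumably computed with the standard definition below, where w_d^test denotes the held-out words of document d and N_d^test their count; this formula is an assumption consistent with the description above, so check the source for the exact estimator used.

```latex
\mathrm{perplexity}\left(\mathbf{w}^{\mathrm{test}}\right)
  = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p\!\left(\mathbf{w}_d^{\mathrm{test}}\right)}
                        {\sum_{d=1}^{D} N_d^{\mathrm{test}}} \right)
```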

Note

Modified on:

October 01, 2016 - Created; adapted from lda_fgs_st.cpp

See Also

Other MCMC: lda_cgs_em_perplexity, lda_cgs_em, lda_cgs_perplexity, lda_fgs_BF_perplexity, lda_fgs_perplexity, lda_fgs_ppc, lda_fgs_st_perplexity


clintpgeorge/ldamcmc documentation built on Feb. 22, 2020, 12:39 p.m.