clda_vem: cLDA: Variational Expectation Maximization

Description

This implements the Variational Expectation Maximization (EM) algorithm for the compound latent Dirichlet allocation (cLDA) model.

Usage

clda_vem(num_topics, vocab_size, docs_cid, docs_tf, alpha_h, gamma_h, eta_h,
  vi_max_iter, em_max_iter, vi_conv_thresh, em_conv_thresh, tau_max_iter,
  tau_step_size, estimate_alpha, estimate_gamma, estimate_eta, verbose, init_pi,
  test_doc_share = 0, test_word_share = 0)

Arguments

num_topics

Number of topics in the corpus

vocab_size

Vocabulary size

docs_cid

Collection IDs of the documents (ID indices start at 0)

docs_tf

A list of corpus documents read from the Blei corpus format using read_docs (term indices start at 0)

alpha_h

Hyperparameter for collection-level Dirichlets π

gamma_h

Hyperparameter for document-level Dirichlets θ

eta_h

Hyperparameter for the corpus-level topic Dirichlets β

vi_max_iter

Maximum number of iterations for variational inference

em_max_iter

Maximum number of iterations for variational EM

vi_conv_thresh

Convergence threshold for the document variational inference loop

em_conv_thresh

Convergence threshold for the variational EM loop

tau_max_iter

Maximum number of iterations for the constrained Newton updates of τ

tau_step_size

The step size for the constrained Newton updates of τ

estimate_alpha

If TRUE, run hyperparameter α optimization

estimate_gamma

Dummy parameter (not implemented)

estimate_eta

If TRUE, run hyperparameter η optimization

verbose

Verbosity level: 0, 1, 2, or 3

init_pi

The initial configuration of the collection-level topic mixtures, i.e., the π samples

test_doc_share

Proportion of test documents in the corpus. Must be in [0, 1)

test_word_share

Proportion of test words in each test document. Must be in [0, 1)

Details

To compute perplexity, we first partition the words in the corpus into two sets: (a) a test (held-out) set, selected from the words in the test (held-out) documents, which are identified via test_doc_share and test_word_share, and (b) a training set, i.e., the remaining words in the corpus. We then run the variational EM algorithm on the training set and compute per-word perplexity on the held-out set.
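For example, a held-out evaluation might be set up as in the sketch below. This is a minimal sketch, assuming the corpus has already been read with read_docs; the file name, hyperparameter values, iteration limits, and the shape of init_pi are illustrative assumptions rather than package defaults.

library(clda)

# Hypothetical corpus in the Blei (LDA-C) format; one 0-based collection ID per document
docs_tf  <- read_docs("corpus.ldac")
docs_cid <- c(0, 0, 1, 1, 2)

K <- 20                          # number of topics
J <- length(unique(docs_cid))    # number of collections

fit <- clda_vem(
  num_topics      = K,
  vocab_size      = 5000,
  docs_cid        = docs_cid,
  docs_tf         = docs_tf,
  alpha_h         = 0.1,
  gamma_h         = 0.1,
  eta_h           = 0.01,
  vi_max_iter     = 100,
  em_max_iter     = 50,
  vi_conv_thresh  = 1e-5,
  em_conv_thresh  = 1e-4,
  tau_max_iter    = 20,
  tau_step_size   = 0.01,
  estimate_alpha  = TRUE,
  estimate_gamma  = FALSE,
  estimate_eta    = TRUE,
  verbose         = 1,
  init_pi         = matrix(1 / K, nrow = J, ncol = K),  # uniform initial π (assumed J x K shape)
  test_doc_share  = 0.2,   # hold out 20% of documents
  test_word_share = 0.5    # hold out 50% of the words in each test document
)

With test_doc_share = 0 and test_word_share = 0 (the defaults), no words are held out and the model is fit on the full corpus.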

Value

A list of variational parameters

Note

Created on May 13, 2016
