Functions to Fit LDA-type models
Description
These functions use a collapsed Gibbs sampler to fit three different models: latent Dirichlet allocation (LDA), the mixed-membership stochastic blockmodel (MMSB), and supervised LDA (sLDA). These functions take sparsely represented input documents, perform inference, and return point estimates of the latent parameters using the state at the last iteration of Gibbs sampling. Multinomial logit for sLDA is supported using the multinom function from the nnet package.
Usage
lda.collapsed.gibbs.sampler(documents, K, vocab, num.iterations, alpha,
eta, initial = NULL, burnin = NULL, compute.log.likelihood = FALSE,
trace = 0L, freeze.topics = FALSE)
slda.em(documents, K, vocab, num.e.iterations, num.m.iterations, alpha,
eta, annotations, params, variance, logistic = FALSE, lambda = 10,
regularise = FALSE, method = "sLDA", trace = 0L, MaxNWts=3000)
mmsb.collapsed.gibbs.sampler(network, K, num.iterations, alpha,
beta.prior, initial = NULL, burnin = NULL, trace = 0L)
lda.cvb0(documents, K, vocab, num.iterations, alpha, eta, trace = 0L)

Arguments
documents 
A list whose length is equal to the number of documents, D. Each element of documents is an integer matrix with two rows. Each column of documents[[i]] (i.e., document i) represents a word occurring in the document. documents[[i]][1, j] is a 0-indexed word identifier for the j-th word in document i. That is, this should be an index - 1 into vocab. documents[[i]][2, j] is an integer specifying the number of times that word appears in the document. 
network 
For 
K 
An integer representing the number of topics in the model. 
vocab 
A character vector specifying the vocabulary words associated with the word indices used in documents. 
num.iterations 
The number of sweeps of Gibbs sampling over the entire corpus to make. 
num.e.iterations 
For 
num.m.iterations 
For 
alpha 
The scalar value of the Dirichlet hyperparameter for topic proportions. 
beta.prior 
For 
eta 
The scalar value of the Dirichlet hyperparameter for topic multinomials. 
initial 
A list of initial topic assignments for words. It should be in the same format as the assignments field of the return value. If this field is NULL, then the sampler will be initialized with random assignments. 
burnin 
A scalar integer indicating the number of Gibbs sweeps to consider as burn-in (i.e., throw away) for 
compute.log.likelihood 
A scalar logical which when 
annotations 
A length D numeric vector of covariates associated with each
document. Only used by 
params 
For 
variance 
For 
logistic 
For 
lambda 
When regularise is 
regularise 
When 
method 
For 
trace 
When 
MaxNWts 
Passed to the multinom function from nnet as the maximum number of weights; the default is 3000. Increasing this value may be necessary when using logistic sLDA with a large number of topics, at the expense of longer run times. 
freeze.topics 
When 
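As a rough illustration of the documents format described above, the following base R sketch builds the two-row integer matrices from tokenized text. The helper to_sparse_documents is hypothetical and not part of this package; lexicalize is the supported way to produce this input.

```r
# Hypothetical helper (not part of the lda package): convert tokenized
# documents into the two-row integer matrix format expected by `documents`.
to_sparse_documents <- function(token_lists, vocab) {
  lapply(token_lists, function(tokens) {
    idx <- match(tokens, vocab)            # 1-based positions in vocab
    idx <- idx[!is.na(idx)]                # drop out-of-vocabulary tokens
    counts <- table(idx)                   # occurrences per word index
    rbind(as.integer(names(counts)) - 1L,  # row 1: 0-indexed word ids
          as.integer(counts))              # row 2: counts per word
  })
}

vocab <- c("apple", "banana", "cherry")
tokens <- list(c("apple", "cherry", "apple"), c("banana"))
documents <- to_sparse_documents(tokens, vocab)

# Recover the words of document 1 by adding 1 to the 0-indexed ids.
vocab[documents[[1]][1, ] + 1L]
```

Note that this sketch stores counts greater than 1 in the second row, which is subject to the caveat in the Note section.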
Value
A fitted model as a list with the following components:
assignments 
A list of length D. Each element of the list, say

topics 
A K x V matrix where each entry indicates the number of times a word (column) was assigned to a topic (row). The column names should correspond to the vocabulary words given in vocab. 
topic_sums 
A length K vector where each entry indicates the total number of times words were assigned to each topic. 
document_sums 
A K x D matrix where each entry is an integer indicating the number of times words in each document (column) were assigned to each topic (row). 
log.likelihoods 
Only for 
document_expects 
This field only exists if burnin is non-NULL. This field is like document_sums but instead of only aggregating counts for the last iteration, this field aggregates counts over all iterations after burnin. 
net.assignments.left 
Only for

net.assignments.right 
Only for

blocks.neg 
Only for

blocks.pos 
Only for

model 
For 
coefs 
For 
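The count matrices described above are related to one another, and a common post-processing step is to smooth and normalize them into proportions. The sketch below uses hand-written stand-in matrices rather than real sampler output, and the smoothing with alpha and eta is a conventional choice assumed here, not something this page prescribes.

```r
# Synthetic stand-ins for a fitted model with K = 2 topics, V = 3 words,
# and D = 2 documents (values chosen by hand, not sampler output).
topics <- matrix(c(3L, 0L, 1L,
                   0L, 2L, 1L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("apple", "banana", "cherry")))
document_sums <- matrix(c(3L, 1L,
                          1L, 2L),
                        nrow = 2, byrow = TRUE)

# topic_sums is the row total of topics; document_sums has the same row totals.
topic_sums <- rowSums(topics)

# Assumed hyperparameter values, for illustration only.
alpha <- 0.1
eta <- 0.1

# Per-document topic proportions: smooth, then normalize each column.
theta <- t(t(document_sums + alpha) / colSums(document_sums + alpha))

# Per-topic word distributions: smooth, then normalize each row.
phi <- (topics + eta) / rowSums(topics + eta)
```

Each column of theta and each row of phi sums to 1.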
Note
WARNING: This function does not compute precisely the correct thing when the count associated with a word in a document is not 1 (this is for speed reasons currently). A workaround when a word appears multiple times is to replicate the word across several columns of a document. This will likely be fixed in a future version.
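The workaround mentioned above can be sketched in base R as follows. The helper expand_counts is hypothetical and not part of this package; it repeats each word's column once per occurrence so that every count equals 1.

```r
# Hypothetical helper (not part of the lda package): rewrite each document
# so that every column has count 1, repeating a word once per occurrence.
expand_counts <- function(documents) {
  lapply(documents, function(doc) {
    word_ids <- rep(doc[1, ], times = doc[2, ])  # repeat ids by their counts
    rbind(as.integer(word_ids),                  # row 1: 0-indexed word ids
          rep(1L, length(word_ids)))             # row 2: all counts are 1
  })
}

# One document: word 0 appears twice, word 2 appears once.
doc <- rbind(c(0L, 2L), c(2L, 1L))
expanded <- expand_counts(list(doc))[[1]]
expanded
```

The expanded document has one column per word occurrence, so the total number of tokens is unchanged.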
Author(s)
Jonathan Chang (slycoder@gmail.com)
References
Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
Airoldi, Edoardo M. and Blei, David M. and Fienberg, Stephen E. and Xing, Eric P. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 2008.
Blei, David M. and McAuliffe, John. Supervised topic models. Advances in Neural Information Processing Systems, 2008.
Griffiths, Thomas L. and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On smoothing and inference for topic models. Uncertainty in Artificial Intelligence, 2009.
See Also
read.documents and lexicalize can be used to generate the input data to these models. See top.topic.words, predictive.distribution, and slda.predict for operations on the fitted models.
Examples
## See demos for the three functions:
## Not run: demo(lda)
## Not run: demo(slda)
## Not run: demo(mmsb)
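Beyond the demos, a minimal end-to-end sketch of an LDA fit might look like the following. It assumes the lda package is installed (the code is skipped otherwise); the three-line corpus, K = 2, 50 iterations, and alpha = eta = 0.1 are toy values chosen only for illustration.

```r
# Toy corpus; real corpora would be much larger.
texts <- c("apple banana apple", "banana cherry", "cherry apple cherry")

fit <- NULL
if (requireNamespace("lda", quietly = TRUE)) {
  # lexicalize builds both the sparse documents and the vocabulary.
  corpus <- lda::lexicalize(texts)

  # Fit LDA with the collapsed Gibbs sampler (toy settings).
  fit <- lda::lda.collapsed.gibbs.sampler(corpus$documents, K = 2,
                                          vocab = corpus$vocab,
                                          num.iterations = 50,
                                          alpha = 0.1, eta = 0.1)

  # Show the two highest-scoring words for each topic.
  print(lda::top.topic.words(fit$topics, num.words = 2, by.score = TRUE))
}
```

Because the sampler is stochastic, the reported words can differ from run to run.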
