initialize_topic_counts | R Documentation |
Implementing seeded (or guided) LDA models and transfer learning means that
we can't initialize topics with a uniform-random start. This function prepares
data and then calls a C++ function, create_lexicon, that runs a single
Gibbs iteration to populate topic counts (and other objects) used during the
main Gibbs sampling run of fit_lda_c. In the event that you aren't using fancy
seeding or transfer learning, this makes a random initialization by sampling
from Dirichlet distributions parameterized by priors alpha and eta.
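The random initialization described above can be sketched in base R. This is only an illustration of sampling from Dirichlet distributions parameterized by alpha and eta, not the package's actual code: `rdirichlet_sketch` is a hypothetical helper, and the shapes (alpha as a length-k vector, eta as a k-by-tokens matrix, one Dirichlet draw per document and per topic) are assumptions based on the argument descriptions.

```r
# Hypothetical helper (not part of any package): draw n samples from a
# Dirichlet distribution by normalizing independent gamma draws.
rdirichlet_sketch <- function(n, conc) {
  g <- matrix(
    rgamma(n * length(conc), shape = rep(conc, each = n)),
    nrow = n
  )
  g / rowSums(g)
}

set.seed(42)

k <- 3         # number of topics
n_docs <- 4    # assumed number of documents
n_tokens <- 5  # assumed vocabulary size

alpha <- rep(0.5, k)                            # prior for topics over documents
eta <- matrix(0.1, nrow = k, ncol = n_tokens)   # assumed k x tokens prior

# One topic distribution per document (n_docs x k)
theta_initial <- rdirichlet_sketch(n_docs, alpha)

# One token distribution per topic (k x n_tokens), using each row of eta
beta_initial <- t(apply(eta, 1, function(row) rdirichlet_sketch(1, row)))
```

Each row of `theta_initial` and of `beta_initial` sums to one, so either matrix has the form that theta_initial and beta_initial are described as taking.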
initialize_topic_counts(
dtm,
k,
alpha,
eta,
beta_initial = NULL,
theta_initial = NULL,
freeze_topics = FALSE,
threads = 1,
...
)
dtm |
a document term matrix or term co-occurrence matrix of class |
k |
the number of topics |
alpha |
the numeric vector prior for topics over documents as formatted
by |
eta |
the numeric matrix prior for tokens over topics as formatted
by |
beta_initial |
if specified, a numeric matrix for the probability of tokens
in topics. Must be specified for predictions or updates as called by
|
theta_initial |
if specified, a numeric matrix for the probability of
topics in documents. Must be specified for updates as called by
|
freeze_topics |
if |
threads |
number of parallel threads, currently unused |
... |
Additional arguments, currently unused |
Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are
used by fit_lda_c.
docs is a list with one element per document. Each element is a vector
of integers of length sum(dtm[j,]) for the j-th document. The integer
entries are zero-based column indices of the dtm, one entry per token occurrence.

Zd is a list with the same format as docs. The difference is that
the integer values are zero-based topic indices.
Cd is a matrix of integers denoting how many times each topic has
been sampled in each document.

Cv is similar to Cd, but it counts how many times each topic
has been sampled for each token.

Ck is an integer vector denoting how many times each topic has been
sampled overall.

All of Cd, Cv, and Ck should be derivable by summing
over Zd in various ways.
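As a minimal sketch of how these objects relate, the base R code below builds docs from a toy dtm, assigns topics uniformly at random (a stand-in for the Gibbs iteration, not what create_lexicon does), and derives the three count objects from Zd. The toy data, the order of tokens within each document, and the matrix orientations (Cd as documents by topics, Cv as topics by tokens) are assumptions made for illustration.

```r
set.seed(1)

k <- 2L  # number of topics

# Toy document term matrix: 2 documents, 3 tokens
dtm <- matrix(
  c(2L, 0L, 1L,
    0L, 3L, 1L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("a", "b", "c"))
)

# docs: one integer vector per document, holding the zero-based column
# index of every token occurrence, so length(docs[[j]]) == sum(dtm[j, ])
docs <- lapply(seq_len(nrow(dtm)), function(j) {
  rep(seq_len(ncol(dtm)) - 1L, times = dtm[j, ])
})

# Zd: same shape as docs, holding a zero-based topic index per token.
# Uniform random assignment here is only a placeholder.
Zd <- lapply(docs, function(d) {
  sample.int(k, length(d), replace = TRUE) - 1L
})

# Cd: topic counts per document (assumed documents x topics)
Cd <- t(vapply(Zd, function(z) tabulate(z + 1L, nbins = k), integer(k)))

# Cv: topic counts per token (assumed topics x tokens)
Cv <- matrix(0L, nrow = k, ncol = ncol(dtm))
for (j in seq_along(docs)) {
  for (i in seq_along(docs[[j]])) {
    topic <- Zd[[j]][i] + 1L
    token <- docs[[j]][i] + 1L
    Cv[topic, token] <- Cv[topic, token] + 1L
  }
}

# Ck: overall count per topic
Ck <- tabulate(unlist(Zd) + 1L, nbins = k)
```

Under these assumptions, `colSums(Cd)` and `rowSums(Cv)` both equal `Ck`, and `sum(Ck)` equals `sum(dtm)`, which is a quick consistency check on any set of counts.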