Implements a fast chi-square-like test to detect the number of topics, estimated via Latent Dirichlet Allocation, that best describes the corpus.
optimal_topic(lda_models, weighted_dfm, q = 0.8, alpha = 0.05, do_plot = TRUE)
lda_models | A list of ordered LDA models as estimated by
weighted_dfm | A weighted
q | The cutoff for important words, expressed as a quantile of the expected cumulative probability of word weights. Defaults to 0.80, meaning that the function retains the words covering 80% of the distribution mass and leaves out the remaining 20%.
alpha | The confidence level for test acceptance. Defaults to 0.05. See 'Details'.
do_plot | Whether to plot the chi-square statistic as a function of the number of topics. Defaults to TRUE.
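One way to picture the q cutoff is as a cumulative-mass rule over sorted word weights. The sketch below is only an illustration under that assumption; the word weights are made up, and the package's internal computation may differ.

```r
# Hypothetical word weights (probabilities) for one topic; not package output.
word_weights <- c(market = 0.45, price = 0.25, risk = 0.15,
                  firm = 0.08, board = 0.04, audit = 0.03)
q <- 0.80

# Sort words from most to least probable and accumulate their mass.
sorted_weights <- sort(word_weights, decreasing = TRUE)
cumulative_mass <- cumsum(sorted_weights)

# Keep the smallest leading set of words whose cumulative mass reaches q.
n_keep <- which(cumulative_mass >= q)[1]
important_words <- names(sorted_weights)[seq_len(n_keep)]
important_words
```

With these weights, the first three words already cover 85% of the mass, so the remaining words are left out.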
The function implements a Pearson chi-square statistic that exploits the assumption that words follow a multinomial distribution. The test assesses the stability of a K-topic model, which fully characterizes the corpus if the observed and estimated word vectors are statistically indistinguishable.
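As a rough illustration of the idea, the textbook Pearson chi-square statistic compares observed word counts with the counts expected under a multinomial model. The sketch below uses made-up counts and probabilities and base R only; it is not the standardized statistic computed by optimal_topic.

```r
# Hypothetical observed word counts and model-implied word probabilities.
observed_counts <- c(market = 48, price = 27, risk = 15, firm = 10)
estimated_probs <- c(market = 0.45, price = 0.30, risk = 0.15, firm = 0.10)

# Expected counts under a multinomial with the estimated probabilities.
n_words <- sum(observed_counts)
expected_counts <- n_words * estimated_probs

# Pearson chi-square: squared deviations scaled by the expected counts.
chi_square <- sum((observed_counts - expected_counts)^2 / expected_counts)

# Cross-check against base R's goodness-of-fit test.
reference <- chisq.test(observed_counts, p = estimated_probs)
all.equal(chi_square, unname(reference$statistic))
```

A small statistic indicates that the observed and estimated word vectors are close, which is the situation the test looks for.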
All internal algorithms are implemented in C and C++ to increase speed and efficiency when high-dimensional models, together with large weighted DFMs, need to be analyzed.
To ensure a complete match between the set of LDA models specified through lda_models and the corresponding weighted_dfm, we strongly recommend that weighted_dfm have element names indicating the original names of the documents as defined in the corpus. These element names can be extracted with docid(weighted_dfm).
If, for any reason, the function LDA fails to estimate the requested k topics over certain documents, then optimal_topic takes care of that by ensuring that there is a perfect match between the documents found in weighted_dfm and the ones contained in lda_models.
If weighted_dfm does not contain any meaningful name to be matched with lda_models, for instance if the whole matching vector is FALSE, then optimal_topic stops with an error because something is most likely wrong with the inputs. If optimal_topic finds a few documents that are not present in lda_models, then it removes them from the input weighted_dfm in order to achieve a perfect match.
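The matching behaviour described above can be sketched with plain character vectors. This is a minimal illustration, not the package's internal code; the document names are hypothetical and stand in for what docid(weighted_dfm) and the LDA models would provide.

```r
# Hypothetical document names; in practice these would come from
# docid(weighted_dfm) and from the documents retained by the LDA models.
dfm_docs <- c("doc1", "doc2", "doc3", "doc4")
lda_docs <- c("doc1", "doc2", "doc4")

# Logical vector flagging which DFM documents also appear in the LDA models.
matched <- dfm_docs %in% lda_docs

# No match at all suggests mislabelled inputs, so stop with an error.
if (!any(matched)) {
  stop("No document in weighted_dfm matches the documents in lda_models.")
}

# Otherwise keep only the matched documents; for a matrix-like DFM this
# would correspond to subsetting rows with the same logical vector.
kept_docs <- dfm_docs[matched]
dropped_docs <- dfm_docs[!matched]
```

Here "doc3" is dropped because it has no counterpart among the LDA documents, while the other three are kept.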
The parameter alpha controls the confidence of the chi-square test. The optimal model is selected the first time the chi-square statistic reaches a p-value equal to alpha. If the chi-square statistic fails to reach alpha, the model with the minimum chi-square statistic is selected. A higher alpha results in selecting a model with fewer topics. You can force the algorithm to find the minimum chi-square statistic by setting alpha equal to zero.
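The selection rule can be sketched as follows. This is an illustration under stated assumptions: the column names (topic, chisq, pval) and the values are hypothetical, and "reaching alpha" is read as the p-value falling to alpha or below, which is the reading consistent with a higher alpha giving fewer topics and with alpha equal to zero forcing the minimum.

```r
# Hypothetical test results, ordered by number of topics.
results <- data.frame(
  topic = c(5, 10, 15, 20),
  chisq = c(9.1, 4.2, 1.3, 1.8),
  pval  = c(0.90, 0.40, 0.04, 0.07)
)

select_topic <- function(results, alpha) {
  # Assumption: the test "reaches alpha" when pval drops to alpha or below.
  reached <- which(results$pval <= alpha)
  if (length(reached) > 0) {
    # First model, in order, that reaches alpha.
    results$topic[reached[1]]
  } else {
    # Fallback: the model with the minimum chi-square statistic.
    results$topic[which.min(results$chisq)]
  }
}

select_topic(results, alpha = 0.05)
select_topic(results, alpha = 0.50)
select_topic(results, alpha = 0)
```

With these made-up values, alpha = 0.05 selects 15 topics, the larger alpha = 0.50 selects 10, and alpha = 0 falls back to the minimum chi-square, which is again the 15-topic model.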
A data.table containing the following columns:
|
An integer giving the number of topics. |
|
A numeric giving the standardized chi-square. |
|
A numeric giving the p-value of the test. |
Francesco Grossetti francesco.grossetti@unibocconi.it
Craig M. Lewis craig.lewis@owen.vanderbilt.edu
Lewis, C. and Grossetti, F. (2019 - forthcoming): A Statistical Approach for Optimal Topic Model Identification.
## Not run:
# Detect the optimal number of topics from a list of LDA models
test1 <- optimal_topic(lda_models = lda_list,
                       weighted_dfm = weighted_dfm,
                       q = 0.80,
                       alpha = 0.05)
## End(Not run)