optimal_topic: Find the optimal number of topics from a pool of LDA models
In contefranz/OpTop: Optimal topic specification for latent dirichlet allocation models


View source: R/optimal_topic.R

Implements a fast chi-square like test to detect the number of topics estimated via Latent Dirichlet Allocation that best describes the corpus.

optimal_topic(lda_models, weighted_dfm, q = 0.8, alpha = 0.05, do_plot = TRUE)

`lda_models`	A list of ordered LDA models as estimated by `LDA`. The LDA models must be in ascending order according to the number of topics.
`weighted_dfm`	A weighted `dfm` containing word proportions. It is recommended that `weighted_dfm` has element names consistent with the ones detected by `LDA`. See 'Details'.
`q`	Cutoff for important words, expressed as a quantile of the expected cumulative probability of word weights. Defaults to 0.80, meaning that the function covers 80% of the distribution mass and leaves out the remaining 20%.
`alpha`	The significance level of the test. Defaults to 0.05. See 'Details'.
`do_plot`	Plot the chi-square statistic as a function of the number of topics. Defaults to `TRUE`.
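
One way to picture the `q` cutoff is as a cumulative-mass rule over sorted word weights. The base R sketch below is only an illustration under that assumption: the vector `word_weights` and its values are hypothetical, and the package's internal C/C++ routine may compute the cutoff differently.

```r
# Hypothetical word weights (proportions summing to 1); not package data.
word_weights <- c(alpha = 0.40, beta = 0.25, gamma = 0.15, delta = 0.10,
                  epsilon = 0.06, zeta = 0.04)
q <- 0.80

# Sort from heaviest to lightest and accumulate the probability mass.
sorted_weights <- sort(word_weights, decreasing = TRUE)
cumulative_mass <- cumsum(sorted_weights)

# Keep the smallest set of top words whose cumulative mass reaches q.
# The small tolerance guards against floating-point rounding in cumsum().
n_keep <- which(cumulative_mass >= q - 1e-12)[1]
important_words <- names(sorted_weights)[seq_len(n_keep)]
```

With these made-up weights, the three heaviest words already account for 80% of the mass, so the remaining three would be left out.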

The function implements a Pearson chi-square statistic that exploits the assumption that the distribution of words is multinomial. The test studies the stability of a K-topic model, which fully characterizes the corpus if the observed and estimated word vectors are statistically indistinguishable.
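
For readers unfamiliar with the underlying idea, the following base R sketch computes a textbook Pearson chi-square statistic for multinomial counts. It is not the package's implementation (which reports a standardized statistic and runs in C/C++); the counts and probabilities are hypothetical values chosen for illustration.

```r
# Hypothetical observed word counts and model-implied word probabilities.
observed_counts <- c(30, 50, 20)
expected_prob   <- c(0.25, 0.50, 0.25)

n <- sum(observed_counts)
expected_counts <- n * expected_prob

# Pearson statistic: sum over words of (O - E)^2 / E.
chi_sq <- sum((observed_counts - expected_counts)^2 / expected_counts)

# Upper-tail p-value with (number of categories - 1) degrees of freedom.
df <- length(observed_counts) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check against base R's goodness-of-fit test.
reference <- chisq.test(observed_counts, p = expected_prob)
```

A small statistic means the observed and model-implied word distributions are close, which is the sense in which a K-topic model "characterizes" the corpus.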

All internal algorithms are implemented in C and C++ to increase speed and efficiency when high-dimensional models, together with large weighted DFMs, need to be analyzed.

To ensure a complete match between the set of LDA models specified through lda_models and the documents in weighted_dfm, we strongly recommend that weighted_dfm have element names indicating the original names of the documents as defined in the corpus. These element names can be extracted with docid(weighted_dfm). If, for any reason, the function LDA fails to estimate the requested k topics over certain documents, optimal_topic handles this by ensuring a perfect match between the documents found in weighted_dfm and the ones contained in lda_models. If weighted_dfm does not contain any meaningful name to be matched with lda_models, for instance if the whole matching vector is FALSE, optimal_topic stops with an error because most likely something is wrong with the input. If optimal_topic finds a few documents in weighted_dfm that are not present in lda_models, it removes them from the input weighted_dfm in order to achieve a perfect match.
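
The matching behaviour described above can be sketched in base R on plain character vectors. This is a simplified illustration, not the package's code: `dfm_docs` and `lda_docs` are hypothetical stand-ins for the document names that would come from the weighted DFM and from the fitted LDA models.

```r
# Hypothetical document identifiers; in practice these would come from
# docid(weighted_dfm) and from the documents retained by the LDA fits.
dfm_docs <- c("doc1", "doc2", "doc3", "doc4")
lda_docs <- c("doc1", "doc2", "doc4")

# Logical vector marking which DFM documents also appear in the LDA models.
matched <- dfm_docs %in% lda_docs

# No overlap at all suggests mismatched names, so stop with an error.
if (!any(matched)) {
  stop("No document names in weighted_dfm match those in lda_models.")
}

# Otherwise keep the matched documents and drop the rest.
kept_docs    <- dfm_docs[matched]
dropped_docs <- dfm_docs[!matched]
```

Here `doc3` is dropped because it has no counterpart among the LDA documents, while an entirely FALSE `matched` vector would trigger the error branch.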

The parameter alpha controls the significance level of the chi-square test. The optimal model is selected the first time the chi-square statistic reaches a p-value equal to alpha. In the event that the chi-square statistic fails to reach alpha, the model with the minimum chi-square statistic is selected. A higher alpha results in selecting a model with fewer topics. You can force the algorithm to find the minimum chi-square statistic by setting alpha equal to zero.
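
The selection rule can be sketched as follows. Two things here are assumptions rather than documented behaviour: the helper `select_topic` is hypothetical, and "reaches alpha" is read as the p-value falling to alpha or below, which is the reading consistent with alpha = 0 forcing the minimum-statistic fallback. The table values are invented to mimic the columns described under the return value.

```r
# Hypothetical results table mimicking the documented columns.
results <- data.frame(
  topic = c(5L, 10L, 15L, 20L),
  OpTop = c(3.2, 1.9, 1.4, 1.1),
  pval  = c(0.60, 0.20, 0.04, 0.02)
)

# Assumed reading of "reaches alpha": the first model with pval <= alpha.
select_topic <- function(results, alpha = 0.05) {
  hit <- which(results$pval <= alpha)
  if (length(hit) > 0) {
    return(results$topic[hit[1]])
  }
  # Fallback: no model reached alpha, so take the minimum statistic.
  results$topic[which.min(results$OpTop)]
}

chosen_default <- select_topic(results, alpha = 0.05)
chosen_forced  <- select_topic(results, alpha = 0)
```

Under this reading, raising alpha (for example to 0.25) lets an earlier model qualify, which matches the statement that a higher alpha selects fewer topics.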

A data.table containing the following columns:

`topic`	An integer giving the number of topics.
`OpTop`	A numeric giving the standardized chi-square.
`pval`	A numeric giving the p-value of the test.

Francesco Grossetti francesco.grossetti@unibocconi.it

Craig M. Lewis craig.lewis@owen.vanderbilt.edu

Lewis, C. and Grossetti, F. (2019 - forthcoming):
A Statistical Approach for Optimal Topic Model Identification.

LDA data.table

## Not run: 
# Find the optimal number of topics from a list of fitted LDA models
# (assumes lda_list and weighted_dfm already exist in the workspace)
test1 = optimal_topic( lda_models = lda_list,
                        weighted_dfm = weighted_dfm,
                        q = 0.80,
                        alpha = 0.05 )

## End(Not run)