optimal_topic: Find the optimal number of topics from a pool of LDA models


View source: R/optimal_topic.R

Description

Implements a fast chi-square-like test to detect the number of topics, estimated via Latent Dirichlet Allocation, that best describes the corpus.

Usage

optimal_topic(lda_models, weighted_dfm, q = 0.8, alpha = 0.05, do_plot = TRUE)

Arguments

lda_models

A list of ordered LDA models as estimated by LDA. The LDA models must be in ascending order according to the number of topics.

weighted_dfm

A weighted dfm containing word proportions. It is recommended that weighted_dfm has element names consistent with the ones detected by LDA. See 'Details'.

q

Sets the cutoff for important words as a quantile of the expected cumulative probability of word weights. Defaults to 0.80, meaning that the function keeps the words accounting for 80% of the distribution mass and leaves out the remaining 20%.

alpha

The significance level of the test. Defaults to 0.05. See 'Details'.

do_plot

Whether to plot the chi-square statistic as a function of the number of topics. Defaults to TRUE.
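The cutoff controlled by q can be illustrated with a minimal base R sketch. The word weights below are hypothetical, and the rule of keeping the word that crosses the threshold is an assumption; the package's internal implementation may differ.

```r
# Hypothetical word weights (proportions summing to 1); not taken from OpTop.
word_weights <- c(market = 0.40, risk = 0.25, growth = 0.20,
                  debt = 0.08, tax = 0.04, audit = 0.03)

# Sort the weights in decreasing order and accumulate the probability mass.
sorted_weights <- sort(word_weights, decreasing = TRUE)
cumulative_mass <- cumsum(sorted_weights)

# Keep words up to and including the first one at which the cumulative
# mass reaches q (assumed inclusion rule).
q <- 0.80
n_keep <- which(cumulative_mass >= q)[1]
important_words <- names(sorted_weights)[seq_len(n_keep)]
important_words
```

A larger q retains more words and a smaller q fewer, so q trades coverage of the vocabulary against the size of the comparison.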

Details

The function implements a Pearson chi-square statistic that exploits the assumption that the distribution of words is multinomial. The test studies the stability of a K-topic model: the model fully characterizes the corpus if the observed and estimated word vectors are statistically indistinguishable.
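As a rough illustration of the idea, the sketch below computes a textbook Pearson chi-square distance between an observed and an estimated word-proportion vector. Both vectors are hypothetical, and the standardization and aggregation across documents performed by optimal_topic are not reproduced here.

```r
# Hypothetical observed word proportions for one document.
observed <- c(0.40, 0.25, 0.20, 0.15)

# Hypothetical word proportions implied by a fitted K-topic model.
estimated <- c(0.35, 0.30, 0.20, 0.15)

# Textbook Pearson form: squared deviations scaled by the expected values.
pearson_chisq <- sum((observed - estimated)^2 / estimated)
pearson_chisq
```

The closer the estimated vector is to the observed one, the smaller the statistic, which is why a small value indicates a model that describes the corpus well.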

All internal algorithms are implemented in C and C++ to increase speed and efficiency when high-dimensional models and large weighted DFMs need to be analyzed.

To ensure a complete match between the set of LDA models specified through lda_models and the documents in weighted_dfm, we strongly recommend that weighted_dfm have element names indicating the original names of the documents as defined in the corpus. These element names can be extracted with docid(weighted_dfm). If, for any reason, the function LDA fails to estimate the requested k topics over certain documents, optimal_topic handles this by ensuring a perfect match between the documents found in weighted_dfm and the ones contained in lda_models. If weighted_dfm does not contain any meaningful name to be matched with lda_models, for instance if the whole matching vector is FALSE, optimal_topic stops with an error because something is most likely wrong. If optimal_topic finds a few documents in weighted_dfm that are not present in lda_models, it removes them from the input weighted_dfm in order to achieve a perfect match.
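The matching logic described above can be sketched in base R as follows. The document names are hypothetical and the variable names are illustrative only; this is not the package's internal code.

```r
# Hypothetical document names, standing in for docid(weighted_dfm).
dfm_docs <- c("doc1", "doc2", "doc3", "doc4")

# Hypothetical document names retained in the fitted LDA models.
lda_docs <- c("doc1", "doc2", "doc4")

# Logical vector flagging which DFM documents also appear in the models.
matched <- dfm_docs %in% lda_docs

# A vector that is entirely FALSE signals a naming problem, so stop early.
if (!any(matched)) {
  stop("No document names match between weighted_dfm and lda_models.")
}

# Otherwise keep the matched documents and drop the rest.
kept_docs <- dfm_docs[matched]
dropped_docs <- dfm_docs[!matched]
```

Here "doc3" would be dropped from the weighted DFM so that both inputs cover the same documents.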

The parameter alpha sets the significance level of the chi-square test. The optimal model is selected the first time the chi-square statistic reaches a p-value equal to alpha. If the chi-square statistic fails to reach alpha, the model with the minimum chi-square statistic is selected. A higher alpha results in selecting a model with fewer topics. You can force the algorithm to find the minimum chi-square statistic by setting alpha equal to zero.
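The selection rule can be sketched on a hypothetical results table with the documented columns. The comparison direction (p-value at or below alpha) is an assumption inferred from the statements that a higher alpha yields fewer topics and that alpha equal to zero forces the minimum; the helper select_topic is illustrative and not part of the package. A plain data.frame is used so that the sketch runs without data.table.

```r
# Hypothetical results mimicking the columns topic, OpTop and pval.
results <- data.frame(
  topic = c(5L, 10L, 15L, 20L),
  OpTop = c(3.2, 1.9, 1.4, 1.1),
  pval  = c(0.60, 0.20, 0.04, 0.01)
)

# Illustrative helper: pick the first model whose p-value reaches alpha
# (assumed to mean pval <= alpha); otherwise fall back to the model with
# the minimum chi-square statistic.
select_topic <- function(results, alpha) {
  reached <- which(results$pval <= alpha)
  if (length(reached) > 0) {
    results$topic[reached[1]]
  } else {
    results$topic[which.min(results$OpTop)]
  }
}

best_default    <- select_topic(results, alpha = 0.05)
best_high_alpha <- select_topic(results, alpha = 0.25)
best_forced_min <- select_topic(results, alpha = 0)
```

With these made-up numbers, raising alpha selects a model with fewer topics, while alpha equal to zero never triggers the p-value rule and falls back to the minimum statistic.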

Value

A data.table containing the following columns:

topic

An integer giving the number of topics.

OpTop

A numeric giving the standardized chi-square.

pval

A numeric giving the p-value of the test.

Author(s)

Francesco Grossetti francesco.grossetti@unibocconi.it

Craig M. Lewis craig.lewis@owen.vanderbilt.edu

References

Lewis, C. and Grossetti, F. (2019 - forthcoming):
A Statistical Approach for Optimal Topic Model Identification.

See Also

LDA data.table

Examples

## Not run: 
# Find the optimal number of topics from a list of fitted LDA models.
# `lda_list` and `weighted_dfm` must already exist in the workspace.
test1 <- optimal_topic(lda_models = lda_list,
                       weighted_dfm = weighted_dfm,
                       q = 0.80,
                       alpha = 0.05)

## End(Not run)

contefranz/OpTop documentation built on Feb. 14, 2022, 7:04 p.m.