agg_document_stability: Compute aggregate document stability and F-test


View source: R/agg_document_stability.R

Description

Detects informative and uninformative components to compute aggregate document stability. Performs a chi-square test to evaluate document stability and computes an F-test to further evaluate deviation from the optimal model.

Usage

agg_document_stability(
  lda_models,
  weighted_dfm,
  optimal_model,
  q = 0.8,
  alpha = 0.05,
  smoothed = TRUE,
  do_plot = TRUE
)

Arguments

lda_models

A list of ordered LDA models as estimated by LDA. The LDA models must be in ascending order according to the number of topics.

weighted_dfm

A weighted dfm containing word proportions. It is recommended that weighted_dfm include the corresponding internal document variable, which can be accessed with docid. See ?optimal_topic for more details.

optimal_model

A number corresponding to the optimal topic model.

q

Sets a cutoff for important words as the quantile of the expected cumulative probability of word weights. Defaults to 0.80, meaning that the function keeps words until 80% of the distribution mass is reached and leaves out the remaining 20%.

alpha

Alpha level used to identify informative words from the cumulative distribution function over the cosine similarities in the topic word weights matrix. Defaults to 0.05.

smoothed

A logical controlling whether the test is performed on each document for each LDA model or on the smoothed chi-square statistic. The smoothed statistic is the aggregated version, which gives the overall behavior across all documents in the corpus. Defaults to TRUE.

do_plot

A logical. If TRUE, plots the chi-square statistic and the F-statistic as functions of the number of topics. Defaults to TRUE.
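To make the q cutoff more concrete, the following is a minimal sketch of one way a cumulative-mass cutoff over word weights can work. It is not taken from the package source: the helper select_by_mass and the word weights are hypothetical, and the assumption is that words are ranked by weight and kept until their cumulative share first reaches q.

```r
# Hypothetical word weights for a single topic (illustrative values only).
word_weights <- c(profit = 0.40, revenue = 0.25, risk = 0.15,
                  market = 0.10, board = 0.06, audit = 0.04)

# Hypothetical helper (not part of OpTop): keep the heaviest words until
# their cumulative share of the total mass first reaches q.
select_by_mass <- function(weights, q = 0.8) {
  sorted <- sort(weights, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  # Small tolerance guards against floating-point error at the boundary.
  n_keep <- which(cum_share >= q - 1e-12)[1]
  names(sorted)[seq_len(n_keep)]
}

# With q = 0.8 the three heaviest words cover 80% of the mass;
# the remaining words fall in the discarded 20%.
important_words <- select_by_mass(word_weights, q = 0.8)
print(important_words)
```

Under this assumption, lowering q keeps fewer words, while raising it toward 1 keeps almost the whole vocabulary.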

Value

A data.table containing the following columns:

topic

An integer giving the number of topics.

id_doc

An integer document id as given in the original corpus.

chisq_inform_std

A numeric giving the standardized chi-square statistic for the informative component.

chisq_uninform_std

A numeric giving the standardized chi-square statistic for the uninformative component.

pval_inform

A numeric giving the p-value of the chi-square test over the informative component.

pval_uninform

A numeric giving the p-value of the chi-square test over the uninformative component.

Fstat

A numeric giving the standardized F statistic of the ratio chisq_inform_std/chisq_uninform_std.

pval_Fstat

A numeric giving the p-value of the F test.
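The relation between the chi-square columns and Fstat can be illustrated with base R. This is a sketch under stated assumptions, not the package's implementation: it assumes that "standardized" means each chi-square statistic is divided by its degrees of freedom and that the p-values are upper-tail probabilities. The statistics and degrees of freedom below are made up.

```r
# Made-up chi-square statistics and degrees of freedom (assumptions,
# not values produced by agg_document_stability).
chisq_inform   <- 30; df_inform   <- 20
chisq_uninform <- 45; df_uninform <- 40

# Upper-tail p-values of the two chi-square tests.
pval_inform   <- pchisq(chisq_inform,   df = df_inform,   lower.tail = FALSE)
pval_uninform <- pchisq(chisq_uninform, df = df_uninform, lower.tail = FALSE)

# Standardize each statistic by its degrees of freedom (assumed meaning
# of "standardized"), then take the ratio to obtain an F statistic.
chisq_inform_std   <- chisq_inform / df_inform
chisq_uninform_std <- chisq_uninform / df_uninform
f_stat <- chisq_inform_std / chisq_uninform_std

# Upper-tail p-value of the F test with (df_inform, df_uninform) degrees
# of freedom.
pval_f <- pf(f_stat, df1 = df_inform, df2 = df_uninform, lower.tail = FALSE)

print(c(Fstat = f_stat, pval_Fstat = pval_f))
```

The ratio of two independent chi-square variables, each divided by its degrees of freedom, follows an F distribution, which is why pf is the natural companion to pchisq here.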

Author(s)

Francesco Grossetti francesco.grossetti@unibocconi.it

Craig M. Lewis craig.lewis@owen.vanderbilt.edu

References

Lewis, C. and Grossetti, F. (2019 - forthcoming):
A Statistical Approach for Optimal Topic Model Identification.

See Also

LDA, data.table

Examples

## Not run: 
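# lda_list: list of LDA models in ascending order of number of topics.
# weighted_dfm: weighted dfm containing word proportions.
# Note: the usage lists optimal_model without a default, so it may need
# to be supplied as well.
test4 <- agg_document_stability(lda_models = lda_list,
                                weighted_dfm = weighted_dfm,
                                smoothed = TRUE, do_plot = TRUE)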

## End(Not run)

contefranz/OpTop documentation built on Feb. 14, 2022, 7:04 p.m.