read_diagnostics: Read MALLET model-diagnostic results.

read_diagnostics    R Documentation

Read MALLET model-diagnostic results.

Description

Uses the XML package and libxml to parse the MALLET diagnostic output.

Usage

read_diagnostics(xml_file)

Arguments

xml_file

Path to the file holding the XML to be parsed.
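A minimal usage sketch follows. The file name diagnostics.xml is a hypothetical placeholder for a diagnostics file already written to disk, and the call is guarded so that nothing runs unless both the package and the file are present. The list elements and column names are those described on this page.

```r
# Hypothetical path to a MALLET diagnostics file (an assumption, not a default)
xml_file <- "diagnostics.xml"

if (requireNamespace("dfrtopics", quietly = TRUE) && file.exists(xml_file)) {
    diagnostics <- dfrtopics::read_diagnostics(xml_file)

    # Inspect the two data frames in the returned list
    str(diagnostics$topics)
    str(diagnostics$words)

    # List topics from lowest to highest coherence
    topics <- diagnostics$topics
    print(topics[order(topics$coherence),
                 c("topic", "corpus_dist", "coherence")])
}
```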

Value

A list of two data frames of diagnostic information, topics and words. The diagnostics are only sparsely documented; the best reference is the MALLET source code (http://hg-iesl.cs.umass.edu/hg/mallet: see src/cc/mallet/topics/TopicModelDiagnostics.java).

In topics, columns include:

topic

The 1-indexed topic number.

corpus_dist

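The KL divergence of the topic's word distribution from the word distribution of the corpus as a whole. A useful diagnostic of a topic's distinctiveness: the larger the value, the less the topic resembles the overall corpus.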

coherence

The topic coherence measure defined by Mimno et al. (2011), eq. (1): the sum, over pairs of the topic's top words, of the log ratio of co-document frequency to document frequency. The number of top words is set by the n_top_words parameter to write_diagnostics.
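To make the corpus_dist and coherence columns concrete, here is a small base-R sketch that computes both quantities on invented data. It is not MALLET's implementation. The function names kl_divergence and coherence, the toy matrix, and the topic distribution are all hypothetical. The coherence function follows eq. (1) of the paper with its +1 smoothing. MALLET's own smoothing and the direction of its KL divergence may differ, so treat the numbers as illustrative only.

```r
# Toy document-term counts: 4 documents (rows) by 4 word types (columns).
# All values are made up for illustration.
dtm <- matrix(
    c(2, 1, 0, 0,
      1, 1, 0, 1,
      0, 0, 3, 1,
      1, 0, 0, 2),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, c("whale", "ship", "tax", "sea"))
)

# KL divergence of distribution p from q, in nats.
# Terms where p is zero contribute nothing to the sum.
kl_divergence <- function(p, q) {
    keep <- p > 0
    sum(p[keep] * log(p[keep] / q[keep]))
}

# Coherence after Mimno et al. (2011), eq. (1): with top words in rank
# order, sum log((D(v_m, v_l) + 1) / D(v_l)) over all pairs l < m, where
# D counts the documents containing the word or pair of words.
coherence <- function(dtm, top_words) {
    present <- (dtm[, top_words, drop = FALSE] > 0) * 1
    codoc <- crossprod(present)        # codoc[m, l] = D(v_m, v_l)
    doc_freq <- unname(diag(codoc))    # D(v_l)
    total <- 0
    for (m in seq_along(top_words)[-1]) {
        for (l in seq_len(m - 1)) {
            total <- total + log((codoc[m, l] + 1) / doc_freq[l])
        }
    }
    unname(total)
}

# Hypothetical topic-word distribution and the empirical corpus distribution
topic_p <- c(whale = 0.5, ship = 0.3, tax = 0, sea = 0.2)
corpus_p <- colSums(dtm) / sum(dtm)

topic_corpus_dist <- kl_divergence(topic_p, corpus_p)
topic_coherence <- coherence(dtm, c("whale", "sea", "ship"))  # log(2/3), about -0.405
```

In this toy corpus the only pair that lowers the score is ship and sea, which share one document while sea appears in three. Coherence is at most a small positive number and grows more negative as top words co-occur less often.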

The function attempts to coerce numeric values, which XML extracts as strings, into numbers.
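The coercion step can be sketched in base R as follows. This illustrates the general technique, not the package's actual code. The helper coerce_if_numeric, the label column, and all values are hypothetical.

```r
# Attribute values as an XML parser hands them back: every column is a
# string. The label column is a hypothetical non-numeric field.
raw <- data.frame(
    topic = c("1", "2"),
    corpus_dist = c("2.31", "1.87"),
    coherence = c("-410.5", "-388.2"),
    label = c("whale", "tax"),
    stringsAsFactors = FALSE
)

# Convert a character vector to numeric only if every non-missing entry
# parses as a number; otherwise return it unchanged.
coerce_if_numeric <- function(x) {
    converted <- suppressWarnings(as.numeric(x))
    if (all(is.na(converted) == is.na(x))) converted else x
}

topics <- as.data.frame(lapply(raw, coerce_if_numeric),
                        stringsAsFactors = FALSE)
str(topics)
```

Checking that conversion introduces no new missing values keeps text columns intact instead of silently turning them into NA.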

References

David Mimno et al. Optimizing Semantic Coherence in Topic Models. EMNLP 2011. http://www.cs.princeton.edu/~mimno/papers/mimno-semantic-emnlp.pdf.

See Also

write_diagnostics


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.