top_docs: Top-ranked documents in topics
In agoldst/dfrtopics: Tools for exploring topic models of text

top_docs

R Documentation

Description

Extracts a data frame of the documents that score highest in each topic. Documents are represented as numeric indices. Scoring is based on the document-topic matrix, but care is needed when one document has more of its words assigned to a given topic yet a smaller proportion of that topic than some other, shorter document. By default, all documents are normalized to length 1 before ranking.
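The tension between raw counts and proportions can be seen in a toy example. The sketch below uses base R only, with an invented two-document matrix rather than output from a real model, and a plain row normalization that leaves aside whatever smoothing the default `dt_smooth_normalize(m)` may apply.

```r
# Toy document-topic counts: rows are documents, columns are topics.
# These numbers are invented for illustration only.
dtm <- matrix(
    c(40, 160,   # doc1: long, 40 of 200 words in topic1
      15,   5),  # doc2: short, 15 of 20 words in topic1
    nrow = 2, byrow = TRUE,
    dimnames = list(c("doc1", "doc2"), c("topic1", "topic2"))
)

# Ranking topic1 by raw counts puts the long document first
raw_rank <- order(dtm[, "topic1"], decreasing = TRUE)

# Ranking topic1 by proportions (each row scaled to sum to 1)
# puts the short document first
props <- dtm / rowSums(dtm)
prop_rank <- order(props[, "topic1"], decreasing = TRUE)

rownames(dtm)[raw_rank]   # "doc1" "doc2"
rownames(dtm)[prop_rank]  # "doc2" "doc1"
```

The same two documents swap places depending on the weighting, which is why the choice of weighting is exposed as a parameter.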

Usage

top_docs(m, n, weighting = dt_smooth_normalize(m))

Arguments

`m`	`mallet_model` object
`n`	number of top documents to extract
`weighting`	a function to transform the document-topic matrix. Defaults to `dt_smooth_normalize(m)`, a normalizing weighting function

Details

A topic may reach its maximum proportion in a document even if that document has a still larger proportion of another topic. To adjust the scoring, pass a function that transforms the document-topic matrix as the `weighting` parameter. To rank documents by raw weights rather than proportions, set `weighting=identity`. Raw weights give longer documents an unfair advantage, whereas proportions often give shorter documents an advantage (because short documents tend to be dominated by single topics in LDA).
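As a rough illustration of both points, the base R sketch below defines a hypothetical weighting function, `row_normalize` (not part of dfrtopics), and applies it to an invented matrix. Since `weighting` is described as a function that transforms the document-topic matrix, a function of this shape should be usable there, though that is an assumption rather than something tested against the package.

```r
# Invented counts: 3 documents (rows) by 2 topics (columns)
dtm <- matrix(
    c(30, 70,
      10, 90,
       2, 18),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("doc", 1:3), c("topic1", "topic2"))
)

# Hypothetical weighting function: plain row normalization, no smoothing
row_normalize <- function(x) x / rowSums(x)

w <- row_normalize(dtm)

# topic1 reaches its highest proportion (0.3) in doc1 ...
top_for_topic1 <- names(which.max(w[, "topic1"]))
# ... yet doc1 itself is dominated by topic2 (0.7)
dominant_in_doc1 <- names(which.max(w["doc1", ]))

# identity returns the matrix unchanged, so ranking would use raw weights
raw <- identity(dtm)
```

Here `top_for_topic1` is `"doc1"` while `dominant_in_doc1` is `"topic2"`: the top-ranked document for a topic need not be a document "about" that topic.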

TODO: alternative scoring methods.

Value

A data frame with three columns: `topic`; `doc`, the numerical index of the document in `doc_ids(m)`; and `weight`, the weight used in ranking (topic proportion, raw score, ...).
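To make the shape of the result concrete, here is a minimal base R sketch that builds a data frame with the same three columns from an invented weight matrix. The helper `top_n_per_topic` is hypothetical and is not the package's implementation; it only mimics the layout described above.

```r
# Invented weights: 4 documents (rows) by 2 topics (columns)
w <- matrix(
    c(0.9, 0.1,
      0.2, 0.8,
      0.6, 0.4,
      0.5, 0.5),
    nrow = 4, byrow = TRUE
)

# Hypothetical helper, not part of dfrtopics:
# for each topic, keep the n documents with the largest weight
top_n_per_topic <- function(w, n) {
    rows <- lapply(seq_len(ncol(w)), function(k) {
        idx <- order(w[, k], decreasing = TRUE)[seq_len(n)]
        data.frame(topic = k, doc = idx, weight = w[idx, k])
    })
    do.call(rbind, rows)
}

res <- top_n_per_topic(w, 2)
res
```

Each topic contributes `n` rows, so the result has `n` times the number of topics rows in total, and `doc` holds row indices into the weight matrix.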

Examples

## Not run: 
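library(dplyr)      # for %>%, filter(), select(), mutate()
library(dfrtopics)

# m is assumed to be a mallet_model with metadata loaded
# obtain citations for 3 documents with highest proportions of topic 4
top_docs(m, 3) %>%
    filter(topic == 4) %>%
    select(-topic) %>%
    mutate(citation=cite_articles(metadata(m)[doc, ]))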

## End(Not run)

agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.