top_docs | R Documentation |
Extracts a data frame of the documents that score highest in each topic. Documents are represented as numeric indices. Scoring is based on the document-topic matrix, but care is needed when a document has more of its words assigned to a given topic, yet a smaller proportion of that topic, than some other, shorter document. By default, all documents are normalized to length 1 before ranking.
top_docs(m, n, ...)
m | |
n | number of top documents to extract |
weighting | a function to transform the document-topic matrix. By default |
Note also that a topic may reach its maximum proportion in a document even if
that document has a yet larger proportion of another topic. To adjust the
scoring, pass a function to transform the document-topic matrix in the
weighting parameter. If you wish to use raw weights rather than
proportions to rank documents, set weighting=identity. Raw weights
give longer documents an unfair advantage, whereas proportions often give
shorter documents an advantage (because short documents tend to be dominated
by single topics in LDA).
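The trade-off between raw weights and proportions can be seen on a toy matrix. The following is a minimal base R sketch, not the package's implementation: rank_top_docs and row_normalize are hypothetical helpers introduced only for illustration, the matrix values are made up, and plain row normalization is assumed here, which may differ from the package's default weighting.

```r
# Toy document-topic matrix: rows are documents, columns are topics.
# Entries are counts of words assigned to each topic (made-up values).
dtm <- matrix(
    c(90, 10,    # doc 1: long, mostly topic 1
       8,  0,    # doc 2: short, entirely topic 1
      40, 60),   # doc 3: long, mixed
    ncol = 2, byrow = TRUE
)

# Hypothetical helper, not part of the package: apply a weighting function,
# then return the indices of the n highest-scoring documents for each topic.
rank_top_docs <- function(dtm, n, weighting = identity) {
    w <- weighting(dtm)
    lapply(seq_len(ncol(w)), function(k) {
        head(order(w[, k], decreasing = TRUE), n)
    })
}

# Hypothetical helper: divide each row by its sum so that every document's
# topic weights sum to 1 (plain normalization, no smoothing).
row_normalize <- function(x) x / rowSums(x)

raw_rank <- rank_top_docs(dtm, 2, identity)
prop_rank <- rank_top_docs(dtm, 2, row_normalize)

raw_rank[[1]]   # documents 1 and 3: the long documents lead on raw counts
prop_rank[[1]]  # documents 2 and 1: the short, single-topic document leads
```

With raw counts the short document 2 never reaches the top of topic 1, while with proportions it ranks first despite contributing only eight words.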
TODO: alternative scoring methods.
a data frame with three columns, topic, doc, the
numerical index of the document in doc_ids(m), and
weight, the weight used in ranking (topic proportion, raw score, ...)
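Because the doc column holds positions rather than identifiers, a result usually has to be joined back to the document IDs. The sketch below shows the idea on stand-in data: ids plays the role of doc_ids(m) and top plays the role of a returned data frame, and every value in both is invented for illustration.

```r
# Stand-ins for doc_ids(m) and for a returned data frame; values are made up.
ids <- c("doc-a", "doc-b", "doc-c")
top <- data.frame(
    topic = c(1, 1, 2, 2),
    doc = c(2, 1, 3, 1),
    weight = c(1.0, 0.9, 0.6, 0.1)
)

# The doc column indexes into the ID vector, so subsetting recovers the IDs.
top$id <- ids[top$doc]
```

The same indexing pattern applies to any per-document vector or metadata table that is in the same order as the document IDs.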
doc_topics, dt_smooth_normalize
## Not run: 
# obtain citations for 3 documents with highest proportions of topic 4
top_docs(m, 3) %>%
    filter(topic == 4) %>%
    select(-topic) %>%
    mutate(citation=cite_articles(metadata(m)[doc, ]))

## End(Not run)