Get the Top Words and Documents in Each Topic

Description

These functions take a model fitted using lda.collapsed.gibbs.sampler and return a matrix of the top words (top.topic.words) or the top documents (top.topic.documents) in each topic.

Usage

top.topic.words(topics, num.words = 20, by.score = FALSE)
top.topic.documents(document_sums, num.documents = 20, alpha = 0.1)

Arguments

topics

For top.topic.words, a K \times V matrix where each entry is a numeric proportional to the probability of seeing the word (column) conditioned on topic (row) (this entry is sometimes denoted β_{w,k} in the literature, see details). The column names should correspond to the words in the vocabulary. The topics field from the output of lda.collapsed.gibbs.sampler can be used.

num.words

For top.topic.words, the number of top words to return for each topic.

document_sums

For top.topic.documents, a K \times D matrix where each entry is a numeric proportional to the probability of seeing a topic (row) conditioned on the document (column) (this entry is sometimes denoted θ_{d,k} in the literature, see details). The document_sums field from the output of lda.collapsed.gibbs.sampler can be used.

num.documents

For top.topic.documents, the number of top documents to return for each topic.

by.score

If by.score is FALSE (the default), the words in each topic are ranked by their probability mass β_{w,k}. If by.score is TRUE, the words are ranked by the score β_{w,k} (\log β_{w,k} - (1/K) ∑_{k'} \log β_{w,k'}), which down-weights words that are probable in every topic.

alpha

Value

For top.topic.words, a num.words \times K character matrix where each column contains the top words for that topic.

For top.topic.documents, a num.documents \times K integer matrix where each column contains the top documents for that topic. The entries in the matrix are column-indexed references into document_sums.
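
The shape of the top.topic.documents result can be illustrated with a small base R sketch. This is not the package's implementation: rank.documents.sketch and the toy counts are hypothetical, and the sketch assumes that alpha is added to every count before each document (column) is normalized to topic proportions.

```r
## Hypothetical K x D matrix of counts: rows are topics, columns are documents.
document_sums <- matrix(c(9, 1, 5, 0,
                          1, 9, 5, 2),
                        nrow = 2, byrow = TRUE)

## Illustrative stand-in for top.topic.documents (not the package's code).
rank.documents.sketch <- function(document_sums, num.documents, alpha = 0.1) {
  ## Assumption: alpha smooths the counts before normalizing each document.
  smoothed <- document_sums + alpha
  theta <- sweep(smoothed, 2, colSums(smoothed), "/")
  ## For each topic (row), keep the column indices of the documents with
  ## the highest proportion of that topic.
  apply(theta, 1, function(row) order(row, decreasing = TRUE)[1:num.documents])
}

## A num.documents x K integer matrix: column k lists documents for topic k.
top.docs <- rank.documents.sketch(document_sums, 2)
```

Because the entries are column indices, document_sums[, top.docs[, 1]] selects the columns of the documents ranked highest for the first topic.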

Author(s)

Jonathan Chang (slycoder@gmail.com)

References

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

See Also

lda.collapsed.gibbs.sampler for the format of topics and document_sums.

predictive.distribution demonstrates another use for a fitted topic matrix.

Examples

## From demo(lda).
library(lda)

data(cora.documents)
data(cora.vocab)

K <- 10 ## Number of topics (clusters)
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  ## Number of topics
                                      cora.vocab,
                                      25,  ## Number of Gibbs sampling iterations
                                      0.1,  ## alpha
                                      0.1)  ## eta

## Get the top 5 words in each topic, ranked by score
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

## top.words (the sampler is random, so the exact words vary between runs):
##      [,1]             [,2]        [,3]       [,4]            [,5]      
## [1,] "decision"       "network"   "planning" "learning"      "design"  
## [2,] "learning"       "time"      "visual"   "networks"      "logic"   
## [3,] "tree"           "networks"  "model"    "neural"        "search"  
## [4,] "trees"          "algorithm" "memory"   "system"        "learning"
## [5,] "classification" "data"      "system"   "reinforcement" "systems" 
##      [,6]         [,7]       [,8]           [,9]           [,10]      
## [1,] "learning"   "models"   "belief"       "genetic"      "research" 
## [2,] "search"     "networks" "model"        "search"       "reasoning"
## [3,] "crossover"  "bayesian" "theory"       "optimization" "grant"    
## [4,] "algorithm"  "data"     "distribution" "evolutionary" "science"  
## [5,] "complexity" "hidden"   "markov"       "function"     "supported"
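
The effect of by.score can also be seen without fitting a model. The following base R sketch follows the score given under by.score; it is not the package's implementation. The function rank.words.sketch, the toy counts, and the small constant eps (added so that \log stays finite for zero counts) are all assumptions made for illustration.

```r
## Hypothetical K x V matrix of topic-word counts; "learning" is frequent
## in both topics, while "tree" and "neural" are each tied to one topic.
topics <- matrix(c(8, 1, 10, 1,
                   1, 8, 10, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("tree", "neural", "learning", "search")))

## Illustrative stand-in for top.topic.words (not the package's code).
rank.words.sketch <- function(topics, num.words, by.score = FALSE, eps = 1e-5) {
  ## Normalize each topic (row) to a probability distribution over words.
  probs <- (topics + eps) / rowSums(topics + eps)
  if (by.score) {
    ## score = beta * (log(beta) - mean over topics of log(beta))
    log.probs <- log(probs)
    probs <- probs * sweep(log.probs, 2, colMeans(log.probs))
  }
  ## One column per topic, highest-ranked word first.
  apply(probs, 1, function(row) {
    colnames(topics)[order(row, decreasing = TRUE)[1:num.words]]
  })
}

words.by.prob <- rank.words.sketch(topics, 2)
words.by.score <- rank.words.sketch(topics, 2, by.score = TRUE)

## Ranked by probability, "learning" leads both topics; ranked by score,
## the leading words are "tree" and "neural".
```

In this toy case the score removes a word shared by every topic from the top rank, which is the usual reason to prefer by.score = TRUE when inspecting topics.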