getProbs: Compute topic-word and document-topic probability...


Description

This function assumes that the ordering of word.id, doc.id, and topic.id matters. That is, the first element of word.id corresponds to the first element of doc.id, which corresponds to the first element of topic.id. Similarly, the second element of word.id corresponds to the second element of doc.id, which corresponds to the second element of topic.id (and so on). The ordering of the elements of vocab is also assumed to correspond to the values of word.id, so that the first element of vocab is the token with word.id equal to 1, the second element of vocab is the token with word.id equal to 2, etc.
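The alignment requirement can be illustrated with a small made-up corpus. The values below are hypothetical and are not taken from the package data; the sketch only shows how the i-th entries of the three id vectors describe the same token occurrence and how vocab is indexed by word.id.

```r
# Toy corpus: 2 documents, 3 vocabulary terms, 2 topics (all values made up).
vocab    <- c("apple", "bank", "river")   # vocab[1] has word.id 1, and so on
word.id  <- c(1, 2, 2, 3, 3, 1)           # term of each token occurrence
doc.id   <- c(1, 1, 1, 2, 2, 2)           # document of each token occurrence
topic.id <- c(1, 2, 2, 2, 1, 1)           # topic of each token occurrence

# The three id vectors must have equal length, and every word.id
# must point at an existing element of vocab.
stopifnot(length(word.id) == length(doc.id),
          length(doc.id) == length(topic.id),
          max(word.id) <= length(vocab))

# Row i of this data frame describes the i-th token occurrence.
tokens <- data.frame(doc = doc.id, word = vocab[word.id], topic = topic.id)
print(tokens)
```

Reordering one of the vectors without reordering the other two would silently attach tokens to the wrong documents or topics, so a check like the one above is a cheap safeguard before calling getProbs.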

Usage

getProbs(word.id = numeric(), doc.id = numeric(), topic.id = numeric(),
  vocab = character(), alpha = 0.01, beta = 0.01,
  sort.topics = c("None", "byDocs", "byTerms"), K = integer())

Arguments

word.id

a numeric vector with the token id of each token occurrence in the data.

doc.id

a numeric vector containing the document id number of each token occurrence in the data.

topic.id

a numeric vector containing the topic assigned to each token occurrence in the data, with a unique value for each topic.

vocab

a character vector of the unique words included in the corpus. The length of this vector should match the max value of word.id.

alpha

Dirichlet hyperparameter. See fitLDA.

beta

Dirichlet hyperparameter. See fitLDA.

sort.topics

Sorting criterion for topics. Supported values are "None", "byDocs" (sort topics by the number of documents for which they are the most probable), and "byTerms" (sort topics by the number of terms within each topic).

Value

A list of two matrices and one vector. The first matrix, phi.hat, contains the distribution over tokens for each topic, where the rows correspond to topics. The second matrix, theta.hat, contains the distribution over topics for each document, where the rows correspond to documents. The vector, topic.id, contains the sampled topics from the LDA fit, with topic indices re-labeled in decreasing order of frequency, as measured by the criterion given in sort.topics.
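As a rough illustration of what matrices of this shape look like, the sketch below computes the usual Dirichlet-smoothed point estimates from token-level counts on made-up data. This is an assumption about the general technique, not a copy of the package's implementation; the names n.kw, n.dk, phi.sketch, and theta.sketch are hypothetical.

```r
# Toy inputs (made up), aligned position by position.
word.id  <- c(1, 2, 2, 3, 3, 1)
doc.id   <- c(1, 1, 1, 2, 2, 2)
topic.id <- c(1, 2, 2, 2, 1, 1)
W <- 3; D <- 2; K <- 2          # vocabulary size, documents, topics
alpha <- 0.01; beta <- 0.01     # Dirichlet hyperparameters

# Count matrices: topics x terms, and documents x topics.
n.kw <- unclass(table(factor(topic.id, levels = 1:K),
                      factor(word.id,  levels = 1:W)))
n.dk <- unclass(table(factor(doc.id,   levels = 1:D),
                      factor(topic.id, levels = 1:K)))

# Add the pseudo-counts, then divide each row by its total
# so that every row is a probability distribution.
phi.sketch   <- (n.kw + beta)  / rowSums(n.kw + beta)
theta.sketch <- (n.dk + alpha) / rowSums(n.dk + alpha)

print(round(phi.sketch, 3))    # rows are topics, columns are terms
print(round(theta.sketch, 3))  # rows are documents, columns are topics
```

Each row of both matrices sums to 1, which matches the description of phi.hat (one distribution over tokens per topic) and theta.hat (one distribution over topics per document).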

Examples

data(APinput)
#takes a while
## Not run: o <- fitLDA(APinput$word.id, APinput$doc.id)
data(APtopics) #load output instead for demonstration purposes
probs <- getProbs(word.id=APinput$word.id, doc.id=APinput$doc.id, topic.id=APtopics$topics,
                   vocab=APinput$vocab)
head(probs$phi.hat[,1:5])
head(probs$theta.hat)

kshirley/LDAtools documentation built on May 20, 2019, 7:03 p.m.