mallet.subset.topic.words: Estimate topic-word distributions from a sub-corpus

View source: R/mallet.R

mallet.subset.topic.wordsR Documentation

Estimate topic-word distributions from a sub-corpus

Description

This function returns a matrix of word probabilities for each topic similar to mallet.topic.words, but estimated from a subset of the documents in the corpus. The model assumes that topics are the same no matter where they are used, but we know this is often not the case. This function lets us test whether some words are used more or less than we expect in a particular set of documents.

Usage

mallet.subset.topic.words(
  topic.model,
  subset.docs,
  normalized = FALSE,
  smoothed = FALSE
)

Arguments

topic.model

A cc.mallet.topics.RTopicModel object created by MalletLDA.

subset.docs

A logical vector of TRUE/FALSE values specifying which documents should be used/included and which should be ignored.

normalized

If TRUE, normalize the rows so that each topic sums to one. If FALSE, values will be integers (possibly plus the smoothing constant) representing the actual number of words of each type in the topics.

smoothed

If TRUE, add the smoothing parameter for the model (initial value specified as beta in MalletLDA). If FALSE, many values will be zero.

Value

a number of topics by vocabulary size matrix for the the included documents.

See Also

mallet.topic.words

Examples

## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
   mallet.import(id.array = row.names(sotu),
                 text.array = sotu[["text"]],
                 stoplist = mallet_stoplist_file_path("en"),
                 token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics=10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract subcorpus topic word matrix
post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)
mallet.top.words(topic.model, word.weights = post1975_topic_words[2,], num.top.words = 5)

## End(Not run)


mallet documentation built on July 20, 2022, 5:08 p.m.