FormatRawLdaOutput: Format Raw Output from 'lda.collapsed.gibbs.sampler'


Description

Extracts the phi and theta matrices from an LDA model estimated with the lda package by Jonathan Chang.

Usage

FormatRawLdaOutput(lda_result, docnames, smooth = TRUE, softmax = FALSE)

Arguments

lda_result

The list value returned by lda.collapsed.gibbs.sampler

docnames

A character vector giving the names of documents. This is generally rownames(dtm).

smooth

Logical. Should the topic proportions be smoothed so that there is a positive value for each term in each topic? Defaults to TRUE.

softmax

Logical. Should the softmax function be used to normalize the raw output? If FALSE (the default), the output is normalized by dividing by its sum.
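
The effect of smooth and softmax can be illustrated on a toy count matrix in base R. This is only a sketch of the two normalization styles, not the function's internal code: the matrix values, the smoothing constant eps, and all variable names are assumptions made for illustration.

```r
# Toy topic-by-term count matrix (2 topics, 3 terms); the values are made up.
counts <- matrix(c(3, 0, 1,
                   0, 2, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("t_1", "t_2"), c("a", "b", "c")))

# Sum normalization: divide each row by its total. Zero counts stay zero.
sum_norm <- counts / rowSums(counts)

# Smoothing: add a small constant before normalizing so every entry is positive.
# The constant used here is an assumption, not necessarily the package's choice.
eps <- 1 / ncol(counts)
smoothed <- (counts + eps) / rowSums(counts + eps)

# Softmax normalization: exponentiate, then divide each row by its total.
softmax_norm <- exp(counts) / rowSums(exp(counts))
```

All three results have rows that sum to 1. Only the unsmoothed sum normalization keeps exact zeros, which is the case smooth = TRUE is meant to avoid.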

Value

Returns a list with two elements: phi, whose rows represent the distribution of words across a topic, and theta, whose rows represent the distribution of topics across a document.
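
As a rough illustration of how such a list might be read, the sketch below builds a hypothetical stand-in with the same two elements and looks up the most probable term per topic and the dominant topic per document. The matrices, row names, and column names are invented for the example and are not output from the function.

```r
# Hypothetical stand-in for the returned list: 2 topics, 3 terms, 2 documents.
result <- list(
  phi = matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("t_1", "t_2"), c("gene", "cell", "trial"))),
  theta = matrix(c(0.90, 0.10,
                   0.25, 0.75),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc_1", "doc_2"), c("t_1", "t_2")))
)

# Each row is a probability distribution, so it should sum to 1.
phi_totals <- rowSums(result$phi)
theta_totals <- rowSums(result$theta)

# Most probable term in each topic (one entry per row of phi).
top_term <- colnames(result$phi)[apply(result$phi, 1, which.max)]

# Dominant topic in each document (one entry per row of theta).
top_topic <- colnames(result$theta)[apply(result$theta, 1, which.max)]
```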

Examples

# Load a pre-formatted dtm
data(nih_sample_dtm)

# Get a sample of documents
dtm <- nih_sample_dtm[ sample(1:nrow(nih_sample_dtm), 20) , ]

# re-create a character vector of documents from the DTM
lex <- Dtm2Docs(dtm)

# Format for input to lda::lda.collapsed.gibbs.sampler
lex <- lda::lexicalize(lex, vocab=colnames(dtm))

# Fit the model from lda::lda.collapsed.gibbs.sampler
lda <- lda::lda.collapsed.gibbs.sampler(documents = lex, K = 5, 
                                         vocab = colnames(dtm), 
                                         num.iterations=200, 
                                         alpha=0.1, eta=0.05)
                                         
# Format the result to get phi and theta matrices                                        
lda <- FormatRawLdaOutput(lda_result=lda, docnames=rownames(dtm), smooth=TRUE)

ChengMengli/topic documentation built on May 31, 2019, 8:44 p.m.