optimal_k: Find Optimal Number of Topics

View source: R/optimal_k.R

Description

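Iteratively fits topic models with an increasing number of topics (up to max.k) and plots the harmonic mean of their log likelihoods against k, so that candidate values of k can be compared. If harmonic.mean = FALSE, the log likelihoods themselves are plotted instead.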

Usage

optimal_k(x, max.k = 30, harmonic.mean = TRUE,
  control = if (harmonic.mean) list(burnin = 500, iter = 1000, keep = 100) else NULL,
  method = if (harmonic.mean) "Gibbs" else "VEM",
  verbose = TRUE, drop.seed = TRUE, ...)

Arguments

x

A DocumentTermMatrix.

max.k

Maximum number of topics to fit (start small [i.e., default of 30] and add as necessary).

harmonic.mean

Logical. If TRUE the harmonic means of the log likelihoods are used to determine k (see http://stackoverflow.com/a/21394092/1000343). Otherwise just the log likelihoods are graphed against k (see http://stats.stackexchange.com/a/25128/7482).

method

The method to be used for fitting; currently method = "VEM" and method = "Gibbs" are supported. Defaults to "Gibbs" when harmonic.mean = TRUE and to "VEM" otherwise.

drop.seed

Logical. If TRUE, the seed argument is dropped from control.

burnin

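Object of class "integer"; number of Gibbs iterations omitted at the beginning. Supplied through control. The topicmodels default is 0; the default control shown in Usage sets it to 500 when harmonic.mean = TRUE.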

iter

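Object of class "integer"; number of Gibbs iterations. Supplied through control. The topicmodels default is 2000; the default control shown in Usage sets it to 1000 when harmonic.mean = TRUE.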

keep

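Object of class "integer"; if a positive integer, the log likelihood is saved every keep iterations. Supplied through control; the default control shown in Usage sets it to 100 when harmonic.mean = TRUE.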

...

Other arguments passed to ??LDAcontrol.
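
The harmonic mean referred to under harmonic.mean can be computed stably in log space. The following is a minimal base-R sketch, not the package's own code: the function name log_harmonic_mean is hypothetical, and it assumes the input is a numeric vector of log likelihoods saved from a Gibbs run after burn-in. Whatever numerical approach the package itself takes, a log-sum-exp shift is used here so that exp() does not overflow.

```r
# Hypothetical helper (not part of the package): returns the log of the
# harmonic mean of the likelihoods, given a vector of their logs.
log_harmonic_mean <- function(log_liks) {
  stopifnot(is.numeric(log_liks), length(log_liks) > 0)
  neg <- -log_liks     # log(1 / likelihood) for each saved draw
  shift <- max(neg)    # subtract the largest value before exp() to avoid overflow
  -(shift + log(mean(exp(neg - shift))))
}

# Made-up log likelihoods, for illustration only
ll <- c(-1200.5, -1198.2, -1201.7, -1199.9)
log_harmonic_mean(ll)
```

The result always lies between the smallest and largest input values, which makes for a quick sanity check.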

Value

Returns a data.frame of k (number of topics) and the associated log likelihood.
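
As a rough illustration of how such a result might be used, the sketch below picks the k with the largest log likelihood. The data.frame is a hypothetical stand-in: the column names k and log_lik and every value in it are assumptions, so check names() on the real return value before adapting this.

```r
# Hypothetical stand-in for the returned data.frame; the column names
# `k` and `log_lik` and all values are made up for illustration.
res <- data.frame(
  k = 2:6,
  log_lik = c(-5210.4, -5105.8, -5061.3, -5079.6, -5122.0)
)

# Select the k with the largest (least negative) log likelihood
best_k <- res$k[which.max(res$log_lik)]
best_k
```

The curve is often fairly flat near its peak, so it is reasonable to read this value alongside the plot rather than treat it as definitive.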

Author(s)

Ben Marwick and Tyler Rinker <tyler.rinker@gmail.com>.

References

http://stackoverflow.com/a/21394092/1000343
http://stats.stackexchange.com/a/25128/7482
Ponweiser, M. (2012). Latent Dirichlet Allocation in R (Diploma Thesis). Vienna University of Economics and Business, Vienna. http://epub.wu.ac.at/3558/1/main.pdf

Griffiths, T.L., and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228-5235. http://www.pnas.org/content/101/suppl_1/5228.full.pdf

Examples

## Install/Load Tools & Data
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/gofastr")
pacman::p_load(tm, topicmodels, dplyr, tidyr,  devtools, LDAvis, ggplot2)


## Source topicmodels2LDAvis function
devtools::source_url("https://gist.githubusercontent.com/trinker/477d7ae65ff6ca73cace/raw/79dbc9d64b17c3c8befde2436fdeb8ec2124b07b/topicmodels2LDAvis")

data(presidential_debates_2012)


## Generate Stopwords
stops <- c(
        tm::stopwords("english"),
        "governor", "president", "mister", "obama", "romney"
    ) %>%
    gofastr::prep_stopwords()


## Create the DocumentTermMatrix
doc_term_mat <- presidential_debates_2012 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, time, sep = "_"))) %>%
    gofastr::remove_stopwords(stops) %>%
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents()


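## Fit models up to max.k (default 30) and plot the harmonic mean of the log likelihoods
opti_k1 <- optimal_k(doc_term_mat)
opti_k1

## Plot the plain log likelihoods against k instead (method defaults to "VEM")
opti_k2 <- optimal_k(doc_term_mat, harmonic.mean = FALSE)
opti_k2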

mcallaghan/scimetrix documentation built on May 22, 2019, 12:58 p.m.