data_gen: data_gen


View source: R/data_gen.R

Description

Generates a simulated dataset following the latent Dirichlet allocation (LDA) model.
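For orientation, the textbook LDA generative process can be sketched in base R as below. This is only an illustration under simplifying assumptions (fixed document length, symmetric Dirichlet priors); it is not the viLDA implementation, and the names `rdirichlet_sym` and `simulate_lda` and their arguments are hypothetical.

```r
# Illustrative sketch of the LDA generative process (not the viLDA source).
# All function and argument names below are hypothetical.

# Draw n samples from a symmetric Dirichlet(alpha) of dimension k by
# normalising independent Gamma(alpha, 1) draws. Returns an n x k matrix.
rdirichlet_sym <- function(n, k, alpha) {
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), nrow = n, ncol = k)
  g / rowSums(g)
}

simulate_lda <- function(n_doc, n_vocab, n_top, doc_len = 50,
                         alpha_doc = 0.2, alpha_word = 0.2, seed = 1) {
  set.seed(seed)
  # One word distribution per topic (n_top x n_vocab).
  word_dist <- rdirichlet_sym(n_top, n_vocab, alpha_word)
  # One topic mixture per document (n_doc x n_top).
  doc_topics <- rdirichlet_sym(n_doc, n_top, alpha_doc)
  # Document-term count matrix (n_doc x n_vocab).
  counts <- matrix(0L, nrow = n_doc, ncol = n_vocab)
  for (d in seq_len(n_doc)) {
    # Pick a topic for every word slot, then a word from that topic.
    z <- sample.int(n_top, doc_len, replace = TRUE, prob = doc_topics[d, ])
    for (k in z) {
      w <- sample.int(n_vocab, 1, prob = word_dist[k, ])
      counts[d, w] <- counts[d, w] + 1L
    }
  }
  list(counts = counts, word_dist = word_dist, doc_topics = doc_topics)
}

sim <- simulate_lda(n_doc = 10, n_vocab = 30, n_top = 3)
dim(sim$counts)  # documents x vocabulary
```

Small Dirichlet parameters such as 0.2 concentrate each draw on a few components, so each simulated topic favours a handful of words and each document a handful of topics.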

Arguments

n_doc

The number of documents to generate

n_vocab

The number of distinct words in the corpus vocabulary

n_top

The number of topics/clusters (K)

doc_length_scale

A number proportional to the average number of words in a document (default = 8)

doc_length_scale_var

A number proportional to the variance of the number of words in a document (default = 2)

voc_p_scale

A number proportional to the initial probability of each word in a cluster (default = 4). The higher the value, the less uniformly weight is applied across all topics.

spike_overlap

A number proportional to the amount of vocabulary shared across documents from different clusters (default = 0.05). With the default of 0.05, documents from different clusters share roughly 5% of their word distributions with each other.

alphaWords

Hyperparameter for document-cluster distribution (default = 0.2)

alphaTopics

Hyperparameter for topic-cluster distribution (default = 0.2)

seed

The random seed for the data generation (set once at the beginning of the function, default = 19890418)

topic_mix

Boolean flag. If TRUE, each document can be generated from several different topic clusters (default = FALSE)

DEBUG

Boolean flag. If TRUE, debug print statements are shown to the user (default = FALSE)
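The effect of setting the seed once at the start of a generator can be illustrated with a small stand-in function. `toy_gen` below is hypothetical and only mimics the pattern described for the seed argument; the Poisson draw is a placeholder, not the sampling used by data_gen.

```r
# Hypothetical toy generator: sets the seed once, then draws.
toy_gen <- function(n, seed = 19890418) {
  set.seed(seed)        # set once, before any random draw
  rpois(n, lambda = 8)  # placeholder for the real sampling steps
}

a <- toy_gen(5)
b <- toy_gen(5)
identical(a, b)         # same seed, so the draws are identical

other <- toy_gen(5, seed = 1)  # a different seed gives a different stream
```

Note that calling `set.seed()` inside a function also resets the random number state of the calling session, so random draws made after the call are affected as well.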

Value

list("dat","word_dist","gen_topics","doc_len")
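A call might look like the sketch below. The argument names are taken from the Arguments section, but the values are arbitrary, and the example assumes that the viLDA package is installed and that data_gen is exported from it. Nothing is assumed about the types of the returned elements, so the result is only inspected with `names()` and `str()`.

```r
# Argument names follow the Arguments section; the values are arbitrary.
sim_args <- list(
  n_doc = 100, n_vocab = 500, n_top = 5,
  doc_length_scale = 8, doc_length_scale_var = 2,
  topic_mix = FALSE, seed = 19890418
)

# Run only if the package is available (assumes data_gen is exported).
if (requireNamespace("viLDA", quietly = TRUE)) {
  sim <- do.call(viLDA::data_gen, sim_args)
  names(sim)               # compare with the names listed under Value
  str(sim, max.level = 1)  # top-level structure only
}
```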


cvraut/viLDA documentation built on Dec. 19, 2021, 7:05 p.m.