Generates a simulated dataset following the latent Dirichlet allocation (LDA) model.
n_doc | The number of documents to generate |
n_vocab | The number of distinct words (vocabulary size) in the corpus |
n_top | The number of topics/clusters (K) |
doc_length_scale | A number proportional to the average number of words in a document (default = 8) |
doc_length_scale_var | A number proportional to the variance of the number of words in a document (default = 2) |
voc_p_scale | A number proportional to the initial probability of each word in a cluster (default = 4). The higher the value, the less uniform the weights applied across topics. |
spike_overlap | A number proportional to the amount of vocabulary shared across documents from different clusters (default = 0.05). With the default value, documents from different clusters share roughly 5% of their word distributions. |
alphaWords | Hyperparameter for the document-cluster distribution (default = 0.2) |
alphaTopics | Hyperparameter for the topic-cluster distribution (default = 0.2) |
seed | The random seed for data generation, set once at the beginning of the function (default = 19890418) |
topic_mix | Boolean flag; if TRUE, each document can be generated from different topic clusters (default = FALSE) |
DEBUG | Boolean flag; if TRUE, debug print statements are shown to the user (default = FALSE) |
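To make the roles of these arguments concrete, here is a minimal sketch of a generic LDA-style generator, written in Python with only the standard library. It is not this function's implementation. The names `simulate_lda`, `dirichlet`, `mean_len`, `sd_len`, `alpha_word`, and `alpha_doc` are hypothetical, document lengths are assumed to be rounded normal draws, ids are zero-based, and `voc_p_scale` and `spike_overlap` are not modelled at all. How `alphaWords` and `alphaTopics` map onto the two Dirichlet concentrations used here is also an assumption.

```python
import random
from collections import Counter


def dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_lda(n_doc, n_vocab, n_top, mean_len=8.0, sd_len=2.0,
                 alpha_word=0.2, alpha_doc=0.2, topic_mix=False,
                 seed=19890418):
    """Hypothetical generator; returns a dict shaped like the documented list."""
    rng = random.Random(seed)  # seeded once, at the start
    vocab = range(n_vocab)
    # One word distribution per topic: n_top rows, each summing to 1.
    word_dist = [dirichlet(alpha_word, n_vocab, rng) for _ in range(n_top)]

    dat, gen_topics, doc_len = [], [], []
    for doc_id in range(n_doc):
        # Assumption: rounded normal lengths, floored at one word.
        length = max(1, round(rng.gauss(mean_len, sd_len)))
        if topic_mix:
            # Each word may come from a different topic.
            theta = dirichlet(alpha_doc, n_top, rng)
            topics = rng.choices(range(n_top), weights=theta, k=length)
            main_topic = max(range(n_top), key=lambda k: theta[k])
        else:
            # The whole document comes from a single topic.
            main_topic = rng.randrange(n_top)
            topics = [main_topic] * length
        words = [rng.choices(vocab, weights=word_dist[k])[0] for k in topics]
        # Long format: one (doc_id, word_id, count) row per distinct word.
        for word_id, count in sorted(Counter(words).items()):
            dat.append((doc_id, word_id, count))
        gen_topics.append(main_topic)
        doc_len.append(length)
    return {"dat": dat, "word_dist": word_dist,
            "gen_topics": gen_topics, "doc_len": doc_len}


sim = simulate_lda(n_doc=5, n_vocab=20, n_top=3)
```

Because the generator is seeded once up front, calling it twice with the same arguments yields the same simulated corpus, which is the behaviour the `seed` argument describes.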
A list with the elements "dat", "word_dist", "gen_topics", and "doc_len":
$dat: data frame of the document_id-word_id-count data
$word_dist: matrix of the word-topic distributions
$gen_topics: the selected topic for each document
$doc_len: a vector of length n_doc giving the number of words in each document
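The document_id-word_id-count layout of `$dat` is a long (sparse) format. As an illustration of how such data can be consumed, the Python sketch below expands a small hand-written example into a dense document-term matrix. The helper name `to_doc_term_matrix` and the toy rows are hypothetical, ids are zero-based here (R ids would be one-based), and the consistency check assumes that `doc_len` counts word tokens rather than distinct words.

```python
def to_doc_term_matrix(dat, n_doc, n_vocab):
    """Expand (doc_id, word_id, count) rows into an n_doc x n_vocab matrix."""
    matrix = [[0] * n_vocab for _ in range(n_doc)]
    for doc_id, word_id, count in dat:
        matrix[doc_id][word_id] += count
    return matrix


# Toy stand-ins for the documented $dat and $doc_len elements.
dat = [(0, 1, 2), (0, 3, 1), (1, 0, 4)]
doc_len = [3, 4]

dtm = to_doc_term_matrix(dat, n_doc=2, n_vocab=4)
row_totals = [sum(row) for row in dtm]

# Under the token-count assumption, each row total matches doc_len.
assert row_totals == doc_len
```

A dense matrix is convenient for small simulations; for large vocabularies the long format is usually the better one to keep.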