Description Usage Format Details Note Author(s) Source See Also
A subset of Wikipedia artricles under the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms of the latent Dirichlet allocation model. This dataset contains 16 Wikipedia articles.
1 |
vocab a vector of unique words in the corpus vocabulary.
docs a list of documents in the corpus. Each item (represents a
document) is a matrix (2 X U) of word frequencies, where U represents the
number of unique words in a document. Each column in the matrix represents
a unique word in a document and contains
vocabulary-id. the index of the word in the vocabulary (starts with 0)
frequency. the relative frequency of the word in the document
docs.metadata a matrix of document (article) metadata, where each
row represents a document with
category. the Wikipedia category assigned to the article
title. the title of the Wikipedia web article
doc.N a vector of word counts of documents in the corpus
num.docs the number of documents in the corpus
class.labels a vector of unique categories (classes) in the corpus
ds.name the corpus name (string)
ds a list of two equal-length vectors
wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
############################################################################# Documents all datasets available in the R package: ldamcmc #############################################################################
Created on November 21, 2015
Clint P. George
Articles are downloaded from the English Wikipedia with the help of Media Wiki API.
Other datasets: autos-motorcycles,
bop, canis,
cats, felines,
ibm-mac,
med-christian-baseball, rec,
sci, whales,
wt16m, wt
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.