Description Usage Format Details Note Author(s) Source See Also
A subset of Wikipedia artricles under the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms of the latent Dirichlet allocation model. This dataset contains 16 Wikipedia articles.
1 |
vocab
a vector of unique words in the corpus vocabulary.
docs
a list of documents in the corpus. Each item (represents a
document) is a matrix (2 X U) of word frequencies, where U represents the
number of unique words in a document. Each column in the matrix represents
a unique word in a document and contains
vocabulary-id. the index of the word in the vocabulary (starts with 0)
frequency. the relative frequency of the word in the document
docs.metadata
a matrix of document (article) metadata, where each
row represents a document with
category. the Wikipedia category assigned to the article
title. the title of the Wikipedia web article
doc.N
a vector of word counts of documents in the corpus
num.docs
the number of documents in the corpus
class.labels
a vector of unique categories (classes) in the corpus
ds.name
the corpus name (string)
ds
a list of two equal-length vectors
wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
############################################################################# Documents all datasets available in the R package: ldamcmc #############################################################################
Created on November 21, 2015
Clint P. George
Articles are downloaded from the English Wikipedia with the help of Media Wiki API.
Other datasets: autos-motorcycles
,
bop
, canis
,
cats
, felines
,
ibm-mac
,
med-christian-baseball
, rec
,
sci
, whales
,
wt16m
, wt
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.