A subset of Wikipedia articles from the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms for the latent Dirichlet allocation (LDA) model. This dataset contains 16 Wikipedia articles. It is a variation of the data set wt16: to each document in the Whales category we appended a set of manually identified topical words from the Tires category, and vice versa. The appended set size is about 10 for wt16. The purpose of this mixing is to add noise to the Whales and Tires documents, which are otherwise relatively easy to distinguish, and to determine the relative performance of the various LDA models on corpora in which the documents have similar topic features.
vocab
a vector of unique words in the corpus vocabulary.
docs
a list of documents in the corpus. Each item represents a document and is
a 2 x U matrix of word frequencies, where U is the number of unique words
in the document. Each column of the matrix represents one unique word in
the document and contains
vocabulary-id. the index of the word in the vocabulary (starts at 0)
frequency. the relative frequency of the word in the document
docs.metadata
a matrix of document (article) metadata, where each
row represents a document with
category. the Wikipedia category assigned to the article
title. the title of the Wikipedia web article
doc.N
a vector of word counts of documents in the corpus
num.docs
the number of documents in the corpus
class.labels
a vector of unique categories (classes) in the corpus
ds.name
the corpus name (string)
ds
a list of two equal-length vectors
wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
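The relationship between ds, doc.N, docs, and vocab can be illustrated with a small sketch. It is written in Python on made-up toy data, not on the packaged dataset, and rests on several assumptions that the format description does not confirm: that did is 0-based like the vocabulary ids, that doc.N equals the number of word instances per document, and that the second row of each document matrix can be read as a count. The names doc_lengths and doc_matrix are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# Toy stand-ins for the fields described above; every value is made up.
vocab = ["whale", "ocean", "tire", "rubber"]

# ds: one entry per word instance in the corpus (parallel vectors).
wid = [0, 0, 1, 2, 2, 3, 2, 0]  # 0-based vocabulary ids
did = [0, 0, 0, 0, 1, 1, 1, 1]  # document indices (assumed 0-based here)


def doc_lengths(did, num_docs):
    """Count word instances per document (an analogue of doc.N)."""
    counts = Counter(did)
    return [counts[d] for d in range(num_docs)]


def doc_matrix(wid, did, d):
    """Build a 2 x U matrix for document d: ids in row 0, counts in row 1."""
    counts = Counter(w for w, k in zip(wid, did) if k == d)
    ids = sorted(counts)
    return [ids, [counts[i] for i in ids]]


doc_n = doc_lengths(did, num_docs=2)
first = doc_matrix(wid, did, 0)

# Map the 0-based ids of the first document back to vocabulary words.
words = [vocab[i] for i in first[0]]
```

Because the vocabulary ids start at 0 and R vectors are indexed from 1, the equivalent lookup in R would need to add 1 to each id before indexing vocab.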
Created on November 21, 2015
Clint P. George
Articles were downloaded from the English Wikipedia using the MediaWiki API.
Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, whales, wt16, wt