A subset of Wikipedia articles under the Wikipedia category Whales, formatted for running various Gibbs sampling algorithms for the latent Dirichlet allocation model. This dataset contains 153 Wikipedia articles from the following Wikipedia subcategories:
Baleen whales
Dolphins
Killer whales
Oceanic dolphins
Whale products
Whaling
vocab
a vector of unique words in the corpus vocabulary.
docs
a list of documents in the corpus. Each item (representing a
document) is a 2 x U matrix of word frequencies, where U is the
number of unique words in the document. Each column of the matrix represents
a unique word in the document and contains
vocabulary-id. the index of the word in the vocabulary (starts with 0)
frequency. the relative frequency of the word in the document
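As a minimal sketch of how one item of docs might be read, the snippet below builds a toy vocabulary and a toy 2 x U document matrix in the layout described above and maps the columns back to words. The names doc, words and word_freq, and all values, are invented for illustration and are not taken from the dataset. Because the vocabulary ids start at 0 and R vectors are indexed from 1, the ids are shifted by one before indexing.

```r
# Toy vocabulary and one toy document; these values are invented for
# illustration and are not taken from the whales dataset.
vocab <- c("baleen", "dolphin", "orca", "whaling")

# 2 x U matrix: row 1 holds 0-based vocabulary ids, row 2 holds frequencies.
doc <- rbind(c(0, 2, 3),
             c(2, 1, 4))

# R vectors are 1-based, so add 1 to the 0-based ids before indexing vocab.
words <- vocab[doc[1, ] + 1]

# Named vector pairing each word with its frequency in this document.
word_freq <- setNames(doc[2, ], words)
print(word_freq)
```

The same indexing would apply to each element of docs, for example inside lapply, assuming every element follows the layout described above.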
docs.metadata
a matrix of document (article) metadata, where each
row represents a document with
category. the Wikipedia category assigned to the article
title. the title of the Wikipedia web article
doc.N
a vector of word counts of documents in the corpus
num.docs
the number of documents in the corpus
class.labels
a vector of unique categories (classes) in the corpus
ds.name
the corpus name (string)
ds
a list of two equal-length vectors
wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
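The token-level layout of ds can be illustrated with a small sketch that counts tokens per document and tabulates a document-by-word count table from the two parallel vectors. The data below are toy values, and the names doc_lengths and counts are hypothetical. The snippet assumes wid is 0-based, as stated for vocabulary ids above, and that did is 1-based, which this page does not confirm; adjust the offset if the real data differ.

```r
# Toy token-level data, invented for illustration.
# Assumption: wid is 0-based (as for vocabulary ids); did is assumed to be
# 1-based here, which the documentation does not confirm.
vocab <- c("baleen", "dolphin", "orca", "whaling")
ds <- list(wid = c(0, 0, 2, 3, 1, 1),
           did = c(1, 1, 1, 2, 2, 2))

# The two vectors are parallel: one entry per word instance in the corpus.
stopifnot(length(ds$wid) == length(ds$did))

# Number of word instances (tokens) in each document.
doc_lengths <- tabulate(ds$did)

# Document-by-word count table; shift wid by 1 to index the R vector vocab.
counts <- table(doc = ds$did, word = vocab[ds$wid + 1])
print(doc_lengths)
print(counts)
```

Under these assumptions, doc_lengths plays the same role as doc.N, and each row of counts corresponds to the word frequencies of one document.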
Created on November 21, 2015
Clint P. George
Articles were downloaded from the English Wikipedia using the MediaWiki API.
Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, wt16m, wt16, wt