wt: Whales and Tires

Description Usage Format Note Author(s) Source See Also

Description

A subset of Wikipedia artricles under the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms of the latent Dirichlet allocation model. This dataset contains 84 Wikipedia articles.

Usage

1

Format

vocab a vector of unique words in the corpus vocabulary.

docs a list of documents in the corpus. Each item (represents a document) is a matrix (2 X U) of word frequencies, where U represents the number of unique words in a document. Each column in the matrix represents a unique word in a document and contains

docs.metadata a matrix of document (article) metadata, where each row represents a document with

doc.N a vector of word counts of documents in the corpus

num.docs the number of documents in the corpus

class.labels a vector of unique categories (classes) in the corpus

ds.name the corpus name (string)

ds a list of two equal-length vectors

Note

Created on November 21, 2015

Author(s)

Clint P. George

Source

Articles are downloaded from the English Wikipedia with the help of Media Wiki API.

See Also

Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, whales, wt16m, wt16


clintpgeorge/ldamcmc documentation built on Feb. 22, 2020, 12:39 p.m.