wt16m: Whales and Tires (8 documents each)


Description

A subset of Wikipedia articles under the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms for the latent Dirichlet allocation (LDA) model. This dataset contains 16 Wikipedia articles. It is a variation of the data set wt16: we appended to each document in the Whales category a set of manually identified topical words from the Tires category, and vice versa. The appended set size is about 10. The purpose of this mixing is to add noise to the Whales and Tires documents, which are otherwise relatively easy to distinguish, and to determine the relative performance of the various LDA models on corpora in which the documents have similar topic features.
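The mixing step described above can be sketched in R as follows. This is only an illustration under stated assumptions, not the script used to build wt16m: the toy documents, the word lists, and the helper mix_in are all hypothetical.

```r
# Hypothetical toy corpus: each document is a character vector of tokens.
whale_docs <- list(
  c("whale", "ocean", "baleen"),
  c("whale", "dolphin", "migration")
)
tire_docs <- list(
  c("tire", "rubber", "tread"),
  c("tire", "wheel", "pressure")
)

# Hypothetical hand-picked topical words for each category.
whale_words <- c("whale", "ocean")
tire_words  <- c("tire", "rubber")

# Hypothetical helper: append a fixed set of words to every document.
mix_in <- function(docs, noise_words) {
  lapply(docs, function(tokens) c(tokens, noise_words))
}

# Whales documents receive Tires words, and vice versa.
whale_mixed <- mix_in(whale_docs, tire_words)
tire_mixed  <- mix_in(tire_docs, whale_words)

print(whale_mixed[[1]])
```

After mixing, every document carries some vocabulary from the other category, which is what makes the two classes harder to separate.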

Usage

1

Format

vocab a vector of unique words in the corpus vocabulary.

docs a list of documents in the corpus. Each item represents a document and is a 2 x U matrix of word frequencies, where U is the number of unique words in that document. Each column in the matrix represents a unique word in a document and contains

docs.metadata a matrix of document (article) metadata, where each row represents a document with

doc.N a vector of word counts of documents in the corpus

num.docs the number of documents in the corpus

class.labels a vector of unique categories (classes) in the corpus

ds.name the corpus name (string)

ds a list of two equal-length vectors
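To make the shape of these fields concrete, here is a minimal sketch on a hypothetical toy corpus, not the real wt16m data. The row layout of each document matrix is an assumption made only for this illustration: row 1 holds a 1-based index into vocab and row 2 holds the count of that word.

```r
# Hypothetical toy corpus mirroring the documented field names.
vocab <- c("whale", "ocean", "tire", "rubber")

# ASSUMPTION: row 1 = 1-based index into vocab, row 2 = word count.
# matrix() fills column by column, so each pair below is one column.
doc1 <- matrix(c(1L, 2L,    # "whale" occurs twice
                 2L, 1L,    # "ocean" occurs once
                 3L, 1L),   # "tire" occurs once
               nrow = 2)
doc2 <- matrix(c(3L, 3L,    # "tire" occurs three times
                 4L, 2L),   # "rubber" occurs twice
               nrow = 2)
docs <- list(doc1, doc2)

# Number of documents in the corpus.
num.docs <- length(docs)

# U for each document: the number of columns of its 2 x U matrix.
unique.words <- sapply(docs, ncol)

# Total words per document, valid only under the assumed row layout.
doc.N <- sapply(docs, function(d) sum(d[2, ]))

print(unique.words)
print(doc.N)
```

Only the 2 x U shape and the meaning of U come from the field descriptions; with the real dataset, check the row layout before relying on it.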

Note

Created on November 21, 2015

Author(s)

Clint P. George

Source

Articles were downloaded from the English Wikipedia using the MediaWiki API.

See Also

Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, whales, wt16, wt


clintpgeorge/ldamcmc documentation built on Feb. 22, 2020, 12:39 p.m.