bop: Birds of Prey (C-9)

Description

A subset of Wikipedia articles under the Wikipedia category Birds of Prey, formatted for running various Gibbs sampling algorithms for the latent Dirichlet allocation model. This dataset contains 304 Wikipedia articles from the following Wikipedia categories:

Usage

data(bop)

Format

vocab a vector of unique words in the corpus vocabulary.

docs a list of documents in the corpus. Each item represents a document and is a 2 x U matrix of word frequencies, where U is the number of unique words in that document. Each column of the matrix represents a unique word in the document and contains

docs.metadata a matrix of document (article) metadata, where each row represents a document with

doc.N a vector of word counts of documents in the corpus

num.docs the number of documents in the corpus

class.labels a vector of unique categories (classes) in the corpus

ds.name the corpus name (string)

ds a list of two equal-length vectors
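
The layout of vocab, docs, doc.N and num.docs can be illustrated with a small toy corpus. The sketch below does not use the bop data: the words and documents are made up, and to_doc_matrix is a hypothetical helper. It also assumes that row 1 of each document matrix holds 0-based indices into vocab and row 2 holds the matching counts, which is the convention of the R lda package; the description above does not confirm this for bop.

```r
# Toy vocabulary and tokenised documents (made-up data, not taken from bop).
vocab <- c("eagle", "falcon", "hawk", "owl")
tokens <- list(
  c("eagle", "hawk", "eagle", "owl"),
  c("falcon", "falcon", "hawk")
)

# Hypothetical helper: build a 2 x U matrix for one document.
# Assumption: row 1 = 0-based index into vocab, row 2 = word count.
to_doc_matrix <- function(words, vocab) {
  counts <- table(factor(words, levels = vocab))
  present <- which(counts > 0)          # positions of words that occur
  rbind(as.integer(present - 1L),       # 0-based vocabulary indices
        as.integer(counts[present]))    # frequency of each word
}

docs <- lapply(tokens, to_doc_matrix, vocab = vocab)

# Fields derived from docs, mirroring doc.N and num.docs.
doc.N <- sapply(docs, function(d) sum(d[2, ]))
num.docs <- length(docs)

print(docs[[1]])
print(doc.N)
print(num.docs)
```

Under the same assumption, the words of a document can be recovered with vocab[docs[[1]][1, ] + 1], since R vectors are indexed from 1.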

Note

Created on November 21, 2015

Author(s)

Clint P. George

Source

Articles were downloaded from the English Wikipedia using the MediaWiki API.

See Also

Other datasets: autos-motorcycles, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, whales, wt16m, wt16, wt


clintpgeorge/ldamcmc documentation built on Feb. 22, 2020, 12:39 p.m.