wt16m: Whales and Tires (8 documents each)
In clintpgeorge/ldamcmc: Markov chain Monte Carlo Algorithms for the Latent Dirichlet Allocation Model

Description Usage Format Note Author(s) Source See Also

A subset of Wikipedia articles from the Wikipedia categories Whales and Tires, formatted for running various Gibbs sampling algorithms for the latent Dirichlet allocation (LDA) model. The dataset contains 16 Wikipedia articles, 8 per category. It is a variation of the dataset wt16: to each document in the Whales category we appended a set of manually identified topical words from the Tires category, and vice versa. The size of the appended set is about 10. The purpose of this mixing is to add noise to the Whales and Tires documents, which are otherwise relatively easy to distinguish, and to compare the performance of the various LDA models on corpora whose documents have similar topic features.
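The cross-appending step described above can be pictured with a small sketch. Everything here is hypothetical: the token vectors, the word sets, and the variable names are made up for illustration and are not taken from the package or from the actual articles.

```r
# Toy token vectors standing in for two articles (hypothetical content).
whale.doc <- c("whale", "ocean", "whale", "baleen")
tire.doc  <- c("tire", "rubber", "tread", "tire")

# Hypothetical hand-picked topical words for each category.
whale.words <- c("blubber", "fluke")
tire.words  <- c("tread", "rubber")

# Cross-append: the whale document receives tire words and vice versa,
# which blurs the boundary between the two categories.
whale.mixed <- c(whale.doc, tire.words)
tire.mixed  <- c(tire.doc, whale.words)

print(whale.mixed)
print(tire.mixed)
```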

data(wt16m)

vocab a vector of unique words in the corpus vocabulary.

docs a list of documents in the corpus. Each item represents one document and is a 2 x U matrix of word frequencies, where U is the number of unique words in that document. Each column corresponds to one unique word in the document and contains

vocabulary-id. the index of the word in the vocabulary (indices start at 0)
frequency. the relative frequency of the word in the document
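Because the vocabulary ids start at 0 while R vectors are indexed from 1, looking up words requires shifting the ids by one. The sketch below uses a hypothetical toy vocabulary and a hand-built 2 x U matrix in the layout described above; it does not load the real dataset, and the name doc.table is illustrative only.

```r
# Hypothetical toy vocabulary and one document in the 2 x U layout.
vocab <- c("baleen", "ocean", "rubber", "tire", "whale")
doc <- rbind(c(0, 1, 4),           # row 1: 0-based vocabulary ids
             c(0.25, 0.25, 0.50))  # row 2: relative frequencies

# R vectors are 1-based, so shift the ids by one before indexing vocab.
words <- vocab[doc[1, ] + 1]

# Pair each word with its frequency for readability.
doc.table <- data.frame(word = words, frequency = doc[2, ])
print(doc.table)
```

With the real data, the same lookup would presumably apply to any element of docs, for example docs[[1]].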

docs.metadata a matrix of document (article) metadata, where each row represents a document with

category. the Wikipedia category assigned to the article
title. the title of the Wikipedia web article

doc.N a vector holding the word count of each document in the corpus

num.docs the number of documents in the corpus

class.labels a vector of unique categories (classes) in the corpus

ds.name the corpus name (string)

ds a list of two equal-length vectors

wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
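The two vectors in ds give a token-level view of the corpus: position i of wid and did together describe the i-th word instance. A minimal sketch with a hypothetical toy corpus follows. It assumes document indices start at 1, which the description above does not state; if they start at 0, add one before tabulating.

```r
# Hypothetical token-level corpus: 7 word instances across 2 documents.
ds <- list(
  wid = c(4, 1, 4, 0, 3, 2, 3),  # vocabulary id of each word instance
  did = c(1, 1, 1, 1, 2, 2, 2)   # document index of each instance (assumed 1-based)
)

# Both vectors must have one entry per word instance.
stopifnot(length(ds$wid) == length(ds$did))

# Counting instances per document recovers per-document word counts,
# the quantity that doc.N is described as holding.
doc.N <- tabulate(ds$did)
num.docs <- length(doc.N)
print(doc.N)
```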

Created on November 21, 2015

Clint P. George

Articles were downloaded from the English Wikipedia using the MediaWiki API.

Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, rec, sci, whales, wt16, wt

clintpgeorge/ldamcmc documentation built on Feb. 22, 2020, 12:39 p.m.
