A collection of newsgroup messages with classes.

Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Usage

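data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)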

Format

newsgroup.train.documents and newsgroup.test.documents together comprise a corpus of approximately 20,000 newsgroup documents conforming to the LDA format, partitioned into 11269 training and 7505 test cases (18774 in total) distributed (nearly) evenly across 20 classes.
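
To make "the LDA format" concrete, the sketch below builds a tiny corpus by hand in base R. It assumes the layout described under lda.collapsed.gibbs.sampler: a list with one two-row integer matrix per document, where row 1 holds 0-indexed positions into the vocabulary and row 2 holds the corresponding word counts. The objects toy.vocab and toy.documents are hypothetical stand-ins, not part of the data set.

```r
# Hypothetical toy vocabulary (not the real newsgroup.vocab).
toy.vocab <- c("space", "orbit", "hockey", "goal")

# Each document is a 2-row integer matrix, filled column by column:
#   row 1: 0-indexed position of a word in the vocabulary
#   row 2: number of times that word occurs in the document
toy.documents <- list(
  matrix(as.integer(c(0, 2,  1, 1)), nrow = 2),  # "space" x2, "orbit" x1
  matrix(as.integer(c(2, 3,  3, 1)), nrow = 2)   # "hockey" x3, "goal" x1
)

# Recover the words of the first document.
# Add 1 to the indices because R vectors are 1-indexed.
doc <- toy.documents[[1]]
words <- toy.vocab[doc[1, ] + 1]
counts <- doc[2, ]

# Total number of word tokens in each document.
doc.lengths <- sapply(toy.documents, function(d) sum(d[2, ]))
```

The same indexing pattern (vocabulary[document[1, ] + 1]) should apply to the real documents and newsgroup.vocab, provided they follow this layout.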

newsgroup.train.labels is a numeric vector of length 11269 which gives a class label from 1 to 20 for each training document in the corpus.

newsgroup.test.labels is a numeric vector of length 7505 which gives a class label from 1 to 20 for each test document in the corpus.

newsgroup.vocab is the vocabulary of the corpus.

newsgroup.label.map maps the numeric class labels to actual class names.

Source

http://qwone.com/~jason/20Newsgroups/

See Also

lda.collapsed.gibbs.sampler for the format of the corpus.

Examples

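A minimal sketch of loading and inspecting the training split. It assumes the lda package is installed and that the objects have the shapes described under Format; the code is skipped if the package is not available. The internal structure of newsgroup.label.map is not described on this page, so the sketch only prints it with str() rather than indexing into it.

```r
# Assumes the 'lda' package is installed; the block is skipped otherwise.
if (requireNamespace("lda", quietly = TRUE)) {
  data(newsgroup.train.documents, package = "lda")
  data(newsgroup.train.labels, package = "lda")
  data(newsgroup.vocab, package = "lda")
  data(newsgroup.label.map, package = "lda")

  # Number of training documents, and number of documents per class.
  print(length(newsgroup.train.documents))
  print(table(newsgroup.train.labels))

  # First few words of the first training document.
  # Word indices are assumed to be 0-based, hence the + 1.
  first.doc <- newsgroup.train.documents[[1]]
  print(head(newsgroup.vocab[first.doc[1, ] + 1]))

  # Inspect how the label map is stored before relying on its layout.
  str(newsgroup.label.map)
}
```

The test split can be inspected the same way by substituting newsgroup.test.documents and newsgroup.test.labels.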