A shortened collection of newsgroup messages with the first 3 classes.

Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This package includes only the first 3 classes, for demonstration purposes.

Usage

Format

newsgroup.train.documents and newsgroup.test.documents comprise a corpus of 2731 newsgroup documents, partitioned into 1633 training and 1098 test documents that are distributed approximately evenly across 3 classes.

newsgroup.train.labels is a numeric vector of length 1633 which gives a class label from 1 to 3 for each training document in the corpus.

newsgroup.test.labels is a numeric vector of length 1098 which gives a class label from 1 to 3 for each test document in the corpus.
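
The properties stated for the label vectors (numeric, a fixed length, values from 1 to 3) can be checked with a few assertions. The following is a minimal sketch, not part of the package: check_labels is a hypothetical helper, and toy_labels is simulated data standing in for newsgroup.train.labels so that the block runs on its own.

```r
# Hypothetical helper (not provided by the package): validate a label vector
# against the documented length and class range.
check_labels <- function(labels, expected_length, n_classes = 3) {
  stopifnot(is.numeric(labels))
  stopifnot(length(labels) == expected_length)
  stopifnot(all(labels %in% seq_len(n_classes)))
  # Return per-class counts so the class balance can be inspected.
  table(factor(labels, levels = seq_len(n_classes)))
}

# Simulated stand-in for newsgroup.train.labels (NOT the real data).
set.seed(1)
toy_labels <- sample(1:3, size = 1633, replace = TRUE)

counts <- check_labels(toy_labels, expected_length = 1633)
print(counts)

# The documented split sizes add up to the documented corpus size.
stopifnot(1633 + 1098 == 2731)
```

With the data loaded, the same helper could presumably be called as check_labels(newsgroup.train.labels, 1633) and check_labels(newsgroup.test.labels, 1098).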

newsgroup.vocab is the vocabulary of the corpus.

stopwords contains English stopwords extracted from the tm package.
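
A common use of a stopword list is to filter a vocabulary before modelling. The sketch below illustrates the idea under the assumption that both the vocabulary and the stopwords are character vectors; toy_vocab and toy_stopwords are made-up stand-ins, not the packaged newsgroup.vocab and stopwords objects.

```r
# Toy stand-ins (NOT the packaged data); both are assumed to be character vectors.
toy_vocab <- c("the", "windows", "of", "graphics", "and", "atheism")
toy_stopwords <- c("the", "of", "and")

# Keep only the vocabulary terms that are not stopwords.
keep <- !(toy_vocab %in% toy_stopwords)
filtered_vocab <- toy_vocab[keep]

print(filtered_vocab)
```

If the packaged objects have the assumed types, the same expression would apply with newsgroup.vocab and stopwords in place of the toy vectors. Note that removing terms changes their positions, so any documents indexed against the original vocabulary would need to be re-indexed.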

Source

http://qwone.com/~jason/20Newsgroups/

Examples
