The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This package includes only the first 3 classes, for demonstration purposes.
The training and test sets together comprise a corpus of 2731 newsgroup documents, partitioned into 1633 training
and 1098 test cases evenly distributed across 3 classes.
newsgroup.train.labels is a numeric vector of length 1633 which gives
a class label from 1 to 3 for each training document in the corpus.
newsgroup.test.labels is a numeric vector of length 1098 which gives
a class label from 1 to 3 for each test document in the corpus.
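
The label vectors described above can be sanity-checked against the stated lengths and class range. The following is a minimal sketch, not part of the package: summarise_labels and its parameters are hypothetical names introduced for illustration, and the final block assumes the data objects have already been loaded into the session (it does nothing otherwise).

```r
# Hypothetical helper (not provided by the package): checks that a label
# vector has the expected length and only uses classes 1..n_classes,
# then returns the number of documents per class.
summarise_labels <- function(labels, expected_length, n_classes = 3) {
  stopifnot(length(labels) == expected_length)
  stopifnot(all(labels %in% seq_len(n_classes)))
  table(factor(labels, levels = seq_len(n_classes)))
}

# Assumption: the data objects are already loaded in the session.
# If they are not, this block is skipped.
if (exists("newsgroup.train.labels") && exists("newsgroup.test.labels")) {
  print(summarise_labels(newsgroup.train.labels, 1633))
  print(summarise_labels(newsgroup.test.labels, 1098))
}
```

The per-class counts returned by the helper make it easy to see how evenly the documents are spread across the 3 classes in each split.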
newsgroup.vocab is the vocabulary of the corpus.
stopwords contains English stopwords extracted from the tm package.