A corpus created from a random subset of articles in the following 20Newsgroups categories:
Automobiles (50 documents)
Motorcycles (50 documents)
Baseball (50 documents)
Hockey (50 documents).
All four of these categories are classified under the super-category Recreation in the 20Newsgroups dataset.
data("rec")
vocab
a vector of unique words in the corpus vocabulary.
docs
a list of documents in the corpus. Each item represents a
document and is a 2 x U matrix of word frequencies, where U is the
number of unique words in the document. Each column of the matrix
corresponds to one unique word in the document and contains
vocabulary-id. the index of the word in the vocabulary (starts at 0)
frequency. the relative frequency of the word in the document
docs.metadata
a matrix of document (article) metadata, where each
row represents a document with
category. the 20Newsgroups category assigned to the article
title. the title of the article
doc.N
a vector giving the word count of each document in the corpus
num.docs
the number of documents in the corpus
class.labels
a vector of unique categories (classes) in the corpus
ds.name
the corpus name (string)
ds
a list of two equal-length vectors
wid. vocabulary ids of the instances of words in the corpus (a vector)
did. document indices of the instances of words in the corpus (a vector)
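The zero-based vocabulary ids are easy to misread in R, where vectors are indexed from 1. The sketch below uses a small invented corpus (not the real data) laid out like the fields above. It assumes that the frequency row holds integer counts, that wid is zero-based like vocabulary-id, and that did is one-based; none of these assumptions is confirmed by this page.

```r
# Toy stand-in for the objects described above; these values are invented
# for illustration and are not taken from the real corpus.
vocab <- c("engine", "helmet", "pitcher", "goalie")

# One document as a 2 x U matrix: row 1 holds zero-based vocabulary ids,
# row 2 holds the frequency of each word (assumed here to be counts).
doc <- rbind(c(0, 3), c(2, 1))

# Vocabulary ids start at 0, but R vectors are indexed from 1, so add 1.
doc.words <- vocab[doc[1, ] + 1]

# Pair each word with its frequency.
doc.table <- data.frame(word = doc.words, frequency = doc[2, ])

# Token-level view, as in ds: one entry per word instance.
# Assumption: wid is zero-based like vocabulary-id, did is one-based.
ds <- list(wid = c(0, 0, 3, 2, 1), did = c(1, 1, 1, 2, 2))

# Number of word instances per document, comparable to doc.N.
doc.N <- tabulate(ds$did)
```

With the real corpus loaded, the same indexing pattern (adding 1 to each vocabulary id before subsetting vocab) should apply to any item of docs, provided the assumptions above hold.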
Created on November 21, 2015
Clint P. George
Articles and categories are adapted from the 20Newsgroups dataset.
Other datasets: autos-motorcycles, bop, canis, cats, felines, ibm-mac, med-christian-baseball, sci, whales, wt16m, wt16, wt