A subset of the 20Newsgroups dataset. The corpus consists of 10,764 news articles and a vocabulary of 9,208 unique words.
data("news")
vocab
a vector of unique words in the corpus vocabulary.
docs
a list of documents in the corpus. Each item represents one
document and is a 2 x U matrix of word frequencies, where U is the
number of unique words in that document. Each column corresponds to
one unique word in the document and contains
vocabulary-id. the index of the word in the vocabulary (starts at 0)
frequency. the relative frequency of the word in the document
docs.metadata
a matrix of document (article) metadata, where each
row represents a document with
category. the category assigned to the article
name. the name of the news article from the 20Newsgroups dataset
doclength. the number of words in the article
collection. the collection name of each article
cids
a vector of document collection IDs
class.labels
a vector of categories (classes) in the corpus
collection.labels
a vector of collections in the corpus
ds.name
the corpus name (string)
num.docs
the number of documents in the corpus
V
the vocabulary size
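As a minimal sketch of how the docs format can be read, the code below builds a small toy corpus in the same shape and decodes it. All values, and the helper names decode_doc and to_dense, are hypothetical and not part of the dataset or its package. It assumes the first matrix row holds the vocabulary ids and the second row holds the frequencies, following the order in which the two fields are listed above.

```r
# Toy stand-ins for `vocab` and `docs` (made-up values, not from the real corpus).
vocab <- c("space", "nasa", "hockey", "team", "orbit")

# Each document is a 2 x U matrix.
# Assumption: row 1 = 0-based vocabulary ids, row 2 = frequencies.
doc1 <- rbind(c(0, 1, 4),
              c(0.5, 0.25, 0.25))
doc2 <- rbind(c(2, 3),
              c(0.5, 0.5))
docs <- list(doc1, doc2)

# Decode one document into a frequency vector named by word.
decode_doc <- function(doc, vocab) {
  ids <- doc[1, ] + 1  # shift 0-based ids to R's 1-based indexing
  stats::setNames(doc[2, ], vocab[ids])
}

# Expand the sparse per-document matrices into a dense
# documents x vocabulary matrix.
to_dense <- function(docs, V) {
  m <- matrix(0, nrow = length(docs), ncol = V)
  for (d in seq_along(docs)) {
    m[d, docs[[d]][1, ] + 1] <- docs[[d]][2, ]
  }
  m
}

decoded <- decode_doc(doc1, vocab)
dtm <- to_dense(docs, length(vocab))
print(decoded)
print(dtm)
```

With the real data, the same helpers would be called as decode_doc(docs[[1]], vocab) and to_dense(docs, V) after data("news"), provided the row-order assumption holds. Forgetting the + 1 shift would silently drop any word with id 0, since R ignores a zero index.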
Created on July 26, 2015
Clint P. George
The articles were downloaded via scikit-learn.