news: 16 Newsgroups Dataset

Description

A subset of the 20 Newsgroups dataset. The corpus consists of 10,764 news articles and a vocabulary of 9,208 unique words.

Usage

data("news")

Format

vocab a vector of unique words in the corpus vocabulary.

docs a list of documents in the corpus. Each item represents a document and is a 2 x U matrix of word frequencies, where U is the number of unique words in that document. Each column in the matrix represents a unique word in the document and contains

docs.metadata a matrix of document (article) metadata, where each row represents a document with

cids a vector of document collection IDs

class.labels a vector of categories (classes) in the corpus

collection.labels a vector of collections in the corpus

ds.name the corpus name (string)

num.docs the number of documents in the corpus

V the vocabulary size
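
The per-document 2 x U matrices in docs can be expanded into a dense document-term matrix. The following is a minimal sketch on a toy stand-in rather than the real dataset. It assumes that row 1 of each matrix holds a 0-based index into vocab and row 2 holds the word count; this row layout is an assumption and is not confirmed by the description above. The helper name to_dtm and the toy values are hypothetical.

```r
# Toy stand-in for the vocab and docs objects; values are made up.
vocab <- c("space", "nasa", "hockey", "team")
docs <- list(
  matrix(c(0L, 2L,    # column 1: word id 0 ("space"), count 2
           1L, 1L),   # column 2: word id 1 ("nasa"),  count 1
         nrow = 2),
  matrix(c(2L, 3L,    # column 1: word id 2 ("hockey"), count 3
           3L, 1L),   # column 2: word id 3 ("team"),   count 1
         nrow = 2)
)

# ASSUMPTION: row 1 = 0-based vocabulary index, row 2 = word count.
# to_dtm is a hypothetical helper, not part of any package.
to_dtm <- function(docs, vocab) {
  dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
                dimnames = list(NULL, vocab))
  for (d in seq_along(docs)) {
    ids <- docs[[d]][1, ] + 1L     # shift to R's 1-based indexing
    dtm[d, ids] <- docs[[d]][2, ]  # fill in the counts for document d
  }
  dtm
}

dtm <- to_dtm(docs, vocab)
print(dtm)
```

If the row layout holds for the real data, calling to_dtm(docs, vocab) on the loaded objects would give a matrix with num.docs rows and V columns. A sparse representation would likely be preferable at that size.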

Note

Created on July 26, 2015

Author(s)

Clint P. George

Source

Articles were downloaded via scikit-learn.

See Also

Other datasets: nips, yelp


clintpgeorge/clda documentation built on May 13, 2019, 8 p.m.