download_twenty_newsgroups: Download 20 Newsgroups


download_twenty_newsgroups    R Documentation

Download 20 Newsgroups

Description

Downloads the 20 Newsgroups dataset, which contains approximately 20,000 newsgroup documents from 20 different newsgroups. The documents are distributed approximately evenly across the newsgroups.

Usage

download_twenty_newsgroups(
  subset = "all",
  verbose = FALSE,
  tmpdir = NULL,
  cleanup = TRUE
)

Arguments

subset

A string specifying which subset of the dataset to download and process. Acceptable values are "train" for the training set, "test" for the test set, and "all" for both sets combined. Default is "all".

verbose

If TRUE, log the progress of the download, extraction, and processing steps. Default is FALSE.

tmpdir

A string specifying the directory where the dataset will be downloaded and extracted. If NULL (default), a temporary directory is used. If a path is provided and does not exist, it will be created.

cleanup

A logical flag indicating whether to delete the downloaded and extracted files after processing. If TRUE and tmpdir was created by this function, tmpdir will be deleted after processing. Default is TRUE.
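The interaction between tmpdir and cleanup can be sketched in base R. This is a minimal illustration of the documented behaviour, not the package's implementation: the helper with_work_dir and its work argument are hypothetical, and the sketch covers only the directory itself, not the removal of individual downloaded files inside a directory that already existed.

```r
# Hypothetical helper illustrating the documented tmpdir/cleanup contract.
# `work` stands in for the download, extraction and processing steps.
with_work_dir <- function(work, tmpdir = NULL, cleanup = TRUE) {
  created <- FALSE
  if (is.null(tmpdir)) {
    # No path given: use a fresh temporary directory
    tmpdir <- tempfile(pattern = "ng20_")
  }
  if (!dir.exists(tmpdir)) {
    # A path that does not exist yet is created
    dir.create(tmpdir, recursive = TRUE)
    created <- TRUE
  }
  if (cleanup && created) {
    # Remove the directory only if this call created it
    on.exit(unlink(tmpdir, recursive = TRUE), add = TRUE)
  }
  work(tmpdir)
}

# Case 1: the directory is created by the helper, so it is removed afterwards
new_dir <- file.path(tempdir(), "ng20_demo_new")
unlink(new_dir, recursive = TRUE)
seen_new <- with_work_dir(function(d) dir.exists(d), tmpdir = new_dir)
new_dir_exists_after <- dir.exists(new_dir)

# Case 2: the directory already existed, so it is left in place
old_dir <- file.path(tempdir(), "ng20_demo_old")
dir.create(old_dir, showWarnings = FALSE)
with_work_dir(function(d) invisible(NULL), tmpdir = old_dir)
old_dir_exists_after <- dir.exists(old_dir)
```

Registering the removal with on.exit means the directory is also cleaned up if the processing step fails partway through, which is one plausible reason to structure a function this way.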

Format

A data frame with 6 variables:

Id

A unique identifier for the document, consisting of the subset concatenated with the position in the subset, e.g. train_1.

FileId

The integer identifier of the document, from the filename of the downloaded data. Be aware that these are not unique.

Text

The full text of the message including any header, footer, and quotes. Newlines are preserved.

Subset

A factor with two levels: train and test, indicating whether the document is from the training or test subset.

Label

The newsgroup represented by an integer id, in the range 0-19.

Newsgroup

A factor with 20 levels, indicating the newsgroup that the document belongs to.
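The difference between Id and FileId can be illustrated with a toy data frame. The sketch below assumes that the position is 1-based and joined to the subset name with an underscore, as the train_1 example suggests; the FileId values are made up, and this is not the package's own code.

```r
# Toy stand-in for the downloaded data; FileId values are invented
toy <- data.frame(
  FileId = c(101L, 102L, 103L, 101L, 104L),
  Subset = factor(
    c("train", "train", "train", "test", "test"),
    levels = c("train", "test")
  )
)

# Position of each row within its own subset (1, 2, 3 for train; 1, 2 for test)
position <- ave(seq_len(nrow(toy)), toy$Subset, FUN = seq_along)

# Build the identifier as <subset>_<position>
toy$Id <- paste(toy$Subset, position, sep = "_")

# FileId repeats across subsets here, while Id does not
file_id_has_duplicates <- anyDuplicated(toy$FileId) > 0
id_is_unique <- anyDuplicated(toy$Id) == 0
```

When joining or deduplicating rows, Id is therefore the safer key; FileId is mainly useful for tracing a row back to the original file.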

The labels correspond to:

 0  alt.atheism
 1  comp.graphics
 2  comp.os.ms-windows.misc
 3  comp.sys.ibm.pc.hardware
 4  comp.sys.mac.hardware
 5  comp.windows.x
 6  misc.forsale
 7  rec.autos
 8  rec.motorcycles
 9  rec.sport.baseball
10  rec.sport.hockey
11  sci.crypt
12  sci.electronics
13  sci.med
14  sci.space
15  soc.religion.christian
16  talk.politics.guns
17  talk.politics.mideast
18  talk.politics.misc
19  talk.religion.misc

and are also present as the Newsgroup factor.

There are 11,314 items in the training set and 7,532 items in the test set, for a total of 18,846 items if you choose subset = "all".
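The correspondence between Label and Newsgroup listed above can be written out as a small lookup. This is a sketch only: the vector newsgroups and the helper label_to_newsgroup are hypothetical names, and the reverse conversion assumes the factor levels are stored in the same order as the label list, which this page does not state explicitly.

```r
# Newsgroup names in label order (label 0 first)
newsgroups <- c(
  "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
  "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
  "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
  "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
  "sci.space", "soc.religion.christian", "talk.politics.guns",
  "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc"
)

# Labels are zero-based while R indexing is one-based, hence the + 1L
label_to_newsgroup <- function(label) {
  factor(newsgroups[label + 1L], levels = newsgroups)
}

ng <- label_to_newsgroup(c(0L, 11L, 19L))
ng_names <- as.character(ng)

# Going back from factor to label, assuming levels are in label order
labels_again <- as.integer(ng) - 1L

# The subset sizes quoted above add up to the size of subset = "all"
total_items <- 11314L + 7532L
```

The off-by-one between the 0-19 labels and R's 1-based factor codes is the main thing to watch when passing Label to code that expects R-style class indices.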

Details

To do any analysis on this text, you will want to use tools from packages such as tm and tidytext. The files are read as latin1 encoding, but there can still be some odd control codes in some of the messages.
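Before tokenizing, it may help to convert the text to UTF-8 and strip stray control codes. The following base R sketch uses a made-up message rather than real data, and the choice of which control characters to remove (everything below the space character except tab, newline and carriage return, plus DEL) is an assumption, not something the package prescribes.

```r
# A made-up latin1 message containing a stray control character (\x01)
raw_text <- "Subject: caf\xe9 prices\x01\nSee below.\n"
Encoding(raw_text) <- "latin1"

# Convert from latin1 to UTF-8 so downstream tools see valid text
utf8_text <- iconv(raw_text, from = "latin1", to = "UTF-8")

# Drop control characters but keep tab (\x09), newline (\x0A)
# and carriage return (\x0D), so line structure is preserved
clean_text <- gsub(
  "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "", utf8_text,
  perl = TRUE
)
```

The same two calls can be applied to the whole Text column at once, since both iconv and gsub are vectorized over character vectors.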

Value

A data frame containing the 20 Newsgroups data, with the columns described under Format.

References

Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning 1995 (pp. 331-339). Morgan Kaufmann.

See Also

http://qwone.com/~jason/20Newsgroups/

Chapter 9 of Text Mining with R: A Tidy Approach (Silge and Robinson) for a case study using the same dataset.

Examples

## Not run: 

# Download and process the training set
ng_train <- download_twenty_newsgroups(subset = "train")

# Download and process both training and test sets, with verbose output
ng_all <- download_twenty_newsgroups(subset = "all", verbose = TRUE)

# Download and process the test set, using a specific directory and enabling
# cleanup
ng_test <- download_twenty_newsgroups(
  subset = "test",
  tmpdir = "path/to/dir", cleanup = TRUE
)

## End(Not run)


jlmelville/snedata documentation built on Jan. 13, 2024, 2:06 a.m.