download_twenty_newsgroups: Download 20 Newsgroups
In jlmelville/snedata: SNE Simulation Dataset Functions

Description

Downloads the 20 Newsgroups dataset, a collection of approximately 20,000 documents drawn from 20 different newsgroups. The documents are distributed approximately evenly across the newsgroups.

Usage

download_twenty_newsgroups(
  subset = "all",
  verbose = FALSE,
  tmpdir = NULL,
  cleanup = TRUE
)

Arguments

`subset`	A string specifying which subset of the dataset to download and process. Acceptable values are `"train"` for the training set, `"test"` for the test set, and `"all"` for both sets combined. Default is `"all"`.
`verbose`	If `TRUE`, log the progress of the download, extraction, and processing steps. Default is `FALSE`.
`tmpdir`	A string specifying the directory where the dataset will be downloaded and extracted. If `NULL` (default), a temporary directory is used. If a path is provided and does not exist, it will be created.
`cleanup`	A logical flag indicating whether to delete the downloaded and extracted files after processing. If `TRUE` and `tmpdir` was created by this function, `tmpdir` will be deleted after processing. Default is `TRUE`.

Format

A data frame with 6 variables:

Id: A unique identifier for the document, consisting of the subset concatenated with the position in the subset, e.g. train_1.
FileId: The integer identifier of the document, from the filename of the downloaded data. Be aware that these are not unique.
Text: The full text of the message including any header, footer, and quotes. Newlines are preserved.
Subset: A factor with two levels: train and test, indicating whether the document is from the training or test subset.
Label: The newsgroup represented by an integer id, in the range 0-19.
Newsgroup: A factor with 20 levels, indicating the newsgroup that the document belongs to.

The labels correspond to:

0: alt.atheism
1: comp.graphics
2: comp.os.ms-windows.misc
3: comp.sys.ibm.pc.hardware
4: comp.sys.mac.hardware
5: comp.windows.x
6: misc.forsale
7: rec.autos
8: rec.motorcycles
9: rec.sport.baseball
10: rec.sport.hockey
11: sci.crypt
12: sci.electronics
13: sci.med
14: sci.space
15: soc.religion.christian
16: talk.politics.guns
17: talk.politics.mideast
18: talk.politics.misc
19: talk.religion.misc

and are also present as the Newsgroup factor.

There are 11,314 items in the training subset and 7,532 items in the test subset, for a total of 18,846 items if you choose subset = "all".
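
The relationship between the integer Label and the Newsgroup factor can be sketched without downloading anything. The data frame `toy` below is a made-up stand-in for the returned value, and the assumption that the factor levels follow the label order is ours, not something this page states. The main point is that the labels are 0-based while R indexing is 1-based.

```r
# Newsgroup names in label order: label 0 is the first entry, label 19 the last
newsgroups <- c(
  "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
  "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
  "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
  "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
  "sci.space", "soc.religion.christian", "talk.politics.guns",
  "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc"
)

# Toy stand-in for a few rows of the returned data frame (hypothetical values)
toy <- data.frame(
  Id = c("train_1", "train_2", "test_1"),
  Label = c(0L, 14L, 19L),
  Subset = factor(c("train", "train", "test"), levels = c("train", "test")),
  stringsAsFactors = FALSE
)

# Labels start at 0 but R vectors start at 1, so add 1 when indexing
toy$Newsgroup <- factor(newsgroups[toy$Label + 1L], levels = newsgroups)

# Number of documents in each subset
subset_counts <- table(toy$Subset)
```

With the real data, the same `table()` call on the Subset column should reproduce the subset sizes quoted above.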

Details

To analyze this text, you will probably want to use tools from packages such as tm and tidytext. The files are read using latin1 encoding, but some messages may still contain odd control codes.
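
One possible way to deal with stray control codes before passing the text to other tools is sketched below in base R. The helper `strip_control_codes` is hypothetical and not part of snedata, and the toy strings are made up. It removes ASCII control characters while keeping tabs, newlines, and carriage returns, since the Text column preserves newlines. It assumes the text is valid in the session's encoding; non-ASCII text may need converting first.

```r
# Hypothetical helper (not part of snedata): remove ASCII control characters
# but keep tab (\x09), newline (\x0A) and carriage return (\x0D)
strip_control_codes <- function(x) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "", x, perl = TRUE)
}

# Toy messages standing in for the Text column (made-up content):
# the first contains a bell (\x07) and an escape (\x1b) character
toy_text <- c("Subject: hello\x07\n\nBody\x1b text", "No control codes here")

cleaned <- strip_control_codes(toy_text)
```

Applied to real data, this would be used as `ng$Text <- strip_control_codes(ng$Text)`, where `ng` is the data frame returned by `download_twenty_newsgroups`.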

Value

A data frame containing the 20 Newsgroups data, with the columns described in Format.

References

Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning 1995 (pp. 331-339). Morgan Kaufmann.

Examples

## Not run: 

# Download and process the training set
ng_train <- download_twenty_newsgroups(subset = "train")

# Download and process both training and test sets, with verbose output
ng_all <- download_twenty_newsgroups(subset = "all", verbose = TRUE)

# Download and process the test set, using a specific directory and enabling
# cleanup
ng_test <- download_twenty_newsgroups(
  subset = "test",
  tmpdir = "path/to/dir", cleanup = TRUE
)

## End(Not run)

jlmelville/snedata documentation built on March 5, 2025, 12:22 p.m.