download_twenty_newsgroups | R Documentation |
Downloads the 20 Newsgroups dataset, which contains approximately 20,000 newsgroup documents from 20 different newsgroups. The distribution is approximately balanced.
download_twenty_newsgroups(
  subset = "all",
  verbose = FALSE,
  tmpdir = NULL,
  cleanup = TRUE
)
subset |
A string specifying which subset of the dataset to download and
process. Acceptable values are |
verbose |
If |
tmpdir |
A string specifying the directory where the dataset will be
downloaded and extracted. If |
cleanup |
A logical flag indicating whether to delete the downloaded and
extracted files after processing. If |
A data frame with 6 variables:
Id
A unique identifier for the document, consisting of the subset concatenated with the position in the subset, e.g. train_1.
FileId
The integer identifier of the document, from the filename of the downloaded data. Be aware that these are not unique.
Text
The full text of the message including any header, footer, and quotes. Newlines are preserved.
Subset
A factor with two levels, train and test, indicating whether the document is from the training or test subset.
Label
The newsgroup represented by an integer id, in the range 0-19.
Newsgroup
A factor with 20 levels, indicating the newsgroup that the document belongs to.
The labels correspond to:
 0  alt.atheism
 1  comp.graphics
 2  comp.os.ms-windows.misc
 3  comp.sys.ibm.pc.hardware
 4  comp.sys.mac.hardware
 5  comp.windows.x
 6  misc.forsale
 7  rec.autos
 8  rec.motorcycles
 9  rec.sport.baseball
10  rec.sport.hockey
11  sci.crypt
12  sci.electronics
13  sci.med
14  sci.space
15  soc.religion.christian
16  talk.politics.guns
17  talk.politics.mideast
18  talk.politics.misc
19  talk.religion.misc
These names are also present as the Newsgroup factor.
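The relationship between Label and Newsgroup can be sketched in base R. This is only an illustration of the mapping listed above, not the package's own implementation; the names newsgroup_names and label_to_newsgroup are hypothetical and not part of the package.

```r
# Newsgroup names in label order (label 0 first), taken from the list above.
newsgroup_names <- c(
  "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
  "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
  "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
  "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
  "sci.space", "soc.religion.christian", "talk.politics.guns",
  "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc"
)

# Hypothetical helper: convert 0-based integer labels to a factor of names.
label_to_newsgroup <- function(label) {
  # R vectors are 1-indexed, so shift the 0-19 labels by one.
  factor(newsgroup_names[label + 1L], levels = newsgroup_names)
}

example_groups <- label_to_newsgroup(c(0L, 14L, 19L))
print(example_groups)
```

Because the labels start at 0, remember the offset of one whenever you index an R vector by Label.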
There are 11,314 items in the train dataset and 7,532 items in the test dataset, for a total of 18,846 items if you choose subset = "all".
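After downloading with subset = "all", you may want to check the subset sizes and split the data frame. The following is a minimal sketch that uses a small made-up stand-in data frame (ng, with invented FileId values) in place of the real download, so it runs without network access.

```r
# Toy stand-in for the returned data frame; all values are made up.
ng <- data.frame(
  Id = c("train_1", "train_2", "test_1", "test_2"),
  FileId = c(49960L, 51060L, 49960L, 53068L),
  Subset = factor(c("train", "train", "test", "test"),
                  levels = c("train", "test")),
  stringsAsFactors = FALSE
)

# Count documents per subset.
subset_counts <- table(ng$Subset)
print(subset_counts)

# Split into separate training and test data frames.
ng_train <- ng[ng$Subset == "train", ]
ng_test <- ng[ng$Subset == "test", ]

# Id is unique across the data frame, but FileId need not be.
id_is_unique <- anyDuplicated(ng$Id) == 0L
file_id_is_unique <- anyDuplicated(ng$FileId) == 0L
```

With the real data, the same table() call should report the train and test counts given above. Use Id rather than FileId as a key when joining or deduplicating.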
To do any analysis on this text, you will want to use tools from packages such as tm and tidytext. The files are read with latin1 encoding, but there can still be some odd control codes in some of the messages.
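One way to deal with stray control codes before tokenizing is sketched below in base R. It is an assumption about what reasonable cleanup might look like, not something the package does for you; strip_control_codes is a hypothetical helper and the message is made up. For real analysis, the tokenizers in tm or tidytext are likely a better fit than the simple strsplit() call shown here.

```r
# Hypothetical helper: remove ASCII control characters but keep tab,
# newline, and carriage return so the line structure is preserved.
strip_control_codes <- function(text) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "", text, perl = TRUE)
}

# Made-up message containing a stray control code (\001) and a DEL (\177).
msg <- "Hello\001 world\nSecond line\177"
clean <- strip_control_codes(msg)

# Very simple tokenization: split on runs of non-alphanumeric characters.
tokens <- tolower(unlist(strsplit(clean, "[^[:alnum:]]+")))
tokens <- tokens[nzchar(tokens)]
print(tokens)
```

To apply this to the downloaded data, pass the Text column to the helper, e.g. strip_control_codes(ng_all$Text), since gsub() is vectorized over its input.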
Data frame containing 20 Newsgroups Data.
Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning 1995 (pp. 331-339). Morgan Kaufmann.
http://qwone.com/~jason/20Newsgroups/
See Chapter 9 of Text Mining with R: A Tidy Approach for a case study using the same dataset.
## Not run:
# Download and process the training set
ng_train <- download_twenty_newsgroups(subset = "train")
# Download and process both training and test sets, with verbose output
ng_all <- download_twenty_newsgroups(subset = "all", verbose = TRUE)
# Download and process the test set, using a specific directory and enabling
# cleanup
ng_test <- download_twenty_newsgroups(
  subset = "test",
  tmpdir = "path/to/dir", cleanup = TRUE
)
## End(Not run)