imdb_dataset: IMDB movie review sentiment classification dataset
In torchdatasets: Ready to Use Extra Datasets for Torch

Description

The format of this dataset is meant to replicate that provided by Keras.

Usage

imdb_dataset(
  root,
  download = FALSE,
  split = "train",
  shuffle = (split == "train"),
  num_words = Inf,
  skip_top = 0,
  maxlen = Inf,
  start_char = 2,
  oov_char = 3,
  index_from = 4
)
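
As a rough illustration of the signature above, the sketch below wraps a call in a small helper. `load_imdb_sketch` is a hypothetical name, not part of torchdatasets, and the values chosen for `num_words` and `maxlen` are arbitrary. It assumes the `torch` and `torchdatasets` packages are installed and that a network connection is available for the download.

```r
# Hypothetical wrapper (not part of torchdatasets): loads the training split
# with a capped vocabulary and a capped sequence length.
load_imdb_sketch <- function(root = file.path(tempdir(), "imdb")) {
  # Create the target directory up front; harmless if it already exists.
  dir.create(root, showWarnings = FALSE, recursive = TRUE)

  torchdatasets::imdb_dataset(
    root = root,
    download = TRUE,     # fetch the data into `root`
    split = "train",
    num_words = 10000,   # arbitrary cap, for illustration only
    maxlen = 200         # arbitrary cap, for illustration only
  )
}
```

Calling `ds <- load_imdb_sketch()` and then `length(ds)` and `str(ds[1])` should show the number of reviews and the structure of a single item. The exact fields of an item are not documented on this page, so inspect them rather than assume them.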

Arguments

`root`	path to the data location
`download`	whether to download the data or not
`split`	which split to load: `"train"`, `"test"` or `"valid"`
`shuffle`	whether to shuffle the dataset. Defaults to `TRUE` if `split == "train"`, otherwise `FALSE`.
`num_words`	Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word will appear as oov_char value in the sequence data. Defaults to `Inf`, so all words are kept.
`skip_top`	skip the top N most frequently occurring words (which may not be informative). These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
`maxlen`	int or `Inf`. Maximum sequence length. Any longer sequence will be truncated. Defaults to Inf, which means no truncation.
`start_char`	The start of a sequence will be marked with this character. Defaults to 2, because 1 is usually the padding character.
`oov_char`	int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
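`index_from`	int. Actual words are indexed starting from this value; lower values are reserved for special characters such as start_char and oov_char. Defaults to 4.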
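
The way `num_words`, `skip_top`, `maxlen`, `start_char`, `oov_char` and `index_from` interact can be sketched in base R. This is a minimal illustration of the behaviour described above, not the package's implementation. `encode_ranks` is a hypothetical helper that takes word frequency ranks (1 = most frequent). It assumes that the word with rank 1 is mapped to `index_from`, and that truncation keeps the beginning of the sequence; neither detail is confirmed by this page.

```r
# Hypothetical helper (not part of torchdatasets): encode a review given as
# word frequency ranks, following the argument descriptions above.
encode_ranks <- function(ranks, num_words = Inf, skip_top = 0, maxlen = Inf,
                         start_char = 2, oov_char = 3, index_from = 4) {
  # Keep a word only if it is not among the `skip_top` most frequent words
  # and is within the `num_words` most frequent words.
  keep <- ranks > skip_top & ranks <= num_words

  # Assumption: rank 1 maps to `index_from`, rank 2 to `index_from + 1`, ...
  ids <- ifelse(keep, ranks + index_from - 1, oov_char)

  # Mark the start of the sequence.
  ids <- c(start_char, ids)

  # Assumption: truncation keeps the first `maxlen` elements.
  if (is.finite(maxlen) && length(ids) > maxlen) {
    ids <- ids[seq_len(maxlen)]
  }
  ids
}

encode_ranks(c(1, 5, 12, 300), num_words = 100, skip_top = 1)
# 2 3 8 15 3
```

Here rank 1 is dropped by `skip_top = 1` and rank 300 by `num_words = 100`, so both become `oov_char` (3). Ranks 5 and 12 are shifted to 8 and 15, and the sequence starts with `start_char` (2).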