create_vocabulary: Creates a vocabulary of unique terms
In text2vec: Modern Text Mining Framework for R

Description

This function collects unique terms and their corresponding statistics. See below for details.

Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)

## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L,
  ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
  window_size = 0L, ...)

## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L,
  ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
  window_size = 0L, ...)

## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L,
  ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
  window_size = 0L, ...)

Arguments

`it`	iterator over a `list` of `character` vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a `character` vector of user-defined vocabulary terms (which will be used "as is").
`ngram`	`integer` vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of `n` such that ngram_min <= n <= ngram_max will be used.
`stopwords`	`character` vector of stopwords to filter out. NOTE that stopwords are used "as is". If the preprocessing function in itoken modifies the text (for example, by stemming), the same preprocessing needs to be applied to the stopwords before passing them here. See https://github.com/dselivanov/text2vec/issues/228 for an example.
`sep_ngram`	`character` string used to concatenate words in n-grams.
`window_size`	`integer` (0 by default). If `window_size > 0`, then the vocabulary is created from pseudo-documents, which are obtained by virtually splitting each document into chunks of length `window_size` with a sliding window. This is useful for creating the special statistics used for coherence estimation in topic models.
`...`	placeholder for additional arguments (not used at the moment).
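
The arguments above can be combined as in the following minimal sketch. The two toy documents and the stopword list are made up for illustration, and the comments describe the intent of each call rather than guaranteed output. Note how the stopwords go through the same `tolower` preprocessing as the documents, because they are matched "as is".

```r
library(text2vec)

# Toy corpus (hypothetical data, for illustration only)
docs = c("The cat sat on the mat", "The dog sat on the log")

# Lower-case the text, then split it into word tokens
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)

# Stopwords are matched "as is", so apply the same preprocessing to them
stop_terms = tolower(c("The", "On"))

# Unigrams and bigrams, with bigram parts joined by "_"
vocab = create_vocabulary(it,
                          ngram = c(ngram_min = 1L, ngram_max = 2L),
                          stopwords = stop_terms,
                          sep_ngram = "_")
print(vocab)

# Vocabulary built from sliding-window pseudo-documents of 3 tokens each
vocab_window = create_vocabulary(it, window_size = 3L)
print(vocab_window)
```

With `window_size > 0`, the document-level statistics refer to the pseudo-documents rather than to the original documents.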

Value

A text2vec_vocabulary object, which is a data.frame with the following columns:

`term`	`character` vector of unique terms
`term_count`	`integer` vector of term counts across all documents
`doc_count`	`integer` vector with the number of documents that contain the corresponding term

It also stores meta-information in attributes:

`ngram`	`integer` vector, the lower and upper boundary of the range of n-gram values
`document_count`	`integer` number of documents the vocabulary was built from
`stopwords`	`character` vector of stopwords
`sep_ngram`	`character` separator for n-grams
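
As a small sketch of how the returned object might be inspected, the snippet below builds a vocabulary from two toy documents (made up for illustration) and reads the columns and attributes listed above with base R accessors.

```r
library(text2vec)

# Two toy documents (hypothetical data): "a" occurs twice, but only in the first one
docs = c("a b a", "b c")
it = itoken(docs, tokenizer = space_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it)

# The vocabulary is a data.frame, so the usual accessors work
colnames(vocab)
vocab[vocab$term == "a", ]

# Meta-information is stored in attributes
attr(vocab, "document_count")
attr(vocab, "ngram")
attr(vocab, "sep_ngram")
```

Here the row for `"a"` should show a `term_count` of 2 and a `doc_count` of 1, since both occurrences fall in the same document.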

Methods (by class)

character: creates text2vec_vocabulary from predefined character vector. Terms will be inserted as is, without any checks (ngrams number, ngram delimiters, etc.).
itoken: collects unique terms and corresponding statistics from an itoken iterator.
itoken_parallel: collects unique terms and corresponding statistics from a parallel iterator.
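
The character method can be sketched as follows. The term list is hypothetical; because the terms are inserted as is, an entry such as `"not_good"` is accepted without any check against the `ngram` or `sep_ngram` settings.

```r
library(text2vec)

# User-defined vocabulary terms (hypothetical, for illustration only)
terms = c("good", "bad", "not_good")

# Terms are taken as is; no documents are scanned
vocab = create_vocabulary(terms)
vocab$term
```

Since no corpus is read in this case, the count columns should not be expected to carry corpus statistics.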

Examples

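library(text2vec)

# Use the first 100 reviews of the bundled movie_review data set
data("movie_review")
txt = movie_review[['review']][1:100]

# Lower-case and tokenize the reviews, processing them in 10 chunks
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)

# Drop very rare and very common terms and cap the vocabulary size
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
  doc_proportion_min = 0.001, vocab_term_max = 20000)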

text2vec documentation built on Nov. 9, 2023, 9:07 a.m.