create_vocabulary: Creates a vocabulary of unique terms

Description Usage Arguments Value Methods (by class) Examples

View source: R/vocabulary.R

Description

This function collects unique terms and their corresponding statistics (term counts and document counts). See the sections below for details.

Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max
  = 1L), stopwords = character(0), sep_ngram = "_")

## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max =
  1L), stopwords = character(0), sep_ngram = "_")

## S3 method for class 'list'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max =
  1L), stopwords = character(0), sep_ngram = "_", ...)

## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L,
  ngram_max = 1L), stopwords = character(0), sep_ngram = "_", ...)

Arguments

it

iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").

ngram

integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

stopwords

character vector of stopwords to filter out. NOTE that stopwords will be used "as is". This means that if the preprocessing function in itoken modifies the text (for example by stemming), the same preprocessing needs to be applied to the stopwords before passing them here. See https://github.com/dselivanov/text2vec/issues/228 for an example.
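
The point above can be sketched as follows. This is a hedged example, not part of the package documentation: it assumes the SnowballC package for stemming (text2vec itself does not require it) and uses the movie_review data set shipped with text2vec; the stopword list here is arbitrary and purely illustrative.

```r
library(text2vec)
library(SnowballC)  # assumed here only for stemming

data("movie_review")
txt = movie_review[['review']][1:100]

# tokenizer that stems every token, so the vocabulary will contain stems
stem_tokenizer = function(x) lapply(word_tokenizer(x), wordStem, language = "english")
it = itoken(txt, tolower, stem_tokenizer)

# apply the SAME stemming to the stopwords before passing them in,
# otherwise unstemmed stopwords will not match the stemmed vocabulary terms
raw_stopwords = c("movies", "watching", "actually")
stemmed_stopwords = wordStem(raw_stopwords, language = "english")

vocab = create_vocabulary(it, stopwords = stemmed_stopwords)
```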

sep_ngram

character. A string used to concatenate the words of an n-gram into a single term.

...

additional arguments passed to the foreach function.

Value

text2vec_vocabulary object, which is a data.frame with the following columns:

term

character vector of unique terms

term_count

integer vector of term counts across all documents

doc_count

integer vector of document counts that contain corresponding term

It also carries meta-information in attributes: ngram (integer vector, the lower and upper boundary of the n-gram range), document_count (integer, the number of documents the vocabulary was built from), stopwords (character vector of stopwords), sep_ngram (character separator for n-grams).
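
These attributes can be inspected with attr(). A minimal sketch (assumes the movie_review data set bundled with text2vec):

```r
library(text2vec)

data("movie_review")
it = itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
vocab = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))

# meta-information stored as attributes on the vocabulary object
attr(vocab, "ngram")           # n-gram range the vocabulary was built with
attr(vocab, "document_count")  # number of documents processed
attr(vocab, "stopwords")       # stopwords that were filtered out
attr(vocab, "sep_ngram")       # separator used inside n-gram terms
```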

Methods (by class)

character: creates a vocabulary from a character vector of user-defined terms (terms are used "as is")

itoken: collects unique terms and statistics from documents supplied via an itoken iterator

list: collects unique terms and statistics from a list of itoken iterators

itoken_parallel: collects unique terms and statistics in parallel from an itoken_parallel iterator

Examples

data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
                                doc_proportion_max = 0.8,
                                doc_proportion_min = 0.001,
                                vocab_term_max = 20000)
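
A further hedged sketch, not from the original examples, showing a unigram-plus-bigram vocabulary; the words of each bigram are joined with sep_ngram (assumes the same movie_review data):

```r
library(text2vec)

data("movie_review")
it = itoken(movie_review[['review']][1:100], tolower, word_tokenizer)

# extract both unigrams and bigrams; bigram terms look like "word1_word2"
bigram_vocab = create_vocabulary(it,
                                 ngram = c(ngram_min = 1L, ngram_max = 2L),
                                 sep_ngram = "_")
head(bigram_vocab)
```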

text2vec documentation built on Jan. 12, 2018, 1:04 a.m.