create_vocabulary: Create a pruned vocabulary from a token iterator


View source: R/create-vocabulary.R

Description

This function creates a vocabulary from a vector of documents. A vocabulary defines the domain of a natural language processing problem. Vocabularies are often used to create vectorisers, which map novel pieces of text onto the vocabulary defined by a training set. The vocabulary is often pruned to exclude tokens that appear in very many or very few documents. Pruning reduces the dimension of the problem, which decreases training time and the potential for overfitting.

Usage

create_vocabulary(documents, doc_proportion_min = 0, doc_proportion_max = 1)

Arguments

documents

A character vector in which each element is one document, often a sentence or paragraph.

doc_proportion_min

Optional. A number between 0 and 1 which specifies the minimum proportion of documents in which a token must appear in order to be included in the vocabulary. Defaults to 0 (no effect).

doc_proportion_max

Optional. A number between 0 and 1 which specifies the maximum proportion of documents in which a token may appear in order to be included in the vocabulary. Defaults to 1 (no effect).
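
To illustrate how the two proportion arguments interact, here is a minimal base R sketch. It is not the package's implementation: `doc_proportion_filter` is a hypothetical helper, the whitespace tokenisation is deliberately naive, and treating both bounds as inclusive is an assumption.

```r
# Hypothetical helper, for illustration only (not part of the package).
# Keeps the tokens whose document proportion lies within the given bounds.
doc_proportion_filter <- function(documents,
                                  doc_proportion_min = 0,
                                  doc_proportion_max = 1) {
  # Naive tokenisation: lower-case, then split on whitespace
  tokens_per_doc <- strsplit(tolower(documents), "\\s+")

  # Count each token at most once per document
  unique_tokens <- lapply(tokens_per_doc, unique)
  doc_counts <- table(unlist(unique_tokens))

  # Proportion of documents containing each token
  doc_proportions <- as.vector(doc_counts) / length(documents)

  # Assumption: both bounds are inclusive
  keep <- doc_proportions >= doc_proportion_min &
    doc_proportions <= doc_proportion_max

  sort(names(doc_counts)[keep])
}

documents <- c("the cat sat", "the dog sat", "the bird flew")

# "the" appears in every document and "cat", "dog", "bird" and "flew" in
# one document each, so only "sat" (two of three documents) is kept
doc_proportion_filter(documents,
                      doc_proportion_min = 0.5,
                      doc_proportion_max = 0.9)
```

With the default bounds of 0 and 1, every token is kept, which is why the defaults are described as having no effect.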

Value

A vocabulary object, as used by the text2vec package.
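
As a rough sketch of how a text2vec vocabulary can be built, pruned, and turned into a vectoriser, the following calls text2vec directly rather than this package's wrapper. The choice of `tolower` and `word_tokenizer` is an assumption made for illustration; the wrapper may preprocess and tokenise differently.

```r
library(text2vec)

documents <- c("the cat sat", "the dog sat", "the bird flew")

# Token iterator over the training documents
# (tolower and word_tokenizer are illustrative choices)
tokens <- text2vec::itoken(
  documents,
  preprocessor = tolower,
  tokenizer = text2vec::word_tokenizer,
  progressbar = FALSE
)

# Full vocabulary, then pruned by document proportion
vocabulary <- text2vec::create_vocabulary(tokens)
pruned <- text2vec::prune_vocabulary(
  vocabulary,
  doc_proportion_min = 0.5,
  doc_proportion_max = 0.9
)

# A vectoriser maps new text onto the pruned vocabulary
vectoriser <- text2vec::vocab_vectorizer(pruned)
new_tokens <- text2vec::itoken(
  "a cat sat down",
  preprocessor = tolower,
  tokenizer = text2vec::word_tokenizer,
  progressbar = FALSE
)
dtm <- text2vec::create_dtm(new_tokens, vectoriser)

# One row per new document, one column per term kept in the vocabulary
dim(dtm)
```

Tokens in the new text that are absent from the pruned vocabulary are dropped when the document-term matrix is built.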


mdneuzerling/ModelAsAPackage documentation built on Feb. 1, 2020, 12:57 a.m.