prepare_vocab: Format a Token List as a Vocabulary

View source: R/vocab.R

prepare_vocab R Documentation

Format a Token List as a Vocabulary

Description

We use a character vector with class morphemepiece_vocabulary to provide information about tokens used in morphemepiece_tokenize. This function takes a character vector of tokens and puts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
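The zero-based indexing convention described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the names tokens and token_index are hypothetical, and the sketch does not call prepare_vocab or depend on how the returned object stores its indices.

```r
# Hypothetical token list; any character vector behaves the same way.
tokens <- c("some", "example", "tokens")

# Zero-based indices: each token's position in the vector, minus one.
token_index <- seq_along(tokens) - 1L
names(token_index) <- tokens

# The index a model would see for a given token.
token_index[["example"]]

# Map an index back to its token. R vectors are 1-based, so add one.
tokens[token_index[["tokens"]] + 1L]
```

Because the index is derived purely from position, reordering, inserting, or removing tokens changes the index of every token after the change, which is why a vocabulary paired with a pre-trained model has to keep its original order.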

Examples

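library(morphemepiece)

# Format a plain character vector as a morphemepiece vocabulary.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)             # includes "morphemepiece_vocabulary"
attr(my_vocab, "is_cased")  # casedness inferred from the tokens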

morphemepiece documentation built on April 16, 2022, 5:05 p.m.