load_or_retrieve_vocab: Load a vocabulary file, or retrieve from cache

View source: R/vocab.R

load_or_retrieve_vocab    R Documentation

Load a vocabulary file, or retrieve from cache

Description

Usually you will want to use the included vocabulary, which can be accessed via morphemepiece_vocab(). Use this function to load (and cache) a different vocabulary from a file.

Usage

load_or_retrieve_vocab(vocab_file)

Arguments

vocab_file

Path to the vocabulary file. The file is assumed to be a text file with one token per line, where the line number (starting at zero) is the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
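
The zero-based convention can be illustrated with a minimal base R sketch. It does not call morphemepiece and is not the package's implementation: the file contents and the helper name read_vocab_sketch are hypothetical, and caching and casedness inference are left out.

```r
# Hypothetical vocabulary file: one token per line.
vocab_path <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "the", "##s"), vocab_path)

# Sketch only (assumed helper, not part of morphemepiece):
# read the tokens and pair each one with its zero-based line number.
read_vocab_sketch <- function(path) {
  tokens <- readLines(path)
  stats::setNames(seq_along(tokens) - 1L, tokens)
}

index_of <- read_vocab_sketch(vocab_path)

# "the" is on the third line of the file, so its vocabulary index is 2.
index_of[["the"]]

# R vectors are one-based, so going from a vocabulary index back to
# its token requires adding 1 to the index.
names(index_of)[2 + 1]
```

Because a pre-trained model only ever sees the numeric indices, reordering the lines of the file would silently change which token each index refers to.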


morphemepiece documentation built on April 16, 2022, 5:05 p.m.