prepare_vocab: Format a Token List as a Vocabulary
In wordpiece: R Implementation of Wordpiece Tokenization

prepare_vocab

R Documentation

Format a Token List as a Vocabulary

Description

We use a special named integer vector with class wordpiece_vocabulary to provide information about tokens used in wordpiece_tokenize. This function takes a character vector of tokens and puts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

The vocabulary as a character vector of tokens. Whether the vocabulary is cased is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
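The zero-based convention can be illustrated with a minimal base R sketch. It assumes the vocabulary behaves like a plain character vector; token_index is a hypothetical helper written for this illustration, not a function exported by wordpiece.

```r
# Hypothetical helper (not part of wordpiece): map tokens to zero-based
# indices using their position in a plain character vector.
token_index <- function(vocab, tokens) {
  # match() returns one-based positions (NA for tokens not in vocab);
  # subtracting 1 gives the zero-based convention described above.
  match(tokens, vocab) - 1L
}

vocab <- c("some", "example", "tokens")
idx <- token_index(vocab, c("tokens", "some", "missing"))
idx  # 2 0 NA
```

Because R itself indexes from one, any code that uses these indices to subset an R object has to add one back.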

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
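To see why the order matters, the following base R sketch compares the indices of the same tokens before and after reordering a vocabulary. It assumes the index of a token is simply its zero-based position; zero_based_index is a hypothetical helper, not a wordpiece function.

```r
# Hypothetical helper (not part of wordpiece): zero-based position lookup.
zero_based_index <- function(vocab, tokens) match(tokens, vocab) - 1L

original  <- c("some", "example", "tokens")
reordered <- rev(original)  # "tokens" "example" "some"

old_ids <- zero_based_index(original,  c("some", "example"))  # 0 1
new_ids <- zero_based_index(reordered, c("some", "example"))  # 2 1

# The same tokens now map to different indices, so a model trained
# against `original` would look up the wrong entries.
same_mapping <- identical(old_ids, new_ids)
same_mapping  # FALSE
```

A pre-trained model only ever sees the integers, so the vocabulary should be kept in the exact order it was distributed with.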

Examples

my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")

wordpiece documentation built on March 18, 2022, 5:55 p.m.
