prepare_vocab: Format a Token List as a Vocabulary
In morphemepiece: Morpheme Tokenization

prepare_vocab

R Documentation

Format a Token List as a Vocabulary

Description

morphemepiece_tokenize expects its vocabulary as a character vector with class morphemepiece_vocabulary. This function takes a plain character vector of tokens and converts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

The vocabulary, returned as a character vector of tokens. The casedness of the vocabulary is inferred from the tokens and attached as the "is_cased" attribute. The index of each token is its position in the vector, counted from zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
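The zero-based convention can be illustrated with a minimal base R sketch. The helper token_to_index below is hypothetical and is not part of morphemepiece; it assumes only that a token's index is its position in the vocabulary vector minus one, as described above.

```r
# Hypothetical helper (not part of morphemepiece): look up the zero-based
# index of each token from its position in the vocabulary vector.
token_to_index <- function(tokens, vocab) {
  # match() returns 1-based positions (NA for tokens not in the vocabulary);
  # subtracting 1 shifts them to the zero-based convention.
  match(tokens, vocab) - 1L
}

vocab <- c("some", "example", "tokens")
token_to_index(c("some", "tokens"), vocab)
# [1] 0 2

# Reordering the vocabulary changes the index of every token that moves,
# so a model trained on the original order would see different inputs.
reordered <- rev(vocab)
token_to_index("some", reordered)
# [1] 2
```

The token "some" maps to 0 in the original order and to 2 after reordering, which is the kind of silent change that would break a pre-trained model.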

Examples

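library(morphemepiece)
# Build a vocabulary object from a plain character vector of tokens.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)             # includes "morphemepiece_vocabulary"
attr(my_vocab, "is_cased")  # casedness inferred from the tokens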

morphemepiece documentation built on April 16, 2022, 5:05 p.m.
