dot-mp_tokenize_word: Tokenize a Word

.mp_tokenize_word    R Documentation

Tokenize a Word

Description

Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but typically punctuation has been split off by this point.

Usage

.mp_tokenize_word(
  word,
  vocab_split,
  dir = 1,
  allow_compounds = TRUE,
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

word

Word to tokenize.

vocab_split

List of character vectors containing vocabulary words, with components named "prefixes", "words", and "suffixes" (see the sketch after this list).

dir

Integer; if 1 (the default), look for tokens starting at the beginning of the word. Otherwise, start at the end.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown.

unk_token

Token to represent unknown words.

max_chars

Integer; maximum length of word recognized. Longer words are represented by unk_token.
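
For illustration, a toy vocab_split might be built like this. The specific pieces are invented, and the leading-"##" suffix marking is an assumption based on wordpiece-style notation; prefixes use the trailing-"##" form described under Details:

vocab_split <- list(
  prefixes = c("re##", "un##"),        # prefix pieces, marked with a trailing "##"
  words    = c("think", "do", "able"), # whole-word pieces
  suffixes = c("##ing", "##s")         # suffix pieces, marked with a leading "##"
)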

Details

This is an adaptation of wordpiece:::.tokenize_word. The main difference is that it is designed to work with a morphemepiece vocabulary, which can include prefixes (denoted like "pre##"). As in wordpiece, the algorithm uses a repeated greedy search for the largest piece from the vocabulary found within the word, but the search can start from either the beginning or the end of the word (controlled by the dir parameter). The input vocabulary must be split into prefixes, suffixes, and "words".
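
As a minimal sketch of a call, using the toy vocab_split above (the function is internal, so it is accessed with ":::"; the breakdown shown is illustrative, not verified output):

morphemepiece:::.mp_tokenize_word(
  word = "rethinking",
  vocab_split = vocab_split,
  dir = 1
)
# Illustrative breakdown: "re##" "think" "##ing"

Passing a dir value other than 1 starts the greedy search from the end of the word instead, which can produce a different breakdown for ambiguous words.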

Value

Input word as a list of tokens.

