dot-mp_tokenize_word: Tokenize a Word

.mp_tokenize_word    R Documentation

Tokenize a Word

Description

Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but typically punctuation has been split off by this point.

Usage

.mp_tokenize_word(
  word,
  vocab_split,
  dir = 1,
  allow_compounds = TRUE,
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

word

Word to tokenize.

vocab_split

List of character vectors containing vocabulary words, with components named "prefixes", "words", and "suffixes" (see the sketch after this list).

dir

Integer; if 1 (the default), look for tokens starting at the beginning of the word. Otherwise, start at the end.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown.

unk_token

Token to represent unknown words.

max_chars

Integer; maximum length of word recognized. Longer words are represented by unk_token.
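
For illustration, a toy vocab_split might be built like this. The specific pieces are invented, and the leading-"##" suffix marking is an assumption based on wordpiece-style notation; prefixes use the trailing-"##" form described under Details:

vocab_split <- list(
  prefixes = c("re##", "un##"),        # prefix pieces, marked with a trailing "##"
  words    = c("think", "do", "able"), # whole-word pieces
  suffixes = c("##ing", "##s")         # suffix pieces, marked with a leading "##"
)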

Details

This is an adaptation of wordpiece:::.tokenize_word. The main difference is that it is designed to work with a morphemepiece vocabulary, which can include prefixes (denoted like "pre##"). As in wordpiece, the algorithm uses a repeated greedy search for the largest piece from the vocabulary found within the word, but the search can start from either the beginning or the end of the word (controlled by the dir parameter). The input vocabulary must be split into prefixes, suffixes, and "words".
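
As a minimal sketch of a call, using the toy vocab_split above (the function is internal, so it is accessed with ":::"; the breakdown shown is illustrative, not verified output):

morphemepiece:::.mp_tokenize_word(
  word = "rethinking",
  vocab_split = vocab_split,
  dir = 1
)
# Illustrative breakdown: "re##" "think" "##ing"

Passing a dir value other than 1 starts the greedy search from the end of the word instead, which can produce a different breakdown for ambiguous words.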

Value

Input word as a list of tokens.

