.mp_tokenize_word    R Documentation

Tokenize a single "word" (no whitespace)

Description
Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but typically punctuation has been split off by this point.
Usage

.mp_tokenize_word(
  word,
  vocab_split,
  dir = 1,
  allow_compounds = TRUE,
  unk_token = "[UNK]",
  max_chars = 100
)
Arguments

word
    Word to tokenize.

vocab_split
    List of character vectors containing vocabulary words. Should have
    components named "prefixes", "words", "suffixes" (see the sketch after
    this list).

dir
    Integer; if 1 (the default), look for tokens starting at the beginning
    of the word. Otherwise, start at the end.

allow_compounds
    Logical; whether to allow multiple whole words in the breakdown.

unk_token
    Token to represent unknown words.

max_chars
    Maximum length of word recognized.
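For illustration, here is a minimal sketch of the expected vocab_split shape.
The pieces in this toy vocabulary are hypothetical, and the leading "##" on
suffixes is assumed by analogy with wordpiece continuation tokens.

# Toy morphemepiece-style vocabulary, split into the three components
# that .mp_tokenize_word expects. The specific pieces are made up.
vocab_split <- list(
  prefixes = c("un##", "re##"),            # prefixes are written like "pre##"
  words    = c("affable", "run", "able"),  # whole words
  suffixes = c("##able", "##ing")          # assumed "##"-marked suffixes
)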
Details

This is an adaptation of wordpiece:::.tokenize_word. The main difference is
that it is designed to work with a morphemepiece vocabulary, which can
include prefixes (denoted like "pre##"). As in wordpiece, the algorithm uses
a repeated greedy search for the largest piece from the vocabulary found
within the word, but starting from either the beginning or the end of the
word (controlled by the dir parameter). The input vocabulary must be split
into prefixes, suffixes, and "words".
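To make the greedy search concrete, the following self-contained sketch shows
a simplified, forward-only version of the idea (take the largest matching
piece, then repeat on the remainder). It is an illustration of the strategy
described above, not the package's implementation, and it ignores prefixes,
suffixes, allow_compounds, and max_chars.

# Simplified greedy longest-match loop (dir = 1 flavor); illustration only.
greedy_tokenize <- function(word, pieces, unk_token = "[UNK]") {
  tokens <- character(0)
  remaining <- word
  while (nchar(remaining) > 0) {
    found <- FALSE
    # Try the longest candidate first, shrinking until something matches.
    for (len in seq(nchar(remaining), 1)) {
      candidate <- substr(remaining, 1, len)
      if (candidate %in% pieces) {
        tokens <- c(tokens, candidate)
        remaining <- substr(remaining, len + 1, nchar(remaining))
        found <- TRUE
        break
      }
    }
    if (!found) return(unk_token)  # no piece matches: whole word is unknown
  }
  tokens
}

greedy_tokenize("runnable", c("run", "nable", "able", "n"))
# returns "run" "nable": the largest piece wins at each step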
Value

The input word as a list of tokens.
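Examples

The function is internal, so it would normally be reached with ":::". The
call below is a sketch that assumes the toy vocab_split defined under
Arguments is in scope; the commented result is indicative only.

# Tokenize one word with the toy vocabulary defined above.
morphemepiece:::.mp_tokenize_word(
  word        = "unaffable",
  vocab_split = vocab_split,
  dir         = 1
)
# expected to break the word into pieces such as "un##" "affable"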