.wp_tokenize_word                R Documentation

Tokenize a Word

Description

Tokenize a single "word", i.e. a string containing no whitespace. The word can technically contain punctuation, but in BERT's tokenization pipeline, punctuation has already been split off before this function is called.

Usage

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Character vector of vocabulary tokens. The tokens are assumed to be ordered by index, with indexing starting at zero for compatibility with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word to tokenize; longer words are mapped to unk_token.

Value

Input word as a list of tokens.
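
Examples

The following is a minimal usage sketch, not taken from the package itself. Since .wp_tokenize_word is internal to wordpiece, it is accessed here with the ::: operator, and the toy vocabulary below is illustrative rather than a real BERT vocabulary.

# Toy vocabulary; the first element ("[UNK]") sits at index 0, per the
# zero-based indexing convention described under 'vocab' above.
vocab <- c("[UNK]", "un", "##aff", "##able")

# A word covered by the vocabulary is split into wordpieces, with "##"
# marking continuation pieces.
wordpiece:::.wp_tokenize_word("unaffable", vocab = vocab)
# Expected: "un" "##aff" "##able"

# A word that cannot be built from vocabulary pieces falls back to the
# unknown token.
wordpiece:::.wp_tokenize_word("xyzzy", vocab = vocab)
# Expected: "[UNK]"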
