.wp_tokenize_word                R Documentation

Tokenize a Word

Description

Tokenize a single "word", i.e. a string containing no whitespace. The word can technically contain punctuation, but in BERT's tokenization pipeline, punctuation has already been split off before this function is called.

Usage

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Character vector of vocabulary tokens. The tokens are assumed to be ordered by index, with indexing starting at zero for compatibility with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word to tokenize; longer words are mapped to unk_token.

Value

Input word as a list of tokens.
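
Examples

The following is a minimal usage sketch, not taken from the package itself. Since .wp_tokenize_word is internal to wordpiece, it is accessed here with the ::: operator, and the toy vocabulary below is illustrative rather than a real BERT vocabulary.

# Toy vocabulary; the first element ("[UNK]") sits at index 0, per the
# zero-based indexing convention described under 'vocab' above.
vocab <- c("[UNK]", "un", "##aff", "##able")

# A word covered by the vocabulary is split into wordpieces, with "##"
# marking continuation pieces.
wordpiece:::.wp_tokenize_word("unaffable", vocab = vocab)
# Expected: "un" "##aff" "##able"

# A word that cannot be built from vocabulary pieces falls back to the
# unknown token.
wordpiece:::.wp_tokenize_word("xyzzy", vocab = vocab)
# Expected: "[UNK]"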
