dot-wp_tokenize_single_string: Tokenize an Input Word-by-word

.wp_tokenize_single_string    R Documentation

Tokenize an Input Word-by-word

Description

Tokenize a single space-tokenized input, word by word, against a WordPiece vocabulary.

Usage

.wp_tokenize_single_string(words, vocab, unk_token, max_chars)

Arguments

words

Character; a vector of words (generated by space-tokenizing a single input).

vocab

Character vector of vocabulary tokens, assumed to be in index order, with the first token at index zero for compatibility with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word to be recognized.

Value

A named integer vector of tokenized words: vocabulary indices named by the corresponding tokens.
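
Examples

The leading dot marks this as an internal helper, so it is not exported; the minimal sketch below calls it directly with ':::'. The toy vocabulary, input words, and the commented result are illustrative assumptions, not examples shipped with the package.

# Toy vocabulary; indices are implicitly zero-based, so "[UNK]" has id 0.
vocab <- c("[UNK]", "un", "##aff", "##able", "affable")

# A single input that has already been split on whitespace.
words <- c("unaffable", "affable")

# Internal function, so access it via ':::'.
wordpiece:::.wp_tokenize_single_string(
  words = words,
  vocab = vocab,
  unk_token = "[UNK]",
  max_chars = 100
)
# Expected to look something like (names are tokens, values are zero-based ids):
#     un  ##aff ##able affable
#      1      2      3       4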
