dot-wp_tokenize_single_string: Tokenize an Input Word-by-word

.wp_tokenize_single_string    R Documentation

Tokenize an Input Word-by-word

Description

Tokenize a single space-tokenized input, word by word, against a WordPiece vocabulary.

Usage

.wp_tokenize_single_string(words, vocab, unk_token, max_chars)

Arguments

words

Character; a vector of words (generated by space-tokenizing a single input).

vocab

Character vector of vocabulary tokens, assumed to be in index order, with the first token at index zero for compatibility with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word to be recognized.

Value

A named integer vector of tokenized words: vocabulary indices named by the corresponding tokens.
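
Examples

The leading dot marks this as an internal helper, so it is not exported; the minimal sketch below calls it directly with ':::'. The toy vocabulary, input words, and the commented result are illustrative assumptions, not examples shipped with the package.

# Toy vocabulary; indices are implicitly zero-based, so "[UNK]" has id 0.
vocab <- c("[UNK]", "un", "##aff", "##able", "affable")

# A single input that has already been split on whitespace.
words <- c("unaffable", "affable")

# Internal function, so access it via ':::'.
wordpiece:::.wp_tokenize_single_string(
  words = words,
  vocab = vocab,
  unk_token = "[UNK]",
  max_chars = 100
)
# Expected to look something like (names are tokens, values are zero-based ids):
#     un  ##aff ##able affable
#      1      2      3       4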
