tokenize_word: Tokenize a single "word" (no whitespace).

View source: R/tokenization.R

tokenize_word    R Documentation

Tokenize a single "word" (no whitespace).

Description

In BERT's tokenization.py, this code is inside the tokenize method of WordpieceTokenizer objects. I've moved it into its own function for clarity. Punctuation should already have been removed from the word before it is passed in.
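
The underlying algorithm is a greedy "longest-match-first" wordpiece split. Below is a minimal sketch of that algorithm in R, assuming vocab is a named vector whose names are the vocabulary tokens (as in the Examples); it illustrates the technique and is not RBERT's exact implementation.

wordpiece_sketch <- function(word, vocab, unk_token = "[UNK]", max_chars = 100) {
  # Overlong words are mapped directly to the unknown token.
  if (nchar(word) > max_chars) {
    return(unk_token)
  }
  tokens <- character(0)
  start <- 1
  len <- nchar(word)
  while (start <= len) {
    end <- len
    cur_piece <- NA_character_
    # Try the longest remaining substring first, shrinking until a match.
    while (start <= end) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)  # non-initial pieces carry the ## prefix
      }
      if (piece %in% names(vocab)) {
        cur_piece <- piece
        break
      }
      end <- end - 1
    }
    # If no substring matched, the whole word is unknown.
    if (is.na(cur_piece)) {
      return(unk_token)
    }
    tokens <- c(tokens, cur_piece)
    start <- end + 1
  }
  tokens
}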

Usage

tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Vocabulary to match word pieces against. In the examples below, this is a named vector whose names are the vocabulary tokens.

unk_token

Token to represent unknown words.

max_chars

Maximum length (in characters) of a word to tokenize; longer words are mapped to unk_token.

Value

Input word as a list of tokens.

Examples

tokenize_word("unknown", vocab = c("un" = 0, "##known" = 1))
tokenize_word("known", vocab = c("un" = 0, "##known" = 1))
