.wp_tokenize_word (R Documentation)
Description

Tokenize a single "word" (a string containing no whitespace). The word can technically contain punctuation, but in BERT's tokenization, punctuation has been split out by this point.

Usage

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)
Arguments

word        Word to tokenize.

vocab       Character vector of vocabulary tokens. The tokens are assumed to be in index order, with the first token at index zero, for compatibility with Python implementations.

unk_token   Token used to represent unknown words.

max_chars   Maximum length of word recognized.
Value

The input word, broken into a list of tokens.
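The standard wordpiece procedure this function corresponds to is greedy longest-match-first matching against the vocabulary, with continuation pieces marked by a "##" prefix. The following Python sketch illustrates that algorithm; the names mirror the R signature above, but this is an assumption-based illustration, not the package's actual implementation.

```python
def wp_tokenize_word(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first wordpiece tokenization (illustrative sketch)."""
    if len(word) > max_chars:
        # Words longer than max_chars are treated as unknown.
        return [unk_token]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Find the longest vocabulary entry starting at `start`.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No subword matched: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "aff", "##aff", "##able"}
print(wp_tokenize_word("unaffable", vocab))  # ['un', '##aff', '##able']
```

Note that a failure anywhere in the word collapses the entire word to the unknown token, rather than emitting the pieces matched so far; this matches the behavior of BERT's reference tokenizer.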