tokenize_word    R Documentation
Description

In BERT's tokenization.py, this code is inside the tokenize method of WordpieceTokenizer objects. I've moved it into its own function for clarity. Punctuation should already have been removed from the word.
Usage

tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)
Arguments

word        Word to tokenize.
vocab       Character vector containing vocabulary words.
unk_token   Token to represent unknown words.
max_chars   Maximum length of word recognized.
Value

The input word as a list of tokens.
tokenize_word("unknown", vocab = c("un" = 0, "##known" = 1)) tokenize_word("known", vocab = c("un" = 0, "##known" = 1))