wordpiece_tokenize — R Documentation
Given a sequence of text and a wordpiece vocabulary, tokenizes the text.
Usage:

wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)
Arguments:

text: Character; text to tokenize.

vocab: Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero, for compatibility with Python implementations.

unk_token: Token used to represent unknown words.

max_chars: Integer; maximum length of word recognized.
Value:

A list of named integer vectors, one per input sequence, giving the tokenization of each input. The integer values are the token ids, and the names are the tokens.
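The token ids come from greedy longest-match-first splitting against the vocabulary, the standard WordPiece scheme. A minimal sketch of that splitting step for a single word, written in Python for illustration (the `##` continuation prefix follows the common BERT vocabulary convention; the function name and toy vocabulary are not part of this package):

```python
def wordpiece_word(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece split of a single word.

    Subwords after the first are looked up with a '##' continuation
    prefix, as in BERT-style vocabularies.
    """
    if len(word) > max_chars:
        return [unk_token]  # overlong words map to the unknown token
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate span until it matches a vocab entry.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no vocab entry covers this span
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = {"un": 0, "##aff": 1, "##able": 2, "likely": 3}
print(wordpiece_word("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_word("xyz", vocab))        # ['[UNK]']
```

Mapping each matched piece to its index in the vocabulary then yields the named integer vector described above.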
Examples:

tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)