| .tokenize_bert_single | R Documentation |
Tokenize a single vector of text
.tokenize_bert_single(
text,
n_tokens = 64L,
increment_index = TRUE,
pad_token = "[PAD]",
cls_token = "[CLS]",
sep_token = "[SEP]",
tokenizer = wordpiece::wordpiece_tokenize,
vocab = wordpiece.data::wordpiece_vocab(),
tokenizer_options = NULL
)
text |
A character vector, or a list of length-1 character vectors. |
n_tokens |
Integer scalar; the number of tokens expected for each example. |
increment_index |
Logical; if TRUE, add 1L to all token ids to convert from the Python-inspired 0-indexed standard to the torch 1-indexed standard. |
pad_token |
Character scalar; the token to use for padding. Must be present in the supplied vocabulary. |
cls_token |
Character scalar; the token to use at the start of each
example. Must be present in the supplied vocabulary. |
sep_token |
Character scalar; the token to use at the end of each
segment within each example. Must be present in the supplied vocabulary. |
tokenizer |
The tokenizer function to use to break up the text. It must
accept the text as its first argument and the vocabulary as its second
argument, as the default wordpiece::wordpiece_tokenize does; any further
arguments are passed via tokenizer_options. |
vocab |
The vocabulary to use to tokenize the text. This vocabulary must
include the pad_token, cls_token, and sep_token. |
tokenizer_options |
A named list of additional arguments to pass on to the tokenizer. |
An object of class "bert_tokens", which is a list containing a matrix of token ids, a matrix of token type ids, and a matrix of token names.
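A minimal sketch of calling this function. Note that .tokenize_bert_single is internal (unexported), so it is reached here with `:::`; this assumes the torchtransformers, wordpiece, and wordpiece.data packages are installed, and the element names inspected via str() are not guaranteed by this page.

```r
# Tokenize two short examples, padding/truncating each to 8 tokens.
# .tokenize_bert_single is internal, hence the ::: access (assumption:
# the function lives in the torchtransformers namespace).
tokens <- torchtransformers:::.tokenize_bert_single(
  text = c("The first example.", "The second example."),
  n_tokens = 8L
)

# The result is a "bert_tokens" list of three matrices (token ids,
# token type ids, token names), one row per input example and one
# column per token position.
str(tokens)
```

Because increment_index defaults to TRUE, the returned token ids are 1-indexed for use with torch; set increment_index = FALSE to keep the 0-indexed ids used by Python BERT implementations.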