| tokenize_bert | R Documentation |
Description:

To be used in a BERT-style model, text must be tokenized. In addition, text
is optionally preceded by a cls_token, and segments are ended with a
sep_token. Finally, each example must be padded with a pad_token, or
truncated if necessary (preserving the wrapper tokens). Many use cases call
for a matrix of tokens x examples, which can be extracted directly with the
simplify argument.
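The behavior described above can be sketched with a minimal call (this assumes tokenize_bert() and its default wordpiece tokenizer and vocabulary are installed and attached; output is not shown because it depends on the vocabulary):

```r
# One short example, padded out to 8 tokens. The result keeps the
# "[CLS]" ... "[SEP]" wrapper tokens and fills the remainder with "[PAD]".
tokens <- tokenize_bert("A tiny example.", n_tokens = 8L)

# The return value is a list of parallel matrices of token ids,
# token type ids, and token names; inspect its pieces with str().
str(tokens)
```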
Usage:

tokenize_bert(
  ...,
  n_tokens = 64L,
  increment_index = TRUE,
  pad_token = "[PAD]",
  cls_token = "[CLS]",
  sep_token = "[SEP]",
  tokenizer = wordpiece::wordpiece_tokenize,
  vocab = wordpiece.data::wordpiece_vocab(),
  tokenizer_options = NULL
)
Arguments:

...
    One or more character vectors or lists of character vectors.
    Currently we support a single character vector, two parallel
    character vectors, or a list of length-1 character vectors. If two
    vectors are supplied, they are combined pairwise and separated with
    the sep_token.

n_tokens
    Integer scalar; the number of tokens expected for each example.

increment_index
    Logical; if TRUE, add 1L to all token ids to convert from the
    Python-inspired 0-indexed standard to the torch 1-indexed standard.

pad_token
    Character scalar; the token to use for padding. Must be present in
    the supplied vocabulary.

cls_token
    Character scalar; the token to use at the start of each example.
    Must be present in the supplied vocabulary, or NULL.

sep_token
    Character scalar; the token to use at the end of each segment
    within each example. Must be present in the supplied vocabulary, or
    NULL.

tokenizer
    The tokenizer function to use to break up the text. It must have a
    vocab argument.

vocab
    The vocabulary to use to tokenize the text. This vocabulary must
    include the pad_token, cls_token, and sep_token.

tokenizer_options
    A named list of additional arguments to pass on to the tokenizer.
Value:

An object of class "bert_tokens", which is a list containing a matrix of
token ids, a matrix of token type ids, and a matrix of token names.
Examples:

tokenize_bert(
  c("The first premise.", "The second premise."),
  c("The first hypothesis.", "The second hypothesis.")
)
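When two parallel character vectors are supplied, as in the example above, each pair is combined into a single two-segment example. A sketch of inspecting the result (assuming the same default tokenizer and vocabulary; the exact element names are not confirmed by this page):

```r
result <- tokenize_bert(
  c("The first premise.", "The second premise."),
  c("The first hypothesis.", "The second hypothesis."),
  n_tokens = 32L
)

# Each matrix holds one example per column of the tokens x examples
# layout; the token type ids distinguish tokens belonging to the first
# segment from those belonging to the second.
str(result)
```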