dot-tokenize_bert_single: Tokenize a single vector of text

.tokenize_bert_single    R Documentation

Tokenize a single vector of text

Description

Tokenize a single vector of text

Usage

.tokenize_bert_single(
  text,
  n_tokens = 64L,
  increment_index = TRUE,
  pad_token = "[PAD]",
  cls_token = "[CLS]",
  sep_token = "[SEP]",
  tokenizer = wordpiece::wordpiece_tokenize,
  vocab = wordpiece.data::wordpiece_vocab(),
  tokenizer_options = NULL
)

Arguments

text

A character vector, or a list of length-1 character vectors.

n_tokens

Integer scalar; the number of tokens expected for each example.

increment_index

Logical; if TRUE, add 1L to all token ids to convert from the Python-inspired 0-indexed standard to the torch 1-indexed standard.

pad_token

Character scalar; the token to use for padding. Must be present in the supplied vocabulary.

cls_token

Character scalar or NULL; the token to use at the start of each example. If not NULL, it must be present in the supplied vocabulary.

sep_token

Character scalar or NULL; the token to use at the end of each segment within each example. If not NULL, it must be present in the supplied vocabulary.

tokenizer

The tokenizer function to use to break up the text. It must accept a vocab argument.

vocab

The vocabulary to use to tokenize the text. This vocabulary must include the pad_token, as well as the cls_token and sep_token when those are not NULL.

tokenizer_options

A named list of additional arguments to pass on to the tokenizer.

Value

An object of class "bert_tokens", which is a list containing a matrix of token ids, a matrix of token type ids, and a matrix of token names.
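Examples

A minimal sketch of calling this helper. Because the function is internal (dot-prefixed and not exported), it is accessed here via `:::`; this assumes the torchtransformers, wordpiece, and wordpiece.data packages are installed, and the exact token-id values shown by str() will depend on the wordpiece vocabulary.

```r
## Tokenize two short examples, padding/truncating each to 16 tokens.
## The defaults add [CLS] at the start and [SEP] at the end of each
## example, and shift token ids to torch's 1-indexed convention.
tokens <- torchtransformers:::.tokenize_bert_single(
  c("The dog chased the cat.", "Dogs are the best."),
  n_tokens = 16L
)

class(tokens)  # "bert_tokens"
str(tokens)    # a list of three matrices: token ids, token type ids,
               # and token names, one row (or column) per example
```

The returned matrices share the same dimensions, so the token names matrix can be inspected to verify how each example was split, padded, and wrapped in special tokens.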


macmillancontentscience/torchtransformers documentation built on Aug. 6, 2023, 5:35 a.m.