tokenize_bert: Prepare Text for a BERT Model

View source: R/tokenize.R

tokenize_bert    R Documentation

Prepare Text for a BERT Model

Description

To be used in a BERT-style model, text must be tokenized. In addition, text is optionally preceded by a cls_token, and segments are ended with a sep_token. Finally, each example must be padded with a pad_token, or truncated if necessary (preserving the wrapper tokens). Many use cases require a matrix of tokens x examples, which can be extracted directly with the simplify argument.
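The pad-or-truncate step can be sketched in plain R. This is an illustrative sketch of the behavior described above, not the package's internal implementation; the function name pad_or_truncate is hypothetical.

```r
# Illustrative sketch: pad or truncate a tokenized example to n_tokens,
# preserving the trailing sep_token wrapper when truncating.
pad_or_truncate <- function(tokens, n_tokens, pad_token = "[PAD]",
                            sep_token = "[SEP]") {
  if (length(tokens) > n_tokens) {
    # Truncate, but keep the closing sep_token in place.
    tokens <- c(tokens[seq_len(n_tokens - 1L)], sep_token)
  } else if (length(tokens) < n_tokens) {
    # Pad out to the expected length.
    tokens <- c(tokens, rep(pad_token, n_tokens - length(tokens)))
  }
  tokens
}

pad_or_truncate(c("[CLS]", "the", "dog", "[SEP]"), 6L)
#> [1] "[CLS]" "the"   "dog"   "[SEP]" "[PAD]" "[PAD]"
```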

Usage

tokenize_bert(
  ...,
  n_tokens = 64L,
  increment_index = TRUE,
  pad_token = "[PAD]",
  cls_token = "[CLS]",
  sep_token = "[SEP]",
  tokenizer = wordpiece::wordpiece_tokenize,
  vocab = wordpiece.data::wordpiece_vocab(),
  tokenizer_options = NULL
)

Arguments

...

One or more character vectors or lists of character vectors. Currently we support a single character vector, two parallel character vectors, or a list of length-1 character vectors. If two vectors are supplied, they are combined pairwise and separated with sep_token.

n_tokens

Integer scalar; the number of tokens expected for each example.

increment_index

Logical; if TRUE, add 1L to all token ids to convert from the Python-inspired 0-indexed standard to the torch 1-indexed standard.
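The conversion is a simple shift of every id; the ids below are illustrative, not taken from a real vocabulary.

```r
# Illustrative: converting 0-indexed token ids (Python convention) to
# the 1-indexed standard used by torch for R.
python_style_ids <- c(101L, 1996L, 102L, 0L)  # 0 is a typical [PAD] id
torch_style_ids <- python_style_ids + 1L      # what increment_index = TRUE does
torch_style_ids  # 102 1997 103 1
```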

pad_token

Character scalar; the token to use for padding. Must be present in the supplied vocabulary.

cls_token

Character scalar; the token to use at the start of each example. Must be present in the supplied vocabulary, or NULL.

sep_token

Character scalar; the token to use at the end of each segment within each example. Must be present in the supplied vocabulary, or NULL.

tokenizer

The tokenizer function to use to break up the text. It must have a vocab argument.

vocab

The vocabulary to use to tokenize the text. This vocabulary must include the pad_token, cls_token, and sep_token.

tokenizer_options

A named list of additional arguments to pass on to the tokenizer.

Value

An object of class "bert_tokens", which is a list containing a matrix of token ids, a matrix of token type ids, and a matrix of token names.
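A sketch of the assumed shape of such an object, for two examples padded to n_tokens = 5. The component names and fill values here are illustrative assumptions, not real vocabulary ids; the rows are tokens and the columns are examples, per the Description.

```r
# Hypothetical "bert_tokens" object: three parallel tokens-x-examples matrices.
result <- structure(
  list(
    token_ids      = matrix(1:10, nrow = 5, ncol = 2),   # integer ids
    token_type_ids = matrix(1L,   nrow = 5, ncol = 2),   # segment markers
    token_names    = matrix("[PAD]", nrow = 5, ncol = 2) # token strings
  ),
  class = "bert_tokens"
)

dim(result$token_ids)
#> [1] 5 2
```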

Examples

tokenize_bert(
  c("The first premise.", "The second premise."),
  c("The first hypothesis.", "The second hypothesis.")
)

macmillancontentscience/torchtransformers documentation built on Aug. 6, 2023, 5:35 a.m.