tokenize_bert: Prepare Text for a BERT Model

View source: R/tokenize.R

tokenize_bert    R Documentation

Prepare Text for a BERT Model

Description

To be used in a BERT-style model, text must be tokenized. In addition, text is optionally preceded by a cls_token, and segments are ended with a sep_token. Finally, each example must be padded with a pad_token, or truncated if necessary (preserving the wrapper tokens). Many use cases require a matrix of tokens x examples, which can be extracted directly with the simplify argument.
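The pad-or-truncate step can be sketched in plain R. This is an illustrative sketch of the behavior described above, not the package's internal implementation; the function name pad_or_truncate is hypothetical.

```r
# Illustrative sketch: pad or truncate a tokenized example to n_tokens,
# preserving the trailing sep_token wrapper when truncating.
pad_or_truncate <- function(tokens, n_tokens, pad_token = "[PAD]",
                            sep_token = "[SEP]") {
  if (length(tokens) > n_tokens) {
    # Truncate, but keep the closing sep_token in place.
    tokens <- c(tokens[seq_len(n_tokens - 1L)], sep_token)
  } else if (length(tokens) < n_tokens) {
    # Pad out to the expected length.
    tokens <- c(tokens, rep(pad_token, n_tokens - length(tokens)))
  }
  tokens
}

pad_or_truncate(c("[CLS]", "the", "dog", "[SEP]"), 6L)
#> [1] "[CLS]" "the"   "dog"   "[SEP]" "[PAD]" "[PAD]"
```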

Usage

tokenize_bert(
  ...,
  n_tokens = 64L,
  increment_index = TRUE,
  pad_token = "[PAD]",
  cls_token = "[CLS]",
  sep_token = "[SEP]",
  tokenizer = wordpiece::wordpiece_tokenize,
  vocab = wordpiece.data::wordpiece_vocab(),
  tokenizer_options = NULL
)

Arguments

...

One or more character vectors or lists of character vectors. Currently we support a single character vector, two parallel character vectors, or a list of length-1 character vectors. If two vectors are supplied, they are combined pairwise and separated with sep_token.

n_tokens

Integer scalar; the number of tokens expected for each example.

increment_index

Logical; if TRUE, add 1L to all token ids to convert from the Python-inspired 0-indexed standard to the torch 1-indexed standard.
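The conversion is a simple shift of every id; the ids below are illustrative, not taken from a real vocabulary.

```r
# Illustrative: converting 0-indexed token ids (Python convention) to
# the 1-indexed standard used by torch for R.
python_style_ids <- c(101L, 1996L, 102L, 0L)  # 0 is a typical [PAD] id
torch_style_ids <- python_style_ids + 1L      # what increment_index = TRUE does
torch_style_ids  # 102 1997 103 1
```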

pad_token

Character scalar; the token to use for padding. Must be present in the supplied vocabulary.

cls_token

Character scalar; the token to use at the start of each example. Must be present in the supplied vocabulary, or NULL.

sep_token

Character scalar; the token to use at the end of each segment within each example. Must be present in the supplied vocabulary, or NULL.

tokenizer

The tokenizer function to use to break up the text. It must have a vocab argument.

vocab

The vocabulary to use to tokenize the text. This vocabulary must include the pad_token, cls_token, and sep_token.

tokenizer_options

A named list of additional arguments to pass on to the tokenizer.

Value

An object of class "bert_tokens", which is a list containing a matrix of token ids, a matrix of token type ids, and a matrix of token names.
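A sketch of the assumed shape of such an object, for two examples padded to n_tokens = 5. The component names and fill values here are illustrative assumptions, not real vocabulary ids; the rows are tokens and the columns are examples, per the Description.

```r
# Hypothetical "bert_tokens" object: three parallel tokens-x-examples matrices.
result <- structure(
  list(
    token_ids      = matrix(1:10, nrow = 5, ncol = 2),   # integer ids
    token_type_ids = matrix(1L,   nrow = 5, ncol = 2),   # segment markers
    token_names    = matrix("[PAD]", nrow = 5, ncol = 2) # token strings
  ),
  class = "bert_tokens"
)

dim(result$token_ids)
#> [1] 5 2
```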

Examples

tokenize_bert(
  c("The first premise.", "The second premise."),
  c("The first hypothesis.", "The second hypothesis.")
)

macmillancontentscience/torchtransformers documentation built on Aug. 6, 2023, 5:35 a.m.