tokenize_text: Tokenize Text with Word Pieces

View source: R/tokenization.R

Tokenize Text with Word Pieces

Description

Given some text and a word piece vocabulary, tokenizes the text. This is primarily a tool for quickly checking the tokenization of a piece of text.

Usage

tokenize_text(
  text,
  ckpt_dir = NULL,
  vocab_file = find_vocab(ckpt_dir),
  include_special = TRUE
)

Arguments

text

Character vector; text to tokenize.

ckpt_dir

Character; path to checkpoint directory. If specified, any other checkpoint files required by this function (vocab_file, bert_config_file, or init_checkpoint) will default to standard filenames within ckpt_dir.

vocab_file

Character; path to vocabulary file. The file is assumed to be a text file with one token per line, with the line number corresponding to the index of that token in the vocabulary. (A sketch of this file format follows the argument list.)

include_special

Logical; whether to add the special tokens "[CLS]" (at the beginning) and "[SEP]" (at the end) to the token list.
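
The vocabulary file expected by vocab_file is plain text with one token per line. Below is a minimal sketch of writing such a file, purely to illustrate the format; the tokens shown are made up, and in practice you would point vocab_file at the vocab.txt shipped with a BERT checkpoint, or let find_vocab locate it via ckpt_dir.

# Toy vocabulary, for illustration of the file format only.
toy_vocab <- tempfile(fileext = ".txt")
writeLines(
  c("[PAD]", "[UNK]", "[CLS]", "[SEP]", "who", "likes", "taco", "##s", "?"),
  toy_vocab
)
# tokenize_text(text = "who likes tacos?", vocab_file = toy_vocab)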

Value

A list of character vectors (one per element of text), giving the tokenization of the input text.

Examples

## Not run: 
BERT_PRETRAINED_DIR <- download_BERT_checkpoint("bert_base_uncased")
tokens <- tokenize_text(
  text = c("Who doesn't like tacos?", "Not me!"),
  ckpt_dir = BERT_PRETRAINED_DIR
)
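
# A further sketch, not part of the original example: inspect the returned
# structure and compare with the special tokens omitted. The exact word
# pieces produced depend on the checkpoint's vocabulary.
str(tokens)
tokens_no_special <- tokenize_text(
  text = c("Who doesn't like tacos?", "Not me!"),
  ckpt_dir = BERT_PRETRAINED_DIR,
  include_special = FALSE
)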

## End(Not run)
