tokenize_text: Tokenize Text with Word Pieces

View source: R/tokenization.R

Tokenize Text with Word Pieces

Description

Given some text and a word piece vocabulary, tokenizes the text. This is primarily a tool for quickly checking the tokenization of a piece of text.

Usage

tokenize_text(
  text,
  ckpt_dir = NULL,
  vocab_file = find_vocab(ckpt_dir),
  include_special = TRUE
)

Arguments

text

Character vector; text to tokenize.

ckpt_dir

Character; path to checkpoint directory. If specified, any other checkpoint files required by this function (vocab_file, bert_config_file, or init_checkpoint) will default to standard filenames within ckpt_dir.

vocab_file

Character; path to vocabulary file. The file is assumed to be a text file with one token per line, with the line number corresponding to the index of that token in the vocabulary. (A sketch of this file format follows the argument list.)

include_special

Logical; whether to add the special tokens "[CLS]" (at the beginning) and "[SEP]" (at the end) to the token list.
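
The vocabulary file expected by vocab_file is plain text with one token per line. Below is a minimal sketch of writing such a file, purely to illustrate the format; the tokens shown are made up, and in practice you would point vocab_file at the vocab.txt shipped with a BERT checkpoint, or let find_vocab locate it via ckpt_dir.

# Toy vocabulary, for illustration of the file format only.
toy_vocab <- tempfile(fileext = ".txt")
writeLines(
  c("[PAD]", "[UNK]", "[CLS]", "[SEP]", "who", "likes", "taco", "##s", "?"),
  toy_vocab
)
# tokenize_text(text = "who likes tacos?", vocab_file = toy_vocab)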

Value

A list of character vectors (one per element of text), giving the tokenization of the input text.

Examples

## Not run: 
BERT_PRETRAINED_DIR <- download_BERT_checkpoint("bert_base_uncased")
tokens <- tokenize_text(
  text = c("Who doesn't like tacos?", "Not me!"),
  ckpt_dir = BERT_PRETRAINED_DIR
)
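
# A further sketch, not part of the original example: inspect the returned
# structure and compare with the special tokens omitted. The exact word
# pieces produced depend on the checkpoint's vocabulary.
str(tokens)
tokens_no_special <- tokenize_text(
  text = c("Who doesn't like tacos?", "Not me!"),
  ckpt_dir = BERT_PRETRAINED_DIR,
  include_special = FALSE
)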

## End(Not run)
