wordpiece_tokenize: Tokenize Sequence with Word Pieces

View source: R/tokenization.R

wordpiece_tokenize {wordpiece}    R Documentation

Tokenize Sequence with Word Pieces

Description

Given a sequence of text and a wordpiece vocabulary, tokenizes the text.

Usage

wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character vector; text to tokenize.

vocab

Character vector of vocabulary tokens, assumed to be ordered by index, with the first index taken as zero for compatibility with Python implementations (see the sketch after this argument list).

unk_token

Token to represent unknown words.

max_chars

Integer; maximum length (in characters) of a word to be tokenized. Longer words are mapped to unk_token.
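
The zero-based indexing means the first vocabulary entry receives id 0. A minimal sketch, assuming a plain character vector is accepted for vocab as documented (the toy vocabulary and the expected ids are illustrative, not package output):

toy_vocab <- c("[UNK]", "ta", "##cos")   # "[UNK]" receives id 0
wordpiece_tokenize("tacos", vocab = toy_vocab)
# Under the zero-based convention this should yield: ta = 1, ##cos = 2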

Value

A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.

Examples

tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
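
Each element of the result is a named integer vector, as described under Value, so the pieces and their ids can be inspected with base R:

str(tokens)             # one named integer vector per input string
names(tokens[[1]])      # the wordpiece tokens for the first input
unname(tokens[[1]])     # the corresponding integer token ids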
