wordpiece_tokenize: Tokenize Sequence with Word Pieces

View source: R/tokenization.R

wordpiece_tokenize {wordpiece}    R Documentation

Tokenize Sequence with Word Pieces

Description

Given a sequence of text and a wordpiece vocabulary, tokenizes the text.

Usage

wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character vector; text to tokenize.

vocab

Character vector of vocabulary tokens, assumed to be ordered by index, with the first index taken as zero for compatibility with Python implementations (see the sketch after this argument list).

unk_token

Token to represent unknown words.

max_chars

Integer; maximum length (in characters) of a word to be tokenized. Longer words are mapped to unk_token.
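
The zero-based indexing means the first vocabulary entry receives id 0. A minimal sketch, assuming a plain character vector is accepted for vocab as documented (the toy vocabulary and the expected ids are illustrative, not package output):

toy_vocab <- c("[UNK]", "ta", "##cos")   # "[UNK]" receives id 0
wordpiece_tokenize("tacos", vocab = toy_vocab)
# Under the zero-based convention this should yield: ta = 1, ##cos = 2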

Value

A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.

Examples

tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
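
Each element of the result is a named integer vector, as described under Value, so the pieces and their ids can be inspected with base R:

str(tokens)             # one named integer vector per input string
names(tokens[[1]])      # the wordpiece tokens for the first input
unname(tokens[[1]])     # the corresponding integer token ids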
