wordpiece_encode: Wordpiece encoding

View source: R/wordpiece.R

wordpiece_encode    R Documentation

Wordpiece encoding

Description

Wordpiece encoding, useful for BERT-style tokenisation. Experimental version mimicking the WordpieceTokenizer class from https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py

Usage

wordpiece_encode(
  x,
  vocabulary = character(),
  type = c("subwords", "ids"),
  unk_token = "[UNK]",
  max_input_chars_per_word = 100L
)

Arguments

x

a character vector with text which can be split on white space to obtain words

vocabulary

a character vector of the vocabulary

type

a character string, either 'subwords' or 'ids' to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.

unk_token

a character string: the token used for words which are not part of the vocabulary. Defaults to '[UNK]'.

max_input_chars_per_word

an integer. A word longer than this number of characters is mapped to the unknown token.

Value

a list with one element per element of x, containing the subword tokens (or, if type is 'ids', the corresponding ids of these subwords in the vocabulary)
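Details

Conceptually, wordpiece encoding segments each word by greedy longest-match-first lookup in the vocabulary, where continuation pieces carry a '##' prefix; if no piece matches, the whole word becomes the unknown token. A minimal Python sketch of that algorithm (illustrative only, not the package's R implementation):

```python
def wordpiece_encode(word, vocabulary, unk_token="[UNK]",
                     max_input_chars_per_word=100):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        current = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocabulary:
                current = piece
                break
            end -= 1
        if current is None:
            # No vocabulary entry matches any prefix: the whole word is unknown.
            return [unk_token]
        tokens.append(current)
        start = end
    return tokens

print(wordpiece_encode("unaffable", {"un", "##aff", "##able"}))
```

With the vocabulary c("un", "##aff", "##able"), "unaffable" splits into "un", "##aff", "##able", while "unaffableun" becomes the unknown token because no piece in the vocabulary covers the trailing "un" as a continuation.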

Examples

wordpiece_encode("unaffable", vocabulary = c("un", "##aff", "##able")) 
wordpiece_encode(x = c("unaffable", "unaffableun"), 
                 vocabulary = c("un", "##aff", "##able"))
wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), 
                 vocabulary = c("un", "##aff", "##able", "##un")) 
wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), 
                 vocabulary = c("un", "##aff", "##able", "##un"),
                 type = "ids") 

sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.