tokenizer: R Documentation
A Tokenizer works as a pipeline. It processes some raw text as input and outputs an encoding.
A tokenizer that can be used for encoding character strings or decoding integers.
.tokenizer: (unsafe usage) Low-level pointer to the underlying tokenizer.
pre_tokenizer: Instance of the pre-tokenizer.
normalizer: Gets the normalizer instance.
post_processor: Gets the post-processor used by the tokenizer.
decoder: Gets and sets the decoder.
padding: Gets the padding configuration.
truncation: Gets the truncation configuration.
new(): Initializes a tokenizer.
tokenizer$new(tokenizer)
tokenizer: Will be cloned to initialize a new tokenizer.
encode(): Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
tokenizer$encode( sequence, pair = NULL, is_pretokenized = FALSE, add_special_tokens = TRUE )
sequence: The main input sequence to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
pair: An optional input sequence, in the same format as sequence.
is_pretokenized: Whether the input is already pre-tokenized.
add_special_tokens: Whether to add the special tokens.
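A minimal sketch of encode() in use, assuming network access to download the "gpt2" tokenizer via from_pretrained() (as in the Examples section), and that the returned encoding exposes its integer ids through an ids field:

```r
tok <- tokenizer$from_pretrained("gpt2")

# Encode a single raw-text sequence
enc <- tok$encode("Hello world")
enc$ids

# Encode a sequence pair without adding special tokens
enc_pair <- tok$encode("Hello", pair = "world", add_special_tokens = FALSE)
enc_pair$ids
```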
decode(): Decode the given list of ids back to a string.
tokenizer$decode(ids, skip_special_tokens = TRUE)
ids: The list of ids that we want to decode.
skip_special_tokens: Whether the special tokens should be removed from the decoded string.
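A hedged round-trip sketch, assuming network access to fetch the "gpt2" tokenizer and an ids field on the returned encoding:

```r
tok <- tokenizer$from_pretrained("gpt2")
ids <- tok$encode("Hello world")$ids

tok$decode(ids)                               # typically recovers "Hello world"
tok$decode(ids, skip_special_tokens = FALSE)  # keeps any special tokens
```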
encode_batch(): Encodes a batch of sequences. Returns a list of encodings.
tokenizer$encode_batch( input, is_pretokenized = FALSE, add_special_tokens = TRUE )
input: A list of single sequences or sequence pairs to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
is_pretokenized: Whether the input is already pre-tokenized.
add_special_tokens: Whether to add the special tokens.
decode_batch(): Decode a batch of ids back to their corresponding strings.
tokenizer$decode_batch(sequences, skip_special_tokens = TRUE)
sequences: The batch of sequences we want to decode.
skip_special_tokens: Whether the special tokens should be removed from the decoded strings.
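The batch methods mirror their single-sequence counterparts. A sketch, assuming network access for from_pretrained() and an ids field on each returned encoding:

```r
tok <- tokenizer$from_pretrained("gpt2")

encs <- tok$encode_batch(list("first sentence", "second sentence"))
id_batches <- lapply(encs, function(e) e$ids)  # one integer vector per input
tok$decode_batch(id_batches)                   # one string per input
```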
from_file(): Creates a tokenizer from the path of a serialized tokenizer. This is a static method and should be called instead of $new when initializing the tokenizer.
tokenizer$from_file(path)
path: Path to a tokenizer.json file.
from_pretrained(): Instantiate a new tokenizer from an existing file on the Hugging Face Hub.
tokenizer$from_pretrained(identifier, revision = "main", auth_token = NULL)
identifier: The identifier of a model on the Hugging Face Hub that contains a tokenizer.json file.
revision: A branch or commit id.
auth_token: An optional authentication token used to access private repositories on the Hugging Face Hub.
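A sketch of from_pretrained(), assuming network access; the revision shown is illustrative, and auth_token is only needed for private repositories:

```r
# Public model: no token needed
tok <- tokenizer$from_pretrained("gpt2")

# Pin a specific revision; pass auth_token = "..." for a private repository
tok_pinned <- tokenizer$from_pretrained("gpt2", revision = "main")
```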
train(): Train the tokenizer using the given files. The files are read line by line, keeping all whitespace, including new lines.
tokenizer$train(files, trainer)
files: A character vector of file paths.
trainer: An instance of a trainer object, specific to that tokenizer type.
train_from_memory(): Train the tokenizer on a character vector of texts.
tokenizer$train_from_memory(texts, trainer)
texts: A character vector of texts.
trainer: An instance of a trainer object, specific to that tokenizer type.
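A training sketch. Neither the trainer constructor nor the construction of an untrained tokenizer is documented on this page, so trainer_bpe$new() and tok below are hypothetical placeholders; substitute the trainer class and tokenizer instance matching your tokenizer type:

```r
# NOTE: trainer_bpe$new() is a hypothetical placeholder for a trainer
# constructor, and tok stands for an existing (untrained) tokenizer instance.
trainer <- trainer_bpe$new()

texts <- c("the quick brown fox", "jumps over the lazy dog")
tok$train_from_memory(texts, trainer)

# Equivalently, train from files on disk:
# tok$train(c("corpus_part1.txt", "corpus_part2.txt"), trainer)
```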
save(): Saves the tokenizer to a JSON file.
tokenizer$save(path, pretty = TRUE)
path: A path to a file in which to save the serialized tokenizer.
pretty: Whether the JSON file should be pretty-printed.
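save() and the static from_file() compose into a serialization round trip. A sketch, assuming network access for the initial download:

```r
tok <- tokenizer$from_pretrained("gpt2")

path <- tempfile(fileext = ".json")
tok$save(path, pretty = FALSE)  # compact JSON

# from_file() is static: call it on the class, not on an instance
tok_reloaded <- tokenizer$from_file(path)
```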
enable_padding(): Enables padding for the tokenizer.
tokenizer$enable_padding( direction = "right", pad_id = 0L, pad_type_id = 0L, pad_token = "[PAD]", length = NULL, pad_to_multiple_of = NULL )
direction: (str, optional, defaults to "right") The direction in which to pad. Can be either "right" or "left".
pad_id: (int, defaults to 0) The id to be used when padding.
pad_type_id: (int, defaults to 0) The type id to be used when padding.
pad_token: (str, defaults to "[PAD]") The pad token to be used when padding.
length: (int, optional) If specified, the length at which to pad. If not specified, pads to the length of the longest sequence in the batch.
pad_to_multiple_of: (int, optional) If specified, the padding length always snaps to the next multiple of the given value. For example, if we would otherwise pad to a length of 250 but pad_to_multiple_of = 8, we pad to 256.
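A padding sketch, assuming network access and an ids field on each returned encoding; the pad token and id shown are illustrative:

```r
tok <- tokenizer$from_pretrained("gpt2")

tok$enable_padding(pad_token = "[PAD]", pad_id = 0L, pad_to_multiple_of = 8L)
encs <- tok$encode_batch(list("short", "a noticeably longer input sequence"))

# With padding enabled, all encodings in the batch should share one length,
# snapped to a multiple of 8
sapply(encs, function(e) length(e$ids))

tok$no_padding()  # restore the default (unpadded) behavior
```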
no_padding(): Disables padding.
tokenizer$no_padding()
enable_truncation(): Enables truncation on the tokenizer.
tokenizer$enable_truncation( max_length, stride = 0, strategy = "longest_first", direction = "right" )
max_length: The maximum length at which to truncate.
stride: The length of the previous first sequence to be included in the overflowing sequence. Default: 0.
strategy: The strategy used for truncation. Can be one of: "longest_first", "only_first", or "only_second". Default: "longest_first".
direction: The truncation direction. Default: "right".
no_truncation(): Disables truncation.
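A truncation sketch, assuming network access and an ids field on the returned encoding:

```r
tok <- tokenizer$from_pretrained("gpt2")

tok$enable_truncation(max_length = 5)
enc <- tok$encode("a fairly long sentence that should get cut off")
length(enc$ids)  # at most 5 once truncation is enabled

tok$no_truncation()  # restore the default (untruncated) behavior
```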
tokenizer$no_truncation()
get_vocab_size(): Gets the vocabulary size.
tokenizer$get_vocab_size(with_added_tokens = TRUE)
with_added_tokens: Whether to count added tokens.
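A sketch contrasting the two vocabulary counts, assuming network access:

```r
tok <- tokenizer$from_pretrained("gpt2")

tok$get_vocab_size()                           # counts added tokens too
tok$get_vocab_size(with_added_tokens = FALSE)  # base vocabulary only
```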
clone(): The objects of this class are cloneable with this method.
tokenizer$clone(deep = FALSE)
deep: Whether to make a deep clone.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    tok$encode("Hello world")$ids
  })
})