trainer_wordpiece {tok} | R Documentation
WordPiece tokenizer trainer
Super class: tok::tok_trainer -> tok_trainer_wordpiece
new()
Constructor for the WordPiece tokenizer trainer
trainer_wordpiece$new(
  vocab_size = 30000,
  min_frequency = 0,
  show_progress = FALSE,
  special_tokens = NULL,
  limit_alphabet = NULL,
  initial_alphabet = NULL,
  continuing_subword_prefix = "##",
  end_of_word_suffix = NULL
)
vocab_size
The size of the final vocabulary, including all tokens and the alphabet. Default: 30000.
min_frequency
The minimum frequency a pair must have in order to be merged. Default: 0.
show_progress
Whether to show progress bars while training. Default: FALSE.
special_tokens
A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet
The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. If a string contains more than one character, only the first character is kept. Default: NULL.
continuing_subword_prefix
A prefix to be used for every subword that is not a beginning-of-word. Default: "##".
end_of_word_suffix
A suffix to be used for every subword that is an end-of-word. Default: NULL.
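Below is a minimal sketch of putting these arguments together: it constructs a trainer_wordpiece with a reduced vocabulary and BERT-style special tokens, using only the constructor documented above. The tokenizer and model_wordpiece objects mentioned in the trailing comments are assumptions about the surrounding tok API, not something documented on this page.

library(tok)

# Build a WordPiece trainer with a small vocabulary and BERT-style
# special tokens; every argument used here is documented above.
trainer <- trainer_wordpiece$new(
  vocab_size = 5000,
  min_frequency = 2,
  show_progress = TRUE,
  special_tokens = c("[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"),
  continuing_subword_prefix = "##"
)

# The trainer is then passed to a tokenizer's training method, e.g.
# (names assumed, not documented on this page):
#   tok <- tokenizer$new(model_wordpiece$new(unk_token = "[UNK]"))
#   tok$train(files = "corpus.txt", trainer = trainer)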
clone()
The objects of this class are cloneable with this method.
trainer_wordpiece$clone(deep = FALSE)
deep
Whether to make a deep clone.
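For instance, a brief sketch of standard R6 cloning as inherited by this class: with deep = TRUE, nested R6 objects are copied as well rather than shared.

library(tok)

trainer <- trainer_wordpiece$new(vocab_size = 1000)

# An independent deep copy; modifying it leaves the original trainer
# untouched (standard R6 behaviour).
trainer_copy <- trainer$clone(deep = TRUE)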
Other trainers: tok_trainer, trainer_bpe, trainer_unigram