trainer_unigram | R Documentation
Unigram tokenizer trainer
Trains a Unigram tokenization model, learning a subword vocabulary of the requested size from input text.
Super class: tok::tok_trainer -> tok_trainer_unigram
new()
Constructor for the Unigram tokenizer trainer. See the example after the argument list.

Usage:

trainer_unigram$new(
  vocab_size = 8000,
  show_progress = TRUE,
  special_tokens = NULL,
  shrinking_factor = 0.75,
  unk_token = NULL,
  max_piece_length = 16,
  n_sub_iterations = 2
)

Arguments:
vocab_size
The size of the final vocabulary, including all tokens and alphabet.
show_progress
Whether to show progress bars while training.
special_tokens
A list of special tokens the model should be aware of.
shrinking_factor
The shrinking factor used at each step of training to prune the vocabulary.
unk_token
The token used for out-of-vocabulary tokens.
max_piece_length
The maximum length of a given token.
n_sub_iterations
The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
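Example

A minimal sketch of how a configured trainer might feed a training run. The corpus path is hypothetical, and tokenizer$new(), model_unigram$new(), and $train() are assumptions about the package's companion tokenizer and model classes (mirroring the underlying Hugging Face tokenizers API); check their help pages for the exact interface.

## Not run:
library(tok)

# Configure the trainer: a 16000-token vocabulary, a few special tokens
# (passed here as a character vector), and an explicit unknown token for
# out-of-vocabulary pieces.
trainer <- trainer_unigram$new(
  vocab_size = 16000,
  special_tokens = c("[UNK]", "[CLS]", "[SEP]", "[PAD]"),
  unk_token = "[UNK]"
)

# Assumed usage: wrap an empty Unigram model in a tokenizer and train it on a
# plain-text corpus (hypothetical file path).
tk <- tokenizer$new(model_unigram$new())
tk$train("corpus.txt", trainer)
## End(Not run)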
clone()
The objects of this class are cloneable with this method.

Usage:

trainer_unigram$clone(deep = FALSE)

Arguments:
deep
Whether to make a deep clone.
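A short sketch of cloning, which is standard R6 behaviour rather than anything specific to this class: clone() returns a copy of the trainer object, and deep = TRUE additionally deep-copies any nested R6 objects it holds.

## Not run:
trainer <- trainer_unigram$new(vocab_size = 8000)
# Shallow copy of the configured trainer object.
trainer_copy <- trainer$clone(deep = FALSE)
## End(Not run)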
Other trainer: tok_trainer, trainer_bpe, trainer_wordpiece