| trainer_unigram | R Documentation |
Unigram tokenizer trainer
Super class: tok::tok_trainer -> tok_trainer_unigram
Method new(): Constructor for the Unigram tokenizer trainer.

Usage:
    trainer_unigram$new(
      vocab_size = 8000,
      show_progress = TRUE,
      special_tokens = NULL,
      shrinking_factor = 0.75,
      unk_token = NULL,
      max_piece_length = 16,
      n_sub_iterations = 2
    )
Arguments:
    vocab_size: The size of the final vocabulary, including all tokens and the alphabet.
    show_progress: Whether to show progress bars while training.
    special_tokens: A list of special tokens the model should be aware of.
    shrinking_factor: The shrinking factor used at each step of training to prune the vocabulary.
    unk_token: The token used for out-of-vocabulary tokens.
    max_piece_length: The maximum length of a given token.
    n_sub_iterations: The number of iterations of the EM algorithm to perform before pruning the vocabulary.
    initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. If a string contains more than one character, only the first is kept.
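A brief usage sketch of the constructor described above (not taken from the package's examples; it assumes the tok package is installed, and that `tokenizer` and `model_unigram` are available from tok as in the underlying Hugging Face tokenizers bindings):

```r
library(tok)

# Configure a Unigram trainer with a smaller vocabulary and explicit
# special tokens; remaining arguments keep the defaults listed above.
trainer <- trainer_unigram$new(
  vocab_size = 1000,
  special_tokens = c("[UNK]", "[PAD]"),
  unk_token = "[UNK]"
)

# The trainer is then passed to a tokenizer's training method, e.g.
# (hypothetical text data; method names assumed from the tok API):
# tk <- tokenizer$new(model_unigram$new())
# tk$train_from_memory(c("some training text", "more training text"), trainer)
```

A smaller vocab_size is convenient for quick experiments; for real corpora the default of 8000 (or larger) is more typical.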
Method clone(): The objects of this class are cloneable with this method.

Usage:
    trainer_unigram$clone(deep = FALSE)

Arguments:
    deep: Whether to make a deep clone.
See also other trainers: tok_trainer, trainer_bpe, trainer_wordpiece.