View source: R/sentencepiece.R
sentencepiece | R Documentation |
Construct a Sentencepiece model on text.
sentencepiece(
  x,
  type = c("bpe", "char", "unigram", "word"),
  vocab_size = 8000,
  coverage = 0.9999,
  model_prefix = "sentencepiece",
  model_dir = tempdir(),
  threads = 1L,
  args,
  verbose = FALSE
)
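The vector default for type follows a common R idiom in which the first element acts as the effective default, which agrees with the documented default of 'bpe'. The following is a minimal base R sketch of that idiom. It assumes the argument is resolved with match.arg(), which this page does not confirm, and pick_type is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper illustrating the match.arg() idiom; not part of the package.
pick_type <- function(type = c("bpe", "char", "unigram", "word")) {
  # When the caller supplies nothing, match.arg() returns the first choice.
  # A value outside the listed choices raises an error.
  match.arg(type)
}

pick_type()           # "bpe"
pick_type("unigram")  # "unigram"
```

Under this idiom, calling the function without type and calling it with type = "bpe" are equivalent.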
x |
a character vector of path(s) to the text files containing training data |
type |
one of 'bpe', 'char', 'unigram' or 'word', for Byte Pair Encoding, character-level encoding, Unigram encoding or pretokenised word encoding respectively. Defaults to 'bpe' (Byte Pair Encoding). |
vocab_size |
integer indicating the number of tokens in the final vocabulary. Defaults to 8000. |
coverage |
fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999. |
model_prefix |
character string with the name of the model. Defaults to 'sentencepiece'.
When the function is executed, 2 files are created in the directory specified by model_dir. |
model_dir |
directory where the model will be saved. Defaults to the temporary directory (tempdir()) |
threads |
integer indicating the number of threads to use when building the model. Defaults to 1. |
args |
character string with arguments passed on to sentencepiece::SentencePieceTrainer::Train (for expert use only) |
verbose |
logical indicating whether to show progress of sentencepiece training. Defaults to FALSE. |
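To make the coverage argument more concrete, one way to read it is as the share of all character occurrences in the training text that belong to characters the model keeps. The base R sketch below computes, for a toy corpus, how many of the most frequent distinct characters are needed to reach a target coverage. It illustrates that reading only and is not the routine the trainer uses; chars_needed and the toy text are hypothetical.

```r
# Hypothetical helper: number of most-frequent distinct characters needed
# so that they account for at least `coverage` of all character occurrences.
chars_needed <- function(text, coverage = 0.9999) {
  stopifnot(coverage >= 0, coverage <= 1)
  chars <- unlist(strsplit(text, split = ""))
  freq <- sort(table(chars), decreasing = TRUE)
  cumulative <- cumsum(as.numeric(freq)) / length(chars)
  # First position at which the cumulative share reaches the target
  # (small tolerance guards against floating-point rounding).
  which(cumulative >= coverage - 1e-12)[1]
}

toy <- c("aaaaaaaa", "ab")          # 9 x "a", 1 x "b"
chars_needed(toy, coverage = 0.9)   # 1: "a" alone covers 9 of 10 characters
chars_needed(toy, coverage = 1)     # 2: both characters are needed
```

On real text with a long tail of rare characters, a value just below 1 such as 0.9999 leaves out only the rarest characters.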
an object of class sentencepiece, which is defined at sentencepiece_load_model
sentencepiece_load_model
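The examples on this page load the trained model from file.path(folder, "sentencepiece.model"), which suggests the model file is named after model_prefix and placed inside model_dir. A small base R sketch of building that path from the two arguments follows. model_path is a hypothetical helper, and the .model extension is taken from the example rather than from a documented rule.

```r
# Hypothetical helper: path of the model file for a given prefix and directory.
# The ".model" extension is assumed from the example on this page.
model_path <- function(model_prefix = "sentencepiece", model_dir = tempdir()) {
  file.path(model_dir, paste0(model_prefix, ".model"))
}

path <- model_path(model_dir = getwd())
basename(path)      # "sentencepiece.model"
file.exists(path)   # TRUE only after a model has been trained into that directory
```

Because model_dir defaults to tempdir(), a model trained with the defaults does not persist beyond the R session; passing an explicit directory keeps it available for later loading.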
library(sentencepiece)
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")

# Write the training text to a file; the model files are saved in `folder`
path <- "traindata.txt"
folder <- getwd()
writeLines(belgium_parliament$text, con = path)

# Train models of different types; each call overwrites `model`
model <- sentencepiece(path, type = "char", model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "unigram", vocab_size = 20000, model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "bpe", vocab_size = 4000, model_dir = folder, verbose = TRUE)

# Encode text with the last trained (BPE) model
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")

# Reload the saved model from disk and encode again
model <- sentencepiece_load_model(file.path(folder, "sentencepiece.model"))
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")