sentencepiece: Construct a Sentencepiece model
In sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling

sentencepiece

R Documentation

Construct a Sentencepiece model

Description

Construct a Sentencepiece model from plain-text training files.

Usage

sentencepiece(
  x,
  type = c("bpe", "char", "unigram", "word"),
  vocab_size = 8000,
  coverage = 0.9999,
  model_prefix = "sentencepiece",
  model_dir = tempdir(),
  threads = 1L,
  args,
  verbose = FALSE
)

Arguments

`x`	a character vector of path(s) to the text files containing training data
`type`	one of 'bpe', 'char', 'unigram' or 'word', for Byte Pair Encoding, character-level encoding, Unigram encoding or pretokenised word encoding respectively. Defaults to 'bpe' (Byte Pair Encoding).
`vocab_size`	integer indicating the number of tokens in the final vocabulary. Defaults to 8000.
`coverage`	fraction of characters covered by the model. Must be in the range [0, 1]. Defaults to 0.9999, which is a good value to use.
`model_prefix`	character string with the name of the model. Defaults to 'sentencepiece'. Running the function creates 2 files in the directory specified by `model_dir`: with the default prefix these are sentencepiece.model, containing the model, and sentencepiece.vocab, containing the vocabulary of the model. Provide a different `model_prefix` to change the name of the model and of these files.
`model_dir`	directory where the model will be saved. Defaults to the temporary directory (tempdir()).
`threads`	integer indicating the number of threads to use when building the model. Defaults to 1.
`args`	character string with arguments passed on to sentencepiece::SentencePieceTrainer::Train (for expert use only)
`verbose`	logical indicating whether to show progress of sentencepiece training. Defaults to `FALSE`.
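One way to read the `coverage` argument is as a cumulative share of character occurrences. The base R sketch below illustrates that reading on a toy input. `chars_needed_for_coverage()` is a hypothetical helper written for this illustration, not part of the sentencepiece package, and it assumes that coverage is measured over character occurrences in the training text; the trainer may differ in detail.

```r
# Hypothetical helper (not part of the sentencepiece package):
# smallest number of distinct characters whose occurrences make up
# at least `coverage` of all characters in `text`.
chars_needed_for_coverage <- function(text, coverage = 0.9999) {
  stopifnot(coverage >= 0, coverage <= 1)
  chars <- unlist(strsplit(text, split = ""))      # one element per character
  freq  <- sort(table(chars), decreasing = TRUE)   # most frequent first
  cum   <- cumsum(as.numeric(freq)) / sum(freq)    # cumulative share
  which(cum >= coverage)[1]
}

# Toy input: 8 x "a", 2 x "b", 1 x "c", so 11 characters in total.
toy <- c("aaaaaaaa", "bb", "c")
chars_needed_for_coverage(toy, coverage = 0.7)  # 1: "a" alone covers 8/11
chars_needed_for_coverage(toy, coverage = 0.9)  # 2: "a" and "b" cover 10/11
chars_needed_for_coverage(toy, coverage = 1)    # 3: every character is needed
```

Under this reading, a value just below 1 such as 0.9999 lets the rarest characters fall outside the covered set, while 1 keeps every character seen in the training data.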

Value

an object of class sentencepiece, as documented in sentencepiece_load_model

Examples

library(sentencepiece)
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
path   <- "traindata.txt"
folder <- getwd()

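## Write the training text to a file, as sentencepiece() expects file paths
writeLines(belgium_parliament$text, con = path)

## Train a character-level, a unigram and a BPE model in turn.
## All three use the default model_prefix, so each run overwrites
## sentencepiece.model and sentencepiece.vocab in `folder`,
## and `model` ends up holding the BPE model.
model <- sentencepiece(path, type = "char",
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "unigram", vocab_size = 20000,
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "bpe", vocab_size = 4000,
                       model_dir = folder, verbose = TRUE)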

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")

## Reload the last trained model from disk and encode again
model <- sentencepiece_load_model(file.path(folder, "sentencepiece.model"))
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
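The examples above all use the default `model_prefix`. As a further sketch, the code below trains with a custom prefix so that the files described under `model_prefix` get their own names and can be reloaded later. The prefix "parliament_bpe" and the use of tempdir() for both the training file and the model are choices made for this illustration only; the sketch assumes the sentencepiece and tokenizers.bpe packages are installed.

```r
library(sentencepiece)
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")

# Keep the training file and the model files out of the working directory.
folder <- tempdir()
path   <- file.path(folder, "traindata.txt")
writeLines(belgium_parliament$text, con = path)

# "parliament_bpe" is a made-up prefix for this sketch.
model <- sentencepiece(path, type = "bpe", vocab_size = 4000,
                       model_prefix = "parliament_bpe", model_dir = folder)

# The two output files are named after the prefix.
model_file <- file.path(folder, "parliament_bpe.model")
vocab_file <- file.path(folder, "parliament_bpe.vocab")
file.exists(c(model_file, vocab_file))

# Reload the model from its file and encode a sentence with it.
reloaded <- sentencepiece_load_model(model_file)
sentencepiece_encode(reloaded,
                     x = "De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
                     type = "subwords")
```

Giving each model its own prefix avoids the overwriting that happens when several models are trained into the same directory under the default name.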

sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.