sentencepiece_download_model: Download a Sentencepiece model

View source: R/bpemb.R

sentencepiece_download_model    R Documentation

Download a Sentencepiece model

Description

Download pretrained models built on Wikipedia, made available at https://bpemb.h-its.org through https://github.com/bheinzerling/bpemb. Each download contains a Byte Pair Encoding model trained with sentencepiece and, optionally, Glove embeddings of the resulting Byte Pair subwords. Models are available for 275 languages.

Usage

sentencepiece_download_model(
  language,
  vocab_size,
  dim,
  model_dir = system.file(package = "sentencepiece", "models")
)

Arguments

language

a character string with the language name. This can be either a plain language name (e.g. "Dutch") or a Wikipedia language code (e.g. "nl").
Possible values can be found by looking at the examples or by typing sentencepiece:::.bpemb$languages
If you provide "multi", the multilingual model available at https://bpemb.h-its.org/multi/ is downloaded.

vocab_size

integer indicating the number of tokens in the final vocabulary. Defaults to 5000. Possible values depend on the language. To inspect them, type sentencepiece:::.bpemb$vocab_sizes and look up the language of your choice.

dim

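dimension of the Glove embedding: one of 25, 50, 100, 200 or 300. If this argument is not provided, only the Sentencepiece model is downloaded and no embeddings are fetched.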

model_dir

path to the location where the model will be downloaded to. Defaults to system.file(package = "sentencepiece", "models").
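
The valid values for language and vocab_size can be inspected before downloading anything. The lines below are a minimal sketch: .bpemb is an internal, non-exported object whose exact structure is not described on this page, so the sketch only prints a compact overview with str() and makes no assumption about its layout.

```r
library(sentencepiece)

# Internal (non-exported) lookup object mentioned in the argument descriptions.
# Its structure is not documented here, so we only print a compact overview.
str(sentencepiece:::.bpemb$languages)
str(sentencepiece:::.bpemb$vocab_sizes)
```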

Value

a list with elements

  • language: the provided language

  • wikicode: the wikipedia code of the provided language

  • file_model: the path to the downloaded Sentencepiece model

  • url: the url where the Sentencepiece model was fetched from

  • download_failed: logical, indicating if the download failed

  • download_message: a character string with possible download failure information

  • glove: a list with elements file_model, url, download_failed and download_message describing the download of the Glove embeddings in txt format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.

  • glove.bin: a list with elements file_model, url, download_failed and download_message describing the download of the Glove embeddings in bin format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.
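
Because a failed download is reported through download_failed rather than necessarily through an error, it can be useful to check the returned list before loading a model. The following is a minimal sketch under that assumption; the helper name downloaded_files is hypothetical and not part of the package, and it relies only on the list elements described above.

```r
# Hypothetical helper (not part of sentencepiece): collect the paths of the
# files that were fetched successfully, using the documented list elements.
downloaded_files <- function(dl) {
  if (isTRUE(dl$download_failed)) {
    stop("Model download failed: ", dl$download_message)
  }
  files <- c(model = dl$file_model)
  # glove and glove.bin are only present when the dim argument was supplied.
  # [[ ]] matches names exactly, so "glove" is never confused with "glove.bin".
  for (name in c("glove", "glove.bin")) {
    part <- dl[[name]]
    if (!is.null(part) && !isTRUE(part$download_failed)) {
      files[[name]] <- part$file_model
    }
  }
  files
}
```

The helper can be applied to the result of any call, for instance downloaded_files(sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = tempdir())), and the resulting model path can then be passed to sentencepiece_load_model.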

See Also

sentencepiece_load_model

Examples

library(sentencepiece)

## Directory in which the downloaded models are stored
path <- getwd()

##
## Download only the tokeniser model
##
dl <- sentencepiece_download_model("Russian", vocab_size = 50000, model_dir = path)
dl <- sentencepiece_download_model("English", vocab_size = 100000, model_dir = path)
dl <- sentencepiece_download_model("French", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("multi", vocab_size = 320000, model_dir = path)
dl <- sentencepiece_download_model("Vlaams", vocab_size = 1000, model_dir = path)
dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("nl", vocab_size = 25000, model_dir = path)
str(dl)
model     <- sentencepiece_load_model(dl$file_model)

##
## Download the tokeniser model + Glove embeddings of Byte Pairs
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = path)
str(dl)
model     <- sentencepiece_load_model(dl$file_model)
embedding <- read_word2vec(dl$glove$file_model)

##
## Download the tokeniser model + Glove embeddings to a temporary directory
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25,
                                   model_dir = tempdir())
str(dl)



sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.