sentencepiece_download_model                R Documentation
Description

Download pretrained models built on Wikipedia, made available at https://bpemb.h-its.org through https://github.com/bheinzerling/bpemb. These consist of Byte Pair Encoding models trained with sentencepiece, together with Glove embeddings of the resulting Byte Pair subwords. Models are available for 275 languages.
Usage

sentencepiece_download_model(
  language,
  vocab_size,
  dim,
  model_dir = system.file(package = "sentencepiece", "models")
)
Arguments

language: a character string with the language name. This can be either a plain language name or a Wikipedia shorthand.

vocab_size: integer indicating the number of tokens in the final vocabulary. Defaults to 5000. Possible values depend on the language; to inspect them, type sentencepiece:::.bpemb$vocab_sizes and look up the language of your choice (see the sketch after this argument list).

dim: dimension of the embedding. Either 25, 50, 100, 200 or 300.

model_dir: path to the location where the model will be downloaded to. Defaults to system.file(package = "sentencepiece", "models"), the models folder of the installed sentencepiece package.
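To see which vocabulary sizes exist for a given language, the internal lookup table mentioned under vocab_size can be inspected before downloading. A minimal sketch; the exact structure of the unexported .bpemb object is an assumption here and may differ between package versions:

library(sentencepiece)

## Unexported lookup table of the BPEmb collection (hence the ::: access).
sizes <- sentencepiece:::.bpemb$vocab_sizes

## Assumed here to be indexable by the Wikipedia language code, e.g. Dutch:
sizes[["nl"]]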
Value

A list with elements:

language: the provided language

wikicode: the Wikipedia code of the provided language

file_model: the path to the downloaded Sentencepiece model

url: the url where the Sentencepiece model was fetched from

download_failed: logical, indicating if the download failed

download_message: a character string with possible download failure information

glove: a list with elements file_model, url, download_failed and download_message, giving the path to the Glove embeddings in txt format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.

glove.bin: a list with the same elements, giving the path to the Glove embeddings in bin format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.
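Because download success is reported in the return value rather than raised as an error, it is worth checking download_failed before loading the model. A minimal sketch using only the elements documented above:

library(sentencepiece)

dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = tempdir())

## The returned list flags download problems instead of raising an error.
if (isTRUE(dl$download_failed)) {
  message("Download failed: ", dl$download_message)
} else {
  model <- sentencepiece_load_model(dl$file_model)
}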
See Also

sentencepiece_load_model
Examples

path <- getwd()

##
## Download only the tokeniser model
##
dl <- sentencepiece_download_model("Russian", vocab_size = 50000, model_dir = path)
dl <- sentencepiece_download_model("English", vocab_size = 100000, model_dir = path)
dl <- sentencepiece_download_model("French", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("multi", vocab_size = 320000, model_dir = path)
dl <- sentencepiece_download_model("Vlaams", vocab_size = 1000, model_dir = path)
dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("nl", vocab_size = 25000, model_dir = path)
str(dl)
model <- sentencepiece_load_model(dl$file_model)

##
## Download the tokeniser model + Glove embeddings of Byte Pairs
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = path)
str(dl)
model <- sentencepiece_load_model(dl$file_model)
embedding <- read_word2vec(dl$glove$file_model)

dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25, model_dir = tempdir())
str(dl)
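As a follow-on to the examples, the downloaded model is typically used to tokenise text. A short sketch, assuming sentencepiece_encode() from the same package (it is not documented on this page):

library(sentencepiece)

dl <- sentencepiece_download_model("nl", vocab_size = 1000, model_dir = tempdir())
model <- sentencepiece_load_model(dl$file_model)

## sentencepiece_encode() is assumed from the sentencepiece package itself;
## it splits text into the Byte Pair subwords of the downloaded model.
sentencepiece_encode(model, "Dit is een voorbeeldzin.")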