BPEembedder (R Documentation)

Build a sentencepiece model on text and a matching word2vec model on the resulting sentencepiece vocabulary
Usage

BPEembedder(
  x,
  tokenizer = c("bpe", "char", "unigram", "word"),
  args = list(vocab_size = 8000, coverage = 0.9999),
  ...
)
Arguments

x: a data.frame with columns doc_id and text

tokenizer: character string with the type of sentencepiece tokenizer. Either 'bpe', 'char', 'unigram' or 'word' for Byte Pair Encoding, character-level encoding, unigram encoding or pretokenised word encoding. Defaults to 'bpe' (Byte Pair Encoding). Passed on to sentencepiece.

args: a list of arguments passed on to sentencepiece

...: arguments passed on to word2vec
Value

an object of class BPEembed, which is a list with elements:
model: a sentencepiece model as loaded with sentencepiece_load_model
embedding: a matrix with embeddings as loaded with read.wordvectors
dim: the dimension of the embedding
n: the number of elements in the vocabulary
file_sentencepiece: the sentencepiece model file
file_word2vec: the word2vec embedding file
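The list elements above can be inspected directly on a fitted model. A minimal sketch (an illustration, not package documentation; the object `model` is assumed to have been built with BPEembedder as in the Examples section):

```r
## Sketch: inspect a fitted BPEembed object.
## Assumes `model` was created with BPEembedder() as shown in the Examples.
model$dim                 # dimensionality of the embedding vectors
model$n                   # number of subword tokens in the vocabulary
dim(model$embedding)      # matrix of subword embeddings (vocabulary x dim)
model$file_sentencepiece  # path to the stored sentencepiece model file
model$file_word2vec       # path to the stored word2vec embedding file
```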
See Also

sentencepiece, word2vec, predict.BPEembed
Examples

library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
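As a hedged follow-up to the example above, the encoded output can be inspected with str(). This assumes (check ?predict.BPEembed to confirm) that predict() with type = "encode" returns one matrix of subword embeddings per input text:

```r
## Sketch continuing the example above; assumes `model`, `txt` and `values`
## exist from the previous lines.
str(values)
## Each element is expected to be a numeric matrix whose number of columns
## equals the embedding dimension chosen when fitting, i.e. model$dim.
```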