BPEembedder: Build a BPEembed model containing a Sentencepiece and Word2vec model

View source: R/bpemb.R

BPEembedder                R Documentation

Build a BPEembed model containing a Sentencepiece and Word2vec model

Description

Build a sentencepiece model on text and train a matching word2vec embedding model on the resulting sentencepiece vocabulary.

Usage

BPEembedder(
  x,
  tokenizer = c("bpe", "char", "unigram", "word"),
  args = list(vocab_size = 8000, coverage = 0.9999),
  ...
)

Arguments

x

a data.frame with columns doc_id and text

tokenizer

character string with the type of sentencepiece tokenizer: either 'bpe', 'char', 'unigram' or 'word', for Byte Pair Encoding, character-level encoding, unigram encoding or pretokenised word encoding respectively. Defaults to 'bpe' (Byte Pair Encoding). Passed on to sentencepiece

args

a list of arguments passed on to sentencepiece

...

arguments passed on to word2vec for training a word2vec model

Value

an object of class BPEembed which is a list with elements

  • model: a sentencepiece model as loaded with sentencepiece_load_model

  • embedding: a matrix with embeddings as loaded with read.wordvectors

  • dim: the dimension of the embedding

  • n: the number of elements in the vocabulary

  • file_sentencepiece: the sentencepiece model file

  • file_word2vec: the word2vec embedding file
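
A minimal sketch of inspecting these elements, assuming a model fitted as in the Examples section below (the vocab_size and word2vec settings shown here are illustrative, not defaults):

```r
library(tokenizers.bpe)
library(sentencepiece)
data(belgium_parliament, package = "tokenizers.bpe")
x     <- subset(belgium_parliament, language %in% "dutch")
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 5)
## Elements of the returned BPEembed list, as documented above
model$dim                  ## dimension of the embedding (here 20)
model$n                    ## number of elements in the vocabulary
dim(model$embedding)       ## embedding matrix: one row per vocabulary element
model$file_sentencepiece   ## path to the sentencepiece model file on disk
model$file_word2vec        ## path to the word2vec embedding file on disk
```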

See Also

sentencepiece, word2vec, predict.BPEembed

Examples

library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
## Build a BPE sentencepiece model with a vocabulary of 1000 tokens on the
## Dutch texts, plus a matching cbow word2vec model with 20-dimensional vectors
x     <- subset(belgium_parliament, language %in% "dutch")
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model

## Encode new text: tokenise with the sentencepiece model and look up
## the word2vec embeddings of the resulting tokens
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
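
A follow-on sketch, assuming predict.BPEembed also accepts type = "decode" to map embedding matrices back to text (see the predict.BPEembed help page for the supported types):

```r
## Inspect the encode result: one embedding matrix per input sentence
str(values)
## Decode the embeddings back into text
decoded <- predict(model, values, type = "decode")
decoded
```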

sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.