BPEembed: Tokenise and embed text using a Sentencepiece and Word2vec model

View source: R/bpemb.R


Tokenise and embed text using a Sentencepiece and Word2vec model

Description

Use a sentencepiece model to tokenise text and retrieve the word2vec embeddings of the resulting subword tokens.

Usage

BPEembed(
  file_sentencepiece = x$file_model,
  file_word2vec = x$glove.bin$file_model,
  x,
  normalize = TRUE
)

Arguments

file_sentencepiece

the path to the file containing the sentencepiece model

file_word2vec

the path to the file containing the word2vec embeddings

x

the result of a call to sentencepiece_download_model. If this is provided, arguments file_sentencepiece and file_word2vec do not need to be given, as their defaults are taken from x (see the sketch after this argument list).

normalize

passed on to read.wordvectors to read in file_word2vec. Defaults to TRUE.
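
As an alternative to supplying the two file paths, a BPEembed model can be built from the result of sentencepiece_download_model passed as argument x, in which case the file locations are taken from that object. A minimal sketch, assuming the download of a Dutch model with vocabulary size 1000 and embedding dimension 25 succeeds (not run):

## Not run:
dl      <- sentencepiece_download_model("Dutch", vocab_size = 1000, dim = 25)
encoder <- BPEembed(x = dl)
## End(Not run)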

Value

an object of class BPEembed which is a list with elements

  • model: a sentencepiece model as loaded with sentencepiece_load_model

  • embedding: a matrix with embeddings as loaded with read.wordvectors

  • dim: the dimension of the embedding

  • n: the number of elements in the vocabulary

  • file_sentencepiece: the sentencepiece model file

  • file_word2vec: the word2vec embedding file
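
The components of the returned object can be inspected directly; a small sketch, assuming encoder was created as in the Examples section below:

encoder$dim                  ## dimension of the embedding space
encoder$n                    ## number of subword tokens in the vocabulary
dim(encoder$embedding)       ## the n x dim embedding matrix
encoder$file_sentencepiece   ## path to the sentencepiece model file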

See Also

predict.BPEembed, sentencepiece_load_model, sentencepiece_download_model, read.wordvectors

Examples

##
## Example loading model from disk
##
folder    <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)  

## Do tokenisation with the sentencepiece model + embed these
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")  
str(values) 
values

## Decode the subword tokens back to text
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")
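
The encoded result is a list with one embedding matrix per input text, with one row per subword token. Averaging these rows is a simple way to obtain a single fixed-length vector per text; a minimal sketch, assuming values was created with type = "encode" as above:

## Average the subword embeddings into one vector per input text
sentence_vectors <- t(sapply(values, FUN = colMeans))
dim(sentence_vectors)   ## number of texts x embedding dimension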
