predict.BPEembed: Encode and Decode alongside a BPEembed model

View source: R/bpemb.R

predict.BPEembed    R Documentation

Encode and Decode alongside a BPEembed model

Description

Use the sentencepiece model to do one of the following:

  • encode: tokenise the text and embed the resulting tokens

  • decode: reconstruct the original, untokenised text from tokenised data

  • tokenize: tokenise the text with the sentencepiece model, without embedding it

Usage

## S3 method for class 'BPEembed'
predict(object, newdata, type = c("encode", "decode", "tokenize"), ...)
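
The signature above follows a common R idiom: a vector of allowed choices as the default for type, plus ... for extra arguments. The sketch below shows that idiom on a hypothetical toy class (toy_model is not part of sentencepiece). It assumes the choice is resolved with match.arg, which this page does not confirm for predict.BPEembed.

```r
# Hypothetical toy class "toy_model"; an illustration of the calling
# convention only, not the sentencepiece implementation.
predict.toy_model <- function(object, newdata,
                              type = c("encode", "decode", "tokenize"), ...) {
  # match.arg() selects the first choice when `type` is not supplied
  # and raises an error for any value outside the allowed set.
  type <- match.arg(type)
  switch(type,
    encode   = paste("encode:", newdata),
    decode   = paste("decode:", newdata),
    tokenize = paste("tokenize:", newdata, ...)
  )
}

toy <- structure(list(), class = "toy_model")
predict(toy, "abc")                             # "encode: abc"
predict(toy, "abc", type = "tokenize", "ids")   # "tokenize: abc ids"
```

Under this assumption, omitting type selects "encode", and any unnamed trailing argument travels through ... to the underlying call.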

Arguments

object

an object of class BPEembed as returned by BPEembed

newdata

for types 'encode' and 'tokenize': a character vector of text; for type 'decode': a character vector of encoded tokens, or a list of such vectors

type

character string, one of 'encode', 'decode' or 'tokenize'

...

further arguments passed on to the methods

Value

  • if type is 'encode': a list of matrices, one per text, containing the embeddings of the tokens obtained by tokenising the text with sentencepiece_encode

  • if type is 'decode': a character vector of decoded text as returned by sentencepiece_decode

  • if type is 'tokenize': the tokenised text as returned by sentencepiece_encode
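
To make the shape of the 'encode' result concrete, here is a minimal base R sketch of a token-to-embedding lookup. The vocabulary, the 3-dimensional vectors and the token lists are all made up for illustration, and the code only mimics the structure described above (one matrix per text, one row per token); it is not how the package computes its result.

```r
# Toy stand-in for an embedding table: one row per subword token.
# Tokens and vectors are invented; a real model has its own vocabulary.
vocab <- c("_de", "_is", "_in", "ge", "wik", "keld")
set.seed(1)
emb <- matrix(rnorm(length(vocab) * 3), nrow = length(vocab),
              dimnames = list(vocab, NULL))

# Hypothetical tokeniser output: one character vector of subwords per text.
tokens <- list(c("_de", "_is"), c("_in", "ge", "wik", "keld"))

# Look up the row of each token; the result is a list with one matrix per text.
values <- lapply(tokens, function(tok) emb[tok, , drop = FALSE])

str(values)
rownames(values[[2]])   # the tokens of the second text
```

Because the tokens are kept as row names, a structure like this can be turned back into token vectors with rownames, which is the pattern the Examples section uses before decoding.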

See Also

BPEembed, sentencepiece_decode, sentencepiece_encode

Examples

library(sentencepiece)

## Load the sentencepiece model and the embeddings shipped with the package
embedding <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)

## Encode: tokenise and embed the text, giving one matrix per text
txt      <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
              "On est d'accord sur le prix de la biere?")
values   <- predict(encoder, txt, type = "encode")
str(values)
values

## Decode: pass the tokens of one text, or a list of tokens for all texts
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")

## Tokenize only: the trailing argument is passed on through ...
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
predict(encoder, txt, type = "tokenize", "subwords")
predict(encoder, txt, type = "tokenize", "ids")

sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.