bpe_encode: Tokenise text alongside a Byte Pair Encoding model
In tokenizers.bpe: Byte Pair Encoding Text Tokenization

bpe_encode

R Documentation

Tokenise text alongside a Byte Pair Encoding model

Description

Tokenise text using a Byte Pair Encoding model, returning either the subwords or their ids in the model vocabulary.

Usage

bpe_encode(
  model,
  x,
  type = c("subwords", "ids"),
  bos = FALSE,
  eos = FALSE,
  reverse = FALSE
)

Arguments

`model`	an object of class `youtokentome`, as returned by `bpe_load_model` (or by `bpe`, as in the Examples)
`x`	a character vector of text to tokenise
`type`	a character string, either 'subwords' or 'ids', selecting whether to return the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.
`bos`	logical; if TRUE, a 'beginning of sentence' token is added. Defaults to FALSE.
`eos`	logical; if TRUE, an 'end of sentence' token is added. Defaults to FALSE.
`reverse`	logical; if TRUE, the output sequence of tokens is reversed. Defaults to FALSE.
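
The effect of the `bos`, `eos` and `reverse` arguments is easiest to see by comparing outputs side by side. The following is a minimal sketch, assuming the package is installed and reusing the training settings from the Examples section; the variable names `plain`, `marked` and `reversed` are illustrative only, and no particular token ids or subwords are assumed.

```r
library(tokenizers.bpe)

# Train a small model on the French subset of the bundled example data,
# using the same settings as in the Examples section.
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)

text <- "Proportion de femmes dans les situations de famille monoparentale."

# Baseline: ids without sentence markers.
plain <- bpe_encode(model, x = text, type = "ids")

# Same text with 'beginning of sentence' and 'end of sentence' tokens added.
marked <- bpe_encode(model, x = text, type = "ids", bos = TRUE, eos = TRUE)

# Same text with the output sequence of ids reversed.
reversed <- bpe_encode(model, x = text, type = "ids", reverse = TRUE)

# Print the three encodings to compare them.
plain
marked
reversed

# Remove the model file written by bpe().
file.remove(model$model_path)
```

Comparing `marked` with `plain` shows which ids the model uses for the sentence markers, which is useful to know before passing the ids to a downstream model.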

Examples

## Train a BPE model on the French subset of the example data
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)
model
str(model$vocabulary)

## Tokenise new text into subwords or into vocabulary ids
text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")
bpe_encode(model, x = text, type = "ids")

## Encode to ids and decode these back to text
encoded <- bpe_encode(model, x = text, type = "ids")
decoded <- bpe_decode(model, encoded)
decoded

## Remove the model file (Clean up for CRAN)
file.remove(model$model_path)

tokenizers.bpe documentation built on Sept. 16, 2023, 1:06 a.m.