bpe_encode: Tokenise text using a Byte Pair Encoding model

View source: R/youtokentome.R

bpe_encode {tokenizers.bpe}	R Documentation

Tokenise text using a Byte Pair Encoding model

Description

Tokenise text using a Byte Pair Encoding model

Usage

bpe_encode(
  model,
  x,
  type = c("subwords", "ids"),
  bos = FALSE,
  eos = FALSE,
  reverse = FALSE
)

Arguments

model

an object of class youtokentome as returned by bpe_load_model

x

a character vector of text to tokenise

type

a character string, either 'subwords' or 'ids', indicating whether to return the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.

bos

logical. If TRUE, the 'beginning of sentence' token is added.

eos

logical. If TRUE, the 'end of sentence' token is added.

reverse

logical. If TRUE, the output sequence of tokens is reversed.

Examples

data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)
model
str(model$vocabulary)

text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")
bpe_encode(model, x = text, type = "ids")

encoded <- bpe_encode(model, x = text, type = "ids")
decoded <- bpe_decode(model, encoded)
decoded
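
## A hedged sketch: reload the trained model from disk with bpe_load_model
## and encode again (assumes the model file written by bpe() still exists)
reloaded <- bpe_load_model(model$model_path, threads = 1)
bpe_encode(reloaded, x = text, type = "subwords")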

## Remove the model file (Clean up for CRAN)
file.remove(model$model_path)
