View source: R/sentencepiece.R
sentencepiece_encode: R Documentation
Tokenise text using a Sentencepiece model
sentencepiece_encode( model, x, type = c("subwords", "ids"), nbest = -1L, alpha = 0.1 )
model: an object of class sentencepiece, as returned by sentencepiece_load_model

x: a character vector of text (in UTF-8 encoding)

type: a character string, either 'subwords' or 'ids', to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.

nbest: an integer indicating the number of segmentations to extract. See the details. The argument is ignored if you do not provide a value for it.

alpha: smoothing parameter to perform subword regularisation. Typical values are 0.1, 0.2 or 0.5. See the details. The argument is ignored if you do not provide a value for it or do not provide a value for nbest.
If you specify alpha to perform subword regularisation, keep the following in mind:

- When alpha is 0.0, one segmentation is sampled uniformly from the nbest list or the lattice. Larger alpha values (e.g. 0.1) make the best Viterbi segmentation more likely to be sampled.
- If you provide a positive value for nbest, one segmentation is sampled from approximately the nbest best candidates.
- If you provide a negative value for nbest, one segmentation is sampled from the lattice of hypotheses according to the generation probabilities, using the forward-filtering and backward-sampling algorithm.

nbest and alpha correspond respectively to the parameters l and alpha in the paper https://arxiv.org/abs/1804.10959 (nbest < 0 means l = infinity).

If the model is a BPE model, alpha is the merge probability p explained in https://arxiv.org/abs/1910.13267. In a BPE model, nbest-based sampling is not supported, so the nbest parameter is ignored, although it still needs to be provided if you want to make use of alpha.
Value: a list with tokenised text, one element for each element of x, unless you provide nbest without providing alpha, in which case the result is a list of lists of nbest tokenised texts.
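A minimal sketch of the two return shapes, using the sample model bundled with the package (the exact subwords you see depend on the model's vocabulary):

```r
library(sentencepiece)
## Load the bundled unigram sample model
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)

## Default: a list with one vector of subwords per element of x
sentencepiece_encode(model, x = "Goed zo", type = "subwords")

## nbest without alpha: a list of lists, each inner list holding
## up to nbest alternative segmentations of the input
sentencepiece_encode(model, x = "Goed zo", type = "subwords", nbest = 2)
```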
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")

## Examples using subword regularisation
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
txt <- c("Goed zo", "On est d'accord")
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 2)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 2)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0)