dot-mp_tokenize_word_bidir: Tokenize a Word Bidirectionally

.mp_tokenize_word_bidir

R Documentation

Apply .mp_tokenize_word from both directions and pick the result with fewer pieces.

.mp_tokenize_word_bidir(
  word,
  vocab_split,
  unk_token,
  max_chars,
  allow_compounds = TRUE
)

`word`	Character scalar; word to tokenize.
`vocab_split`	List of character vectors containing vocabulary words. Should have components named "prefixes", "words", and "suffixes".
`unk_token`	Token to represent unknown words.
`max_chars`	Maximum length, in characters, of a word that will be recognized.
`allow_compounds`	Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option will not be exposed to end users; it is kept here for documentation and development purposes.
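
As an illustration of the shape described for `vocab_split`, the sketch below builds a list with the three named components. The entries are made up for this example and are not taken from any vocabulary shipped with the package; whether the real components store affix markers alongside the text is not stated here, so plain strings are an assumption.

```r
# Hypothetical toy vocabulary: the entries are illustrative only.
vocab_split <- list(
  prefixes = c("un", "re"),
  words    = c("happy", "do", "start"),
  suffixes = c("ness", "ing", "ed")
)

# Each component is a character vector, and the component names match
# those listed in the argument description.
names(vocab_split)
```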

Input word as a list of tokens.
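
The idea of tokenizing from both directions and keeping the shorter result can be sketched in a few lines. This is a simplified illustration, not the package's implementation: the helpers `tokenize_forward`, `tokenize_backward`, and `tokenize_bidir` are hypothetical, they use a single flat vocabulary with greedy longest-match instead of the prefix/word/suffix split, they ignore `max_chars` and `allow_compounds`, and they return a character vector instead of a list. Breaking ties in favor of the forward result is also an assumption of the sketch.

```r
# Greedy longest-match from the left: repeatedly take the longest
# vocabulary entry that the remaining text starts with.
tokenize_forward <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  rest <- word
  while (nchar(rest) > 0) {
    matched <- FALSE
    for (len in seq(nchar(rest), 1)) {
      candidate <- substr(rest, 1, len)
      if (candidate %in% vocab) {
        pieces <- c(pieces, candidate)
        rest <- substr(rest, len + 1, nchar(rest))
        matched <- TRUE
        break
      }
    }
    if (!matched) return(unk_token)  # no breakdown found from this side
  }
  pieces
}

# Greedy longest-match from the right: repeatedly take the longest
# vocabulary entry that the remaining text ends with.
tokenize_backward <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  rest <- word
  while (nchar(rest) > 0) {
    matched <- FALSE
    n <- nchar(rest)
    for (start in seq_len(n)) {  # start = 1 tries the longest suffix first
      candidate <- substr(rest, start, n)
      if (candidate %in% vocab) {
        pieces <- c(candidate, pieces)
        rest <- substr(rest, 1, start - 1)
        matched <- TRUE
        break
      }
    }
    if (!matched) return(unk_token)
  }
  pieces
}

# Run both directions and keep the result with fewer pieces.
tokenize_bidir <- function(word, vocab, unk_token = "[UNK]") {
  fwd <- tokenize_forward(word, vocab, unk_token)
  bwd <- tokenize_backward(word, vocab, unk_token)
  # A failed direction counts as infinitely long so it never wins.
  n_pieces <- function(x) if (identical(x, unk_token)) Inf else length(x)
  # Assumption: ties go to the forward result.
  if (n_pieces(bwd) < n_pieces(fwd)) bwd else fwd
}

toy_vocab <- c("re", "rest", "art", "started", "ed")

tokenize_forward("restarted", toy_vocab)   # "rest" "art" "ed"
tokenize_backward("restarted", toy_vocab)  # "re" "started"
tokenize_bidir("restarted", toy_vocab)     # "re" "started"
```

In this toy case the forward pass commits early to "rest" and ends up with three pieces, while the backward pass finds "started" first and needs only two, so the backward result is kept.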

macmillancontentscience/morphemepiece documentation built on April 19, 2022, 2:20 p.m.
