dot-mp_tokenize_word_bidir: Tokenize a Word Bidirectionally


Description

Apply .mp_tokenize_word to the word in both directions and return whichever result has fewer pieces.
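
The idea of tokenizing in both directions and keeping the shorter result can be sketched with a toy example. This is only an illustration under stated assumptions, not the package's implementation: the helpers greedy_forward, greedy_backward and tokenize_bidir, the flat character vocabulary, the "[UNK]" fallback, and the rule that ties go to the forward result are all hypothetical.

```r
# Toy vocabulary: a plain character vector (hypothetical; the real function
# expects a named integer vector with a "vocab_split" attribute).
toy_vocab <- c("abc", "ab", "cde", "d", "e")

# Greedy longest match starting from the left end of the word.
# Returns NULL if some position cannot be matched.
greedy_forward <- function(word, vocab) {
  pieces <- character(0)
  i <- 1L
  n <- nchar(word)
  while (i <= n) {
    found <- FALSE
    for (j in seq(n, i)) {            # try the longest candidate first
      cand <- substr(word, i, j)
      if (cand %in% vocab) {
        pieces <- c(pieces, cand)
        i <- j + 1L
        found <- TRUE
        break
      }
    }
    if (!found) return(NULL)
  }
  pieces
}

# Greedy longest match starting from the right end of the word.
greedy_backward <- function(word, vocab) {
  pieces <- character(0)
  j <- nchar(word)
  while (j >= 1L) {
    found <- FALSE
    for (i in seq(1L, j)) {           # try the longest candidate first
      cand <- substr(word, i, j)
      if (cand %in% vocab) {
        pieces <- c(cand, pieces)
        j <- i - 1L
        found <- TRUE
        break
      }
    }
    if (!found) return(NULL)
  }
  pieces
}

# Run both directions and keep the result with fewer pieces.
# Ties go to the forward result (an assumption of this sketch).
tokenize_bidir <- function(word, vocab, unk_token = "[UNK]") {
  candidates <- Filter(
    Negate(is.null),
    list(greedy_forward(word, vocab), greedy_backward(word, vocab))
  )
  if (length(candidates) == 0L) return(unk_token)
  candidates[[which.min(lengths(candidates))]]
}

tokenize_bidir("abcde", toy_vocab)
```

In this toy case the forward pass yields "abc", "d", "e" (three pieces) while the backward pass yields "ab", "cde" (two pieces), so the backward result is kept.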

Usage

.mp_tokenize_word_bidir(
  word,
  vocab,
  unk_token,
  max_chars,
  allow_compounds = TRUE
)

Arguments

word

Character scalar; word to tokenize.

vocab

Named integer vector containing vocabulary words. Should have "vocab_split" attribute, with components named "prefixes", "words", "suffixes".
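
As a rough sketch of the shape described above, a small vocabulary object could be built as follows. The token spellings (including the "##" markers), the integer ids, and the contents of each component are assumptions made for illustration; only the named-integer-vector type and the "vocab_split" attribute with its three component names come from the argument description.

```r
# Hypothetical toy vocabulary: names are tokens, values are integer ids.
vocab <- c("un##" = 0L, "break" = 1L, "##able" = 2L)

# Attach the "vocab_split" attribute with the three documented components.
# The exact contents of each component are an assumption of this sketch.
attr(vocab, "vocab_split") <- list(
  prefixes = "un",
  words    = "break",
  suffixes = "able"
)

str(attributes(vocab))
```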

unk_token

Token to represent unknown words.

max_chars

Maximum word length, in characters, that will be recognized.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option is not exposed to end users; it is kept here for documentation and development purposes.

Value

The input word, broken into a list of tokens.


morphemepiece documentation built on Dec. 11, 2021, 9:56 a.m.