dot-mp_tokenize_word_bidir: Tokenize a Word Bidirectionally

.mp_tokenize_word_bidir    R Documentation

Tokenize a Word Bidirectionally

Description

Apply .mp_tokenize_word from both directions and pick the result with fewer pieces.
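
The selection step can be illustrated with a minimal sketch. The helper name `pick_fewer_pieces` and the toy token lists are hypothetical and not part of the package; the tie-break (keep the first candidate when both have the same length) is also an assumption, since the description above does not say how ties are resolved.

```r
# Hypothetical helper (not part of morphemepiece): given two candidate
# tokenizations of the same word, return the one with fewer pieces.
# Assumption: on a tie, the first candidate is kept.
pick_fewer_pieces <- function(tokens_a, tokens_b) {
  if (length(tokens_b) < length(tokens_a)) tokens_b else tokens_a
}

# Toy candidates standing in for the results of the two directional passes.
pass_one <- list("un", "break", "able")
pass_two <- list("unbreak", "able")

result <- pick_fewer_pieces(pass_one, pass_two)
```

Here `result` is the two-piece candidate, because it has fewer pieces than the three-piece one.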

Usage

.mp_tokenize_word_bidir(
  word,
  vocab_split,
  unk_token,
  max_chars,
  allow_compounds = TRUE
)

Arguments

word

Character scalar; word to tokenize.

vocab_split

List of character vectors containing vocabulary words. Should have components named "prefixes", "words", "suffixes".

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word that will be recognized.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option is not exposed to end users; it is kept here for documentation and development purposes.
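
As a rough illustration of the shape expected for `vocab_split`, the sketch below builds a list with the three named components. The entries are toy values chosen for illustration only; whether real vocabulary entries carry marker characters is not stated here, so plain strings are assumed.

```r
# Hypothetical toy vocabulary; real entries come from the package's vocabulary.
# Assumption: each component is a plain character vector.
vocab_split <- list(
  prefixes = c("un", "re"),
  words    = c("break", "do"),
  suffixes = c("able", "ing")
)

# Check that the three documented component names are present.
has_components <- all(c("prefixes", "words", "suffixes") %in% names(vocab_split))
```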

Value

Input word as a list of tokens.


morphemepiece documentation built on April 16, 2022, 5:05 p.m.