dot-mp_tokenize_single_string: Tokenize an Input Word-by-word


Description

Tokenizes a single input, already split on spaces into words, one word at a time.

Usage

.mp_tokenize_single_string(words, vocab, lookup, unk_token, max_chars)

Arguments

words

Character; a vector of words (generated by space-tokenizing a single input).

vocab

Named integer vector containing vocabulary words. Should have a "vocab_split" attribute, a list with components named "prefixes", "words", and "suffixes".

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word that will be recognized.
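
As a rough illustration of the shape described for the vocab argument, the following base R sketch builds a named integer vector and attaches a "vocab_split" attribute with the three named components. The tokens, the "##" affix markers, the 0-based ids, and the contents of each component are assumptions made for this example; a real vocabulary would normally come from the package's own loading functions rather than being built by hand.

```r
# Toy vocabulary: names are tokens, values are integer ids.
# The tokens and the 0-based ids are illustrative assumptions.
toy_tokens <- c("[UNK]", "un##", "happy", "##ness")
toy_vocab <- stats::setNames(seq_along(toy_tokens) - 1L, toy_tokens)

# Attach the "vocab_split" attribute with the three documented components.
# How the real package fills these components is not shown on this page;
# the values below are placeholders.
attr(toy_vocab, "vocab_split") <- list(
  prefixes = "un##",
  words = c("[UNK]", "happy"),
  suffixes = "##ness"
)

# Inspect the result: a named integer vector plus the extra attribute.
str(toy_vocab)
```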

Value

A named integer vector of tokenized words.
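
To make the input and output shapes concrete, here is a minimal base R sketch of a word-by-word pass that returns a named integer vector. It is not the package's implementation: toy_tokenize_words is a hypothetical stand-in, it skips the lookup table and any morpheme splitting entirely, and mapping over-long or unknown words to unk_token is an assumption made for this example.

```r
# Hypothetical stand-in, NOT the package's implementation.
# It only looks whole words up in the vocabulary.
toy_tokenize_words <- function(words, vocab, unk_token, max_chars) {
  # Assumption: words that are too long, or absent from the vocabulary,
  # are replaced by unk_token.
  known <- nchar(words) <= max_chars & words %in% names(vocab)
  tokens <- ifelse(known, words, unk_token)
  # Indexing by name keeps the token names and returns the integer ids.
  vocab[tokens]
}

toy_vocab <- c("[UNK]" = 0L, "the" = 1L, "cat" = 2L)

# Space-tokenize a single input, as described for the words argument.
words <- strsplit("the cat meowed", " ", fixed = TRUE)[[1]]

result <- toy_tokenize_words(
  words, toy_vocab,
  unk_token = "[UNK]", max_chars = 100
)
print(result)
```

In this sketch the names of the result are the tokens and the values are their vocabulary ids, which is one plausible reading of "a named integer vector of tokenized words".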


morphemepiece documentation built on Dec. 11, 2021, 9:56 a.m.