morphemepiece_tokenize: Tokenize Sequence with Morpheme Pieces

View source: R/tokenize.R

morphemepiece_tokenize    R Documentation

Tokenize Sequence with Morpheme Pieces

Description

Given a single sequence of text and a morphemepiece vocabulary and lookup table, tokenizes the text into morpheme pieces.

Usage

morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character scalar; text to tokenize.

vocab

A morphemepiece vocabulary.

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Integer; maximum length (in characters) of a single word to tokenize.

Value

A character vector of tokenized text. (Later, this should be a named integer vector, as in the wordpiece package.)
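
Examples

A minimal usage sketch (not taken from the package itself), assuming the morphemepiece package and its default vocabulary and lookup data are available; the exact pieces returned depend on the vocabulary that is loaded.

library(morphemepiece)

# Load the default vocabulary and lookup table.
vocab <- morphemepiece_vocab()
lookup <- morphemepiece_lookup()

# Tokenize a single sequence of text into morpheme pieces.
morphemepiece_tokenize(
  "Surprisingly, the tokenizer handled it!",
  vocab = vocab,
  lookup = lookup
)
# Returns a character vector of morpheme pieces; the exact pieces
# (e.g. a word split into a root plus continuation pieces)
# depend on the vocabulary and lookup in use.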
