tokenizers: Unicode Alphabetic Tokenizer

tokenizers    R Documentation

Unicode Alphabetic Tokenizer

Description

A simple Unicode alphabetic tokenizer.

Usage

Unicode_alphabetic_tokenizer(x)

Arguments

x

a character vector.

Details

Tokenization first converts each element of x to its sequence of Unicode characters. Every character that does not have the Unicode Alphabetic property is then replaced by a blank, and the resulting strings are split at the blanks.
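
The replace-and-split procedure described here can be sketched in base R. This is only an illustration, not the package's implementation. The helper name alphabetic_tokenize_sketch is hypothetical, and the PCRE class \p{L} (general category Letter) is used as an approximation of the Alphabetic property, so results may differ from Unicode_alphabetic_tokenizer() for characters that are Alphabetic but not letters.

```r
# Hypothetical helper, not part of the Unicode package.
# Assumption: \p{L} (Letter) stands in for the Alphabetic property.
alphabetic_tokenize_sketch <- function(x) {
  x <- enc2utf8(as.character(x))
  # Replace each run of non-letter characters by a single blank.
  blanked <- gsub("[^\\p{L}]+", " ", x, perl = TRUE)
  # Split every string at the blanks and flatten into one vector.
  tokens <- as.character(unlist(strsplit(blanked, " ", fixed = TRUE),
                                use.names = FALSE))
  # Drop the empty strings that leading blanks leave behind.
  tokens[nzchar(tokens)]
}

alphabetic_tokenize_sketch(c("Hello, world! 42", "a-b"))
```

With the package installed, the corresponding call would be Unicode_alphabetic_tokenizer(c("Hello, world! 42", "a-b")).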

Value

A character vector with the tokenized strings.


Unicode documentation built on Sept. 30, 2022, 9:06 a.m.
