tokenize_chinese_chars: Add whitespace around any CJK character.

View source: R/tokenization.R

tokenize_chinese_chars    R Documentation

Add whitespace around any CJK character.

Description

(R implementation of BasicTokenizer._tokenize_chinese_chars from BERT's tokenization.py.) This may result in doubled-up spaces, for example between two consecutive CJK characters, but that matches the behavior of the Python code.
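
The doubled-up spaces arise because each CJK character is padded independently. A minimal sketch of the idea in plain R (not the package's implementation; it uses a PCRE \p{Han} class as a stand-in for the explicit CJK code-point ranges checked by the BERT Python code):

    ## Sketch only: pad every Han character with a space on each side.
    wrap_han <- function(text) {
      gsub("(\\p{Han})", " \\1 ", text, perl = TRUE)
    }
    wrap_han("BERT是预训练模型")
    ## [1] "BERT 是  预  训  练  练..." -> consecutive CJK characters end up
    ## separated by two spaces, which is the doubled-up-space behavior noted above.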

Usage

tokenize_chinese_chars(text)

Arguments

text

A character scalar.

Value

Text with spaces around CJK characters.
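
Examples

A small usage sketch, assuming RBERT is installed; if the function is not exported from the package namespace, it can be reached with RBERT:::tokenize_chinese_chars:

    library(RBERT)
    tokenize_chinese_chars("BERT是预训练模型")
    ## Spaces are added around each CJK character; adjacent CJK characters
    ## are separated by two spaces, mirroring the BERT Python tokenizer.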
