is_chinese_char: Check whether cp is the codepoint of a CJK character.

View source: R/tokenization.R

is_chinese_char    R Documentation

Check whether cp is the codepoint of a CJK character.

Description

(R implementation of BasicTokenizer._is_chinese_char from tokenization.py in BERT. From that file: This defines a "Chinese character" as any character in the CJK Unicode block: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)

Usage

is_chinese_char(cp)

Arguments

cp

A single Unicode code point, as an integer.
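
As an illustration of what an integer code point looks like, base R's utf8ToInt() converts a single character to its code point. This is only a sketch of how a caller might prepare the cp argument; the variable names are made up for the example.

```r
# Convert single characters to integer Unicode code points with base R.
cp_han   <- utf8ToInt("\u4e2d")  # U+4E2D, a Han ideograph
cp_latin <- utf8ToInt("a")       # U+0061, Latin small letter a

cp_han    # 20013
cp_latin  # 97
```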

Details

Note that the CJK Unicode block does NOT contain all Japanese and Korean characters, despite its name. The modern Korean Hangul alphabet is in a different block, as are Japanese Hiragana and Katakana. Those alphabets are used to write space-separated words, so they are not treated specially and are handled like the alphabets of other languages.)
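
The distinction above can be sketched as a range check on the code point. The helper below is hypothetical and is not the package's code: its name and the list of ranges are assumptions, with the ranges taken to mirror those in the upstream Python function, so check R/tokenization.R before relying on them.

```r
# Hypothetical helper, not the package's implementation: TRUE if a single
# code point cp falls in any of the ranges listed below.
in_cjk_ranges_sketch <- function(cp) {
  # Assumed ranges, one (start, end) pair per row.
  ranges <- matrix(
    c(0x4E00,  0x9FFF,   # CJK Unified Ideographs
      0x3400,  0x4DBF,   # Extension A
      0x20000, 0x2A6DF,  # Extension B
      0x2A700, 0x2B73F,  # Extension C
      0x2B740, 0x2B81F,  # Extension D
      0x2B820, 0x2CEAF,  # Extension E
      0xF900,  0xFAFF,   # CJK Compatibility Ideographs
      0x2F800, 0x2FA1F), # CJK Compatibility Ideographs Supplement
    ncol = 2, byrow = TRUE
  )
  any(cp >= ranges[, 1] & cp <= ranges[, 2])
}

in_cjk_ranges_sketch(utf8ToInt("\u4e2d"))  # Han ideograph: TRUE
in_cjk_ranges_sketch(utf8ToInt("\u3042"))  # Hiragana: FALSE
in_cjk_ranges_sketch(utf8ToInt("\u30ab"))  # Katakana: FALSE
in_cjk_ranges_sketch(utf8ToInt("\ud55c"))  # Hangul syllable: FALSE
```

Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), and Hangul Syllables (U+AC00 to U+D7AF) all lie outside the listed ranges, which is why the last three calls return FALSE.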

Value

Logical: TRUE if cp is the code point of a CJK character, FALSE otherwise.


jonathanbratt/RBERT documentation built on Jan. 26, 2023, 4:15 p.m.