stringi-search-charclass: Character Classes in 'stringi'
In stringi: THE string processing package for R


This man page describes how character classes are declared in the stringi package, so that you can search text for characters belonging to them.

All stri_*_charclass functions in stringi perform search-based operations on single characters (i.e. Unicode code points).

There are two separate ways to specify character classes in stringi:

by specifying a Unicode General Category, e.g. Lu for uppercase letters (a one- or two-letter identifier; the same identifier may be used in regexes, e.g. \p{Lu});
by specifying a Unicode Binary Property, e.g. WHITE_SPACE.

Both provide access to ICU's Unicode Character Database and are described in detail in the sections below.

Additionally, each class identifier may be preceded with '^', which requests the complement of the given character class, i.e. it matches characters that are not in the class.

Please note that some classes may seem to overlap without being identical. For example, General Category Z (some space) and Binary Property WHITE_SPACE match different character sets.
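The difference between Z and WHITE_SPACE can be checked outside of R. The following is a minimal sketch in Python (not stringi code), assuming only the standard unicodedata module. It shows that TAB and LF, which are listed under WHITE_SPACE below, have General Category Cc and therefore do not belong to Z. The sample names and the `in_z` dictionary are made up for this illustration.

```python
import unicodedata

# A few characters that are commonly thought of as "whitespace".
samples = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "LINE SEPARATOR": "\u2028",
    "TAB": "\t",
    "LINE FEED": "\n",
}

# Map each sample to its two-letter Unicode General Category.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

# A character belongs to General Category Z iff its category starts with "Z".
in_z = {name: cat.startswith("Z") for name, cat in categories.items()}

for name, cat in categories.items():
    print(f"{name:15s} {cat}  in Z: {in_z[name]}")
```

Here the space characters and the line separator fall into Z, while TAB and LF are control codes (Cc), so a search for Z alone would miss them.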

The Unicode General Category property of a code point provides the most general classification of that code point. Each code point falls into one and only one category.

Cc – a C0 or C1 control code;
Cf – a format control character;
Cn – a reserved unassigned code point or a non-character;
Co – a private-use character;
Cs – a surrogate code point;
Lc – the union of Lu, Ll, Lt;
Ll – a lowercase letter;
Lm – a modifier letter;
Lo – other letters, including syllables and ideographs;
Lt – a digraphic character, with first part uppercase;
Lu – an uppercase letter;
Mc – a spacing combining mark (positive advance width);
Me – an enclosing combining mark;
Mn – a non-spacing combining mark (zero advance width);
Nd – a decimal digit;
Nl – a letter-like numeric character;
No – a numeric character of other type;
Pd – a dash or hyphen punctuation mark;
Ps – an opening punctuation mark (of a pair);
Pe – a closing punctuation mark (of a pair);
Pc – a connecting punctuation mark, like a tie;
Po – a punctuation mark of other type;
Pi – an initial quotation mark;
Pf – a final quotation mark;
Sm – a symbol of mathematical use;
Sc – a currency sign;
Sk – a non-letter-like modifier symbol;
So – a symbol of other type;
Zs – a space character (of non-zero width);
Zl – U+2028 LINE SEPARATOR only;
Zp – U+2029 PARAGRAPH SEPARATOR only;
C – the union of Cc, Cf, Cs, Co, Cn;
L – the union of Lu, Ll, Lt, Lm, Lo;
M – the union of Mn, Mc, Me;
N – the union of Nd, Nl, No;
P – the union of Pc, Pd, Ps, Pe, Pi, Pf, Po;
S – the union of Sm, Sc, Sk, So;
Z – the union of Zs, Zl, Zp.
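To make the matching rules concrete, here is a minimal sketch in Python (not the stringi implementation), assuming the standard unicodedata module. The helper `in_class` is hypothetical. It mimics the conventions described above: a two-letter identifier names a single category, a one-letter identifier names a union, Lc stands for Lu, Ll and Lt, and a leading '^' requests the complement.

```python
import unicodedata


def in_class(ch: str, ident: str) -> bool:
    """Hypothetical helper: test one code point against a General Category identifier."""
    # A leading '^' means "match characters NOT in the class".
    negate = ident.startswith("^")
    if negate:
        ident = ident[1:]

    cat = unicodedata.category(ch)  # always a two-letter code, e.g. "Lu"

    if ident == "Lc":
        # Lc is the union of Lu, Ll, Lt.
        hit = cat in ("Lu", "Ll", "Lt")
    elif len(ident) == 1:
        # One-letter identifiers (C, L, M, N, P, S, Z) are unions of subcategories.
        hit = cat.startswith(ident)
    else:
        hit = cat == ident

    return hit != negate


print(in_class("A", "Lu"))       # uppercase letter
print(in_class("a", "^Lu"))      # complement of Lu
print(in_class("\u01c5", "Lt"))  # digraphic character, first part uppercase
print(in_class("5", "N"))        # union of Nd, Nl, No
```

Because each code point has exactly one General Category, a single lookup is enough to answer any of these queries.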

Binary Property identifiers are matched case-insensitively and are slightly normalized. A character may have many Binary Properties at the same time.

Here is the complete list of supported Binary Properties:

ALPHABETIC – alphabetic character;
ASCII_HEX_DIGIT – a character matching the [0-9A-Fa-f] regex;
BIDI_CONTROL – a format control character that has a specific function in the Bidi (bidirectional text) Algorithm;
BIDI_MIRRORED – a character that may change display in right-to-left text;
DASH – a variant of the dash character;
DEFAULT_IGNORABLE_CODE_POINT – characters that are ignorable in most text processing activities, e.g. <2060..206F, FFF0..FFFB, E0000..E0FFF>;
DEPRECATED – a deprecated character according to the current Unicode standard (the usage of deprecated characters is strongly discouraged);
DIACRITIC – a character that linguistically modifies the meaning of another character to which it applies;
EXTENDER – a character that extends the value or shape of a preceding alphabetic character, e.g. a length mark or an iteration mark;
FULL_COMPOSITION_EXCLUSION ;
GRAPHEME_BASE ;
GRAPHEME_EXTEND ;
GRAPHEME_LINK ;
HEX_DIGIT – a character commonly used for hexadecimal numbers, cf. also ASCII_HEX_DIGIT;
HYPHEN – a dash used to mark connections between pieces of words, plus the Katakana middle dot;
ID_CONTINUE – a character that can continue an identifier, ID_START+Mn+Mc+Nd+Pc;
ID_START – a character that can start an identifier, Lu+Ll+Lt+Lm+Lo+Nl;
IDEOGRAPHIC – a CJKV (Chinese-Japanese-Korean-Vietnamese) ideograph;
IDS_BINARY_OPERATOR ;
IDS_TRINARY_OPERATOR ;
JOIN_CONTROL ;
LOGICAL_ORDER_EXCEPTION ;
LOWERCASE ;
MATH ;
NONCHARACTER_CODE_POINT ;
QUOTATION_MARK ;
RADICAL ;
SOFT_DOTTED – a character with a “soft dot”, like i or j, such that an accent placed on this character causes the dot to disappear;
TERMINAL_PUNCTUATION – a punctuation character that generally marks the end of textual units;
UNIFIED_IDEOGRAPH ;
UPPERCASE ;
WHITE_SPACE – a space character, TAB, CR, or LF, excluding ZWSP and ZWNBSP;
XID_CONTINUE ;
XID_START ;
CASE_SENSITIVE ;
S_TERM ;
VARIATION_SELECTOR ;
NFD_INERT ;
NFKD_INERT ;
NFC_INERT ;
NFKC_INERT ;
SEGMENT_STARTER ;
PATTERN_SYNTAX ;
PATTERN_WHITE_SPACE ;
POSIX_ALNUM ;
POSIX_BLANK ;
POSIX_GRAPH ;
POSIX_PRINT ;
POSIX_XDIGIT ;
CASED ;
CASE_IGNORABLE ;
CHANGES_WHEN_LOWERCASED ;
CHANGES_WHEN_UPPERCASED ;
CHANGES_WHEN_TITLECASED ;
CHANGES_WHEN_CASEFOLDED ;
CHANGES_WHEN_CASEMAPPED ;
CHANGES_WHEN_NFKC_CASEFOLDED.
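
Some of the properties above are defined in terms of a regex or of General Categories, which makes them easy to approximate. The following is a minimal sketch in Python (not stringi code), assuming the standard re and unicodedata modules. All function names are hypothetical. The ID_START and ID_CONTINUE checks follow only the simplified formulas given in the list, so the actual ICU property data may differ for individual code points.

```python
import re
import unicodedata

# ASCII_HEX_DIGIT is described above as matching the [0-9A-Fa-f] regex.
ASCII_HEX = re.compile(r"[0-9A-Fa-f]")


def is_ascii_hex_digit(ch: str) -> bool:
    return ASCII_HEX.fullmatch(ch) is not None


# Simplified formulas taken from the list above (an approximation only).
ID_START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATS = ID_START_CATS | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    return unicodedata.category(ch) in ID_START_CATS


def approx_id_continue(ch: str) -> bool:
    return unicodedata.category(ch) in ID_CONTINUE_CATS


print(is_ascii_hex_digit("f"), is_ascii_hex_digit("g"))
print(approx_id_start("x"), approx_id_start("1"))
print(approx_id_continue("1"), approx_id_continue("_"))
```

Note that a fullwidth letter such as U+FF21 is rejected by `is_ascii_hex_digit`, which is the distinction between ASCII_HEX_DIGIT and the broader HEX_DIGIT property.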