This man page describes how character classes are declared in the stringi package so that you can search for occurrences of their members in text.
All stri_*_charclass functions in stringi perform search-based operations on single characters (i.e. Unicode code points).
There are two separate ways to specify character classes in stringi:
by naming a Unicode General Category, e.g. Lu for uppercase letters (a 1-2 letter identifier; the same identifier may be used in regexes, e.g. \p{Lu});
by requesting a Unicode Binary Property, e.g. WHITE_SPACE.
Both provide access to ICU's Unicode Character Database and are described in detail in the sections below.
Additionally, each class identifier may be preceded by '^', which requests the complement of the given character class, i.e. it matches characters that are not in the class.
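The complement prefix can be illustrated outside stringi with a minimal Python sketch. The helper `in_class` is hypothetical (it is not part of stringi or ICU), handles only General Category identifiers other than Lc, and relies on Python's standard `unicodedata` module, whose Unicode version may differ from the one bundled with ICU.

```python
import unicodedata


def in_class(ch, spec):
    """Return True if the single code point `ch` belongs to the class `spec`.

    `spec` is a 1-2 letter General Category identifier, optionally preceded
    by '^' to request the complement. Hypothetical helper, for illustration.
    """
    negate = spec.startswith("^")
    name = spec[1:] if negate else spec
    # A one-letter identifier such as "L" matches every category starting
    # with that letter (Lu, Ll, Lt, Lm, Lo); a two-letter one matches exactly.
    member = unicodedata.category(ch).startswith(name)
    return member != negate


print(in_class("A", "Lu"))   # "A" is an uppercase letter
print(in_class("A", "^Lu"))  # so it is not in the complement of Lu
print(in_class("a", "^Lu"))  # "a" is lowercase, hence in the complement
```

The `member != negate` comparison simply flips the membership result when the '^' prefix is present.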
Please note that some classes may seem to overlap. However, e.g. General Category Z (some space) and Binary Property WHITE_SPACE match different character sets.
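The difference between these two classes can be checked with a short Python sketch. It does not use stringi or ICU: the White_Space code points are transcribed by hand from the Unicode Character Database (an assumption of this sketch), and General Categories come from Python's `unicodedata`, whose Unicode version may differ from ICU's.

```python
import sys
import unicodedata

# Code points carrying the White_Space binary property, transcribed from the
# Unicode Character Database (PropList.txt); assumed current for this sketch.
WHITE_SPACE = (
    set(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Code points whose General Category is Zs, Zl or Zp, i.e. category Z.
CATEGORY_Z = {
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("Z")
}

# Code points that have White_Space but fall outside General Category Z.
only_white_space = sorted(WHITE_SPACE - CATEGORY_Z)
print([f"U+{cp:04X}" for cp in only_white_space])
```

Control characters such as TAB (U+0009) have General Category Cc yet carry White_Space, which is one reason the two classes do not coincide.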
The Unicode General Category property of a code point provides the most general classification of that code point. Each code point falls into one and only one Category.
Cc – a C0 or C1 control code;
Cf – a format control character;
Cn – a reserved unassigned code point or a non-character;
Co – a private-use character;
Cs – a surrogate code point;
Lc – the union of Lu, Ll, Lt;
Ll – a lowercase letter;
Lm – a modifier letter;
Lo – other letters, including syllables and ideographs;
Lt – a digraphic character, with first part uppercase;
Lu – an uppercase letter;
Mc – a spacing combining mark (positive advance width);
Me – an enclosing combining mark;
Mn – a non-spacing combining mark (zero advance width);
Nd – a decimal digit;
Nl – a letter-like numeric character;
No – a numeric character of other type;
Pd – a dash or hyphen punctuation mark;
Ps – an opening punctuation mark (of a pair);
Pe – a closing punctuation mark (of a pair);
Pc – a connecting punctuation mark, like a tie;
Po – a punctuation mark of other type;
Pi – an initial quotation mark;
Pf – a final quotation mark;
Sm – a symbol of mathematical use;
Sc – a currency sign;
Sk – a non-letter-like modifier symbol;
So – a symbol of other type;
Zs – a space character (of non-zero width);
Zl – U+2028 LINE SEPARATOR only;
Zp – U+2029 PARAGRAPH SEPARATOR only;
C – the union of Cc, Cf, Cs, Co, Cn;
L – the union of Lu, Ll, Lt, Lm, Lo;
M – the union of Mn, Mc, Me;
N – the union of Nd, Nl, No;
P – the union of Pc, Pd, Ps, Pe, Pi, Pf, Po;
S – the union of Sm, Sc, Sk, So;
Z – the union of Zs, Zl, Zp.
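As an illustration of how each code point falls into exactly one two-letter category, and how the one-letter classes are unions of these, here is a small Python sketch. It uses Python's standard `unicodedata` module rather than stringi, so its Unicode version may differ from ICU's; the function names and the sample string are made up for this example.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count the code points of `text` by two-letter General Category."""
    return Counter(unicodedata.category(ch) for ch in text)


def major_class_counts(text):
    """Count code points by one-letter class (first letter of the category)."""
    return Counter(unicodedata.category(ch)[0] for ch in text)


# U+01C5 is a titlecase digraph (Lt); the rest are spaces, symbols and digits.
sample = "\u01C5 = 42 \u20AC"

print(category_counts(sample))
print(major_class_counts(sample))
```

Because every code point has exactly one category, the counts in either tally always sum to the number of code points in the input.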
Binary Property identifiers are matched case-insensitively and are slightly normalized. A character may have many Binary Properties at the same time.
Here is the complete list of supported Binary Properties:
ALPHABETIC – an alphabetic character;
ASCII_HEX_DIGIT – a character matching the [0-9A-Fa-f] regex;
BIDI_CONTROL – a format control character that has a specific function in the Bidi (bidirectional text) Algorithm;
BIDI_MIRRORED – a character that may change display in right-to-left text;
DASH – a kind of dash character;
DEFAULT_IGNORABLE_CODE_POINT – a character that is ignorable in most text processing activities, e.g. <2060..206F, FFF0..FFFB, E0000..E0FFF>;
DEPRECATED – a deprecated character according to the current Unicode standard (the usage of deprecated characters is strongly discouraged);
DIACRITIC – a character that linguistically modifies the meaning of another character to which it applies;
EXTENDER – a character that extends the value or shape of a preceding alphabetic character, e.g. a length or iteration mark;
FULL_COMPOSITION_EXCLUSION;
GRAPHEME_BASE;
GRAPHEME_EXTEND;
GRAPHEME_LINK;
HEX_DIGIT – a character commonly used for hexadecimal numbers, cf. also ASCII_HEX_DIGIT;
HYPHEN – a dash used to mark connections between pieces of words, plus the Katakana middle dot;
ID_CONTINUE – a character that can continue an identifier, ID_START+Mn+Mc+Nd+Pc;
ID_START – a character that can start an identifier, Lu+Ll+Lt+Lm+Lo+Nl;
IDEOGRAPHIC – a CJKV (Chinese-Japanese-Korean-Vietnamese) ideograph;
IDS_BINARY_OPERATOR;
IDS_TRINARY_OPERATOR;
JOIN_CONTROL;
LOGICAL_ORDER_EXCEPTION;
LOWERCASE;
MATH;
NONCHARACTER_CODE_POINT;
QUOTATION_MARK;
RADICAL;
SOFT_DOTTED – a character with a “soft dot”, like i or j, such that an accent placed on this character causes the dot to disappear;
TERMINAL_PUNCTUATION – a punctuation character that generally marks the end of textual units;
UNIFIED_IDEOGRAPH;
UPPERCASE;
WHITE_SPACE – a space character or TAB or CR or LF, but not ZWSP or ZWNBSP;
XID_CONTINUE;
XID_START;
CASE_SENSITIVE;
S_TERM;
VARIATION_SELECTOR;
NFD_INERT;
NFKD_INERT;
NFC_INERT;
NFKC_INERT;
SEGMENT_STARTER;
PATTERN_SYNTAX;
PATTERN_WHITE_SPACE;
POSIX_ALNUM;
POSIX_BLANK;
POSIX_GRAPH;
POSIX_PRINT;
POSIX_XDIGIT;
CASED;
CASE_IGNORABLE;
CHANGES_WHEN_LOWERCASED;
CHANGES_WHEN_UPPERCASED;
CHANGES_WHEN_TITLECASED;
CHANGES_WHEN_CASEFOLDED;
CHANGES_WHEN_CASEMAPPED;
CHANGES_WHEN_NFKC_CASEFOLDED.
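Some entries in the list above are given as formulas over General Categories or as a regex. The following Python sketch spells out three of them: ID_START, ID_CONTINUE and ASCII_HEX_DIGIT. The function names are hypothetical and the code is not stringi's or ICU's implementation. The two identifier functions follow only the category formulas quoted in the list, so they should be read as approximations that may differ from the full Unicode properties for some code points.

```python
import re
import unicodedata

# Category formulas as quoted in the list above.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch):
    """Approximate ID_START as Lu+Ll+Lt+Lm+Lo+Nl."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch):
    """Approximate ID_CONTINUE as ID_START+Mn+Mc+Nd+Pc."""
    return approx_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_ascii_hex_digit(ch):
    """ASCII_HEX_DIGIT: a single character matching [0-9A-Fa-f]."""
    return re.fullmatch(r"[0-9A-Fa-f]", ch) is not None


for ch in ["a", "1", "_", "-"]:
    print(repr(ch), approx_id_start(ch), approx_id_continue(ch), is_ascii_hex_digit(ch))
```

A digit or an underscore can continue an identifier but cannot start one under these formulas, while a hyphen (Pd) can do neither.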
The Unicode Character Database – Unicode Standard Annex #44, http://www.unicode.org/reports/tr44/
Other search_charclass: stri_count_charclass; stri_detect_charclass; stri_extract_all_charclass, stri_extract_first_charclass, stri_extract_last_charclass; stri_locate_all_charclass, stri_locate_first_charclass, stri_locate_last_charclass; stri_replace_all_charclass, stri_replace_first_charclass, stri_replace_last_charclass; stri_split_charclass; stri_trim, stri_trim_both, stri_trim_left, stri_trim_right; stringi-search
Other stringi_general_topics: stringi-arguments; stringi-encoding; stringi-locale; stringi-package; stringi-search-fixed; stringi-search-regex; stringi-search