charClass: Character Classification

charClass R Documentation

Character Classification

Description

An interface to the (C99) wide character classification functions in use on the current platform.

Usage

charClass(x, class)

Arguments

x

Either a UTF-8-encoded length-1 character vector or an integer vector of Unicode points (or a vector coercible to integer).

class

A character string, one of those given in the ‘Details’ section.

Details

The classification into character classes is platform-dependent. The classes are determined by internal tables on Windows and (optionally but by default) on macOS and AIX.

The character classes are interpreted as follows:

"alnum"

Alphabetic or numeric.

"alpha"

Alphabetic.

"blank"

Space or tab.

"cntrl"

Control characters.

"digit"

Digits 0-9.

"graph"

Graphical characters (printable characters except whitespace).

"lower"

Lower-case alphabetic.

"print"

Printable characters.

"punct"

Punctuation characters. Some platforms treat all non-alphanumeric graphical characters as punctuation.

"space"

Whitespace, including tabs, form and line feeds and carriage returns. Some OSes include non-breaking spaces, some exclude them.

"upper"

Upper-case alphabetic.

"xdigit"

Hexadecimal character, one of 0-9, A-F or a-f.

Alphabetic characters contain all lower- and upper-case ones and some others (for example, those in ‘title case’).

Whether a character is printable is used to decide whether to escape it when printing – see the help for print.default.

If x is a character string it should either be ASCII or declared as UTF-8 – see Encoding.

charClass was added in R 4.1.0. A less direct way to examine character classes which also worked in earlier versions is to use something like grepl("[[:print:]]", intToUtf8(x)) – however, the regular-expression code might not use the same classification functions as printing and on macOS used not to.
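
As a rough sketch of that comparison, the following checks the two approaches against each other on a few ASCII code points, where the classification should not be in doubt. The variable names are illustrative only, and agreement on ASCII says nothing about non-ASCII characters, where the two code paths may differ.

```r
## Illustrative names; ASCII code points only.
cp <- c(65L, 97L, 48L, 32L)                # 'A', 'a', '0', space
chars <- intToUtf8(cp, multiple = TRUE)    # one string per code point

via_regex <- grepl("[[:upper:]]", chars)   # also works before R 4.1.0
via_charClass <- charClass(cp, "upper")    # needs R >= 4.1.0

## Side-by-side view of the two classifications
data.frame(char = sQuote(chars), via_regex, via_charClass)
```

Repeating the comparison over a wider range of code points is one way to look for discrepancies on a given platform.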

Value

A logical vector whose length is the number of characters or integers in x.
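
A minimal sketch of how the result length relates to the input, assuming a string written with a \u escape (which R marks as UTF-8). The names s, res and res2 are illustrative only.

```r
s <- "caf\u00e9"                  # length-1 character vector, 4 characters
Encoding(s)                       # how the string is declared

res <- charClass(s, "alpha")      # one result per character, not per byte
length(res) == nchar(s)

res2 <- charClass(c(48, 65, 97), "digit")  # one result per code point
length(res2)
```

The classification of the accented character itself is left unchecked here, since it is platform-dependent.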

Note

Non-ASCII digits are excluded by the C99 standard from the class "digit": most platforms will have them as alphabetic.
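
One way to see how a given platform treats a non-ASCII digit is to classify it next to its ASCII counterpart. This is only a sketch: the choice of U+0665 (ARABIC-INDIC DIGIT FIVE) and the name tab are illustrative, and the results in the second row depend on the platform.

```r
cp <- c(0x35, 0x0665)    # ASCII '5' and ARABIC-INDIC DIGIT FIVE

tab <- data.frame(
  char  = intToUtf8(cp, multiple = TRUE),
  digit = charClass(cp, "digit"),
  alpha = charClass(cp, "alpha"),
  alnum = charClass(cp, "alnum")
)
tab
```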

It is an assumption that the system's wide character classification functions are coded in Unicode points, but this is known to be true for all recent platforms.

In principle the classification could depend on the locale even on one platform, but this no longer seems to occur in practice.

See Also

Character classes are used in regular expressions.

The OS's man pages for iswctype and wctype.

Examples

x <- c(48:70, 32, 0xa0) # Last is non-breaking space
cl <- c("alnum", "alpha", "blank", "digit", "graph", "punct", "upper", "xdigit")
X <- lapply(cl, function(y) charClass(x,y)); names(X) <- cl
X <- as.data.frame(X); row.names(X) <- sQuote(intToUtf8(x, multiple = TRUE))
X

charClass("ABC123", "alpha")
## Some accented capital Greek characters
(x <- "\u0386\u0388\u0389")
charClass(x, "upper")

## How many printable characters are there? (Around 280,000 in Unicode 13.)
## There are 2^21-1 possible Unicode points (most not yet assigned).
pr <- charClass(1:0x1fffff, "print")
table(pr)