u_char_basics: Unicode Character Objects

u_char_basics (R Documentation)

Unicode Character Objects

Description

Data structures and basic methods for Unicode character data.

Usage

as.u_char(x)
as.u_char_range(x)
as.u_char_seq(x, sep = NA_character_)

Arguments

x

R objects coercible to the respective Unicode character data types; see Details.

sep

a character string giving the separator between the hex codes in the input strings; see Details.

Details

Package Unicode provides three basic classes for representing Unicode characters: u_char for vectors of Unicode characters, u_char_range for vectors of Unicode character ranges, and u_char_seq for vectors of Unicode character sequences. Objects from these classes are created via the respective coercion functions.

as.u_char coerces integers or hex strings (with or without a leading ‘0x’ or the ‘U+’ prefix typically used for Unicode characters) to the corresponding code points. It can also handle Unicode character ranges, flattening them out into the corresponding vector of Unicode characters. To “coerce” a UTF-8 encoded R character string to the corresponding Unicode character object, first obtain its integer code points via utf8ToInt and then coerce the result.
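The accepted input forms can be illustrated with a rough base R sketch. The helper hex_to_code_point below is hypothetical (it is not part of package Unicode) and only mimics the prefix handling described above; the calls to as.u_char are guarded so that the block also runs where the package is not installed.

```r
# Hypothetical helper (not part of package Unicode): strip an optional
# "0x" or "U+" prefix and parse the remainder as a hexadecimal code point.
hex_to_code_point <- function(x) {
  strtoi(sub("^(0[xX]|U\\+)", "", x), base = 16L)
}

# Three spellings of the same code point.
code_points <- hex_to_code_point(c("U+00E4", "0x00E4", "00E4"))
code_points

# Integer code points of a UTF-8 string: the input to pass on to as.u_char().
from_string <- utf8ToInt("\u00e4")
from_string

# With package Unicode available, both routes should describe the same character.
if (requireNamespace("Unicode", quietly = TRUE)) {
  print(Unicode::as.u_char(c("U+00E4", "0x00E4", "00E4")))
  print(Unicode::as.u_char(from_string))
}
```

The sketch only covers parsing; the u_char class itself additionally carries the methods listed further down.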

as.u_char_range coerces character strings giving either a single Unicode character or a Unicode range expression, that is, the hex codes of two Unicode characters joined by ‘..’ (currently hard-wired). It can also handle u_char objects, coercing them to ranges of single code points.
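As a minimal sketch of what flattening a range expression amounts to, the following base R code splits on the ‘..’ separator and enumerates the code points in between. The helper expand_range is hypothetical and written only for illustration; the guarded call uses the package functions on the same input.

```r
# Hypothetical helper (not part of package Unicode): expand "lo..hi", or a
# single hex code, into the integer code points it covers.
expand_range <- function(x) {
  bounds <- strtoi(strsplit(x, "..", fixed = TRUE)[[1L]], base = 16L)
  # For a single code point, first and last bound coincide.
  seq.int(bounds[1L], bounds[length(bounds)])
}

expanded <- unlist(lapply(c("00AA..00AC", "01CC"), expand_range))
expanded

# With package Unicode available, flattening the ranges via as.u_char()
# should give the same four code points.
if (requireNamespace("Unicode", quietly = TRUE)) {
  r <- Unicode::as.u_char_range(c("00AA..00AC", "01CC"))
  print(Unicode::as.u_char(r))
}
```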

as.u_char_seq coerces character strings containing the hex codes of Unicode characters joined by a non-empty sep. The default sep corresponds to ‘,’ if the strings are enclosed in angle brackets, and ‘ ’ (a single space) otherwise. If sep is the empty string or has length zero, the character strings are used as is: they are re-encoded in UTF-8 if necessary and mapped to the corresponding Unicode character sequences using utf8ToInt. as.u_char_seq can also handle Unicode character ranges (giving the corresponding flattened-out Unicode character sequences), and lists of objects coercible to Unicode characters via as.u_char.
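The role of sep can be approximated in base R as follows. The function parse_seq is a hypothetical stand-in that follows the rules described above for a single string; how the package itself treats the surrounding angle brackets is an assumption here, and the real implementation may differ in details.

```r
# Hypothetical helper (not part of package Unicode): parse one string into
# integer code points, following the sep rules described above.
parse_seq <- function(x, sep = NA_character_) {
  # Empty or zero-length sep: treat the string as text, not as hex codes.
  if (length(sep) == 0L || identical(sep, "")) {
    return(utf8ToInt(enc2utf8(x)))
  }
  # Default sep: "," for strings in angle brackets, " " otherwise.
  if (is.na(sep)) {
    sep <- if (grepl("^<.*>$", x)) "," else " "
  }
  # Assumption: surrounding angle brackets are dropped before splitting.
  body <- gsub("^<|>$", "", x)
  strtoi(strsplit(body, sep, fixed = TRUE)[[1L]], base = 16L)
}

angled <- parse_seq("<0041,030A>")
spaced <- parse_seq("0041 030A")
text   <- parse_seq("Austria", "")
angled
text

# With package Unicode available, compare with the package's own coercion.
if (requireNamespace("Unicode", quietly = TRUE)) {
  print(Unicode::as.u_char_seq("<0041,030A>"))
  print(Unicode::as.u_char_seq("Austria", ""))
}
```

Passing sep = "" is what distinguishes "this string is the text itself" from "this string lists hex codes", which is why the Examples below use it for "Austria" and "Trantor".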

All classes currently have as.character, as.data.frame, c, format, print, rep, unique and [ subscript methods. More methods will be added eventually.

Value

For as.u_char, a u_char object giving a vector of Unicode characters.

For as.u_char_range, a u_char_range object giving a vector of Unicode character ranges.

For as.u_char_seq, a u_char_seq object giving a vector of Unicode character sequences.

References

Unicode Character Database, https://www.unicode.org/ucd/
Unicode (Wikipedia), https://en.wikipedia.org/wiki/Unicode

Examples

x <- as.u_char_range(c("00AA..00AC", "01CC"))
x
## Corresponding Unicode character sequence object:
as.u_char_seq(x)
## Corresponding Unicode character object with all code points:
as.u_char(x)
## Inspect all Unicode characters in the range:
u_char_inspect(x)

## Turning R character strings into the respective Unicode character
## sequences:
as.u_char_seq(c("Austria", "Trantor"), "")
## which can then be subscripted "as usual", e.g.:
x <- as.u_char_seq(c("Austria", "Trantor"), "")[[1L]][c(3L, 5L)]
x
## To reassemble the character strings:
intToUtf8(x)

Unicode documentation built on May 29, 2024, 2:36 a.m.