Unicode Character Objects
Data structures and basic methods for Unicode character data.
x: R objects coercible to the respective Unicode character data types, see Details.
sep: a character string.
Package Unicode provides three basic classes for representing
Unicode character data:
u_char for vectors of Unicode characters,
u_char_range for vectors of Unicode character ranges, and
u_char_seq for vectors of Unicode character sequences. Objects
from these classes are created via the respective coercion functions.
as.u_char coerces integers or hex strings (with or without a
leading 0x or the U+ prefix typically used for Unicode characters)
to the corresponding code points. It can also handle Unicode
character ranges, flattening them out into the corresponding vector
of Unicode characters. To “coerce” a UTF-8 encoded R character
string to the corresponding Unicode character object, first obtain
its integer code points via utf8ToInt and then coerce the result.
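The input forms described above can be tried out with a short sketch like the following. It assumes the Unicode package is installed; the variable names are illustrative only, and no printed output is shown because the exact display depends on the package's print methods.

```r
library(Unicode)

# Hex strings without a prefix.
from_hex <- as.u_char(c("0041", "00E4"))

# Hex strings with a leading "0x" or "U+".
from_0x    <- as.u_char("0x0041")
from_uplus <- as.u_char("U+0041")

# Integer code points.
from_int <- as.u_char(c(65L, 228L))

# A UTF-8 encoded R string: obtain the integer code points with
# utf8ToInt() (base R) first, then coerce the result.
from_string <- as.u_char(utf8ToInt("Wien"))

print(from_hex)
print(from_string)
```

Going through utf8ToInt keeps the two meanings of a character string apart: passed directly, a string is read as a hex code, not as text.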
as.u_char_range coerces character strings giving either the hex
code of a single Unicode character or a Unicode range expression,
i.e., the hex codes of two Unicode characters collapsed by ..
(currently hard-wired). It can also handle u_char objects, coercing
them to ranges of single code points.
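As a minimal sketch of the round trip between ranges and characters described above (assuming the Unicode package is installed; variable names are illustrative):

```r
library(Unicode)

# A range expression: two hex codes collapsed by "..".
r <- as.u_char_range("0041..0043")

# Flatten the range into the individual Unicode characters.
chars <- as.u_char(r)

# Coerce the characters back: each becomes a range of a single code point.
singles <- as.u_char_range(chars)

# Flattening the single-code-point ranges recovers the same characters.
back <- as.u_char(singles)

print(r)
print(chars)
print(singles)
```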
as.u_char_seq coerces character strings with the hex codes of
Unicode characters collapsed by a non-empty sep. The default
corresponds to using , if the strings use surrounding angles, and a
space otherwise. If sep is empty or has length zero, the character
strings are used as is, re-encoded in UTF-8 if necessary, and mapped
to the corresponding Unicode character sequences.
as.u_char_seq can also handle Unicode character ranges (giving the
corresponding flattened out Unicode character sequences), or lists
of objects coercible to Unicode characters.
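The role of sep can be illustrated with a small sketch. It assumes the Unicode package is installed, passes sep explicitly rather than relying on the default, and uses "-" purely as a hypothetical separator chosen for illustration.

```r
library(Unicode)

# Hex codes collapsed by an explicit, non-empty separator.
spaced <- as.u_char_seq("0041 0308", sep = " ")
dashed <- as.u_char_seq("0041-0308", sep = "-")  # hypothetical separator

# Empty sep: the string is used as is and mapped to its code points.
literal <- as.u_char_seq("A\u0308", "")

# Each element can be extracted with [[ and turned back into a string.
intToUtf8(spaced[[1L]])
intToUtf8(literal[[1L]])
```

All three calls are intended to describe the same two-character sequence (A followed by a combining diaeresis); only the way the input string is interpreted differs.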
All classes currently have [ subscript methods, among others. More
methods will be added eventually.
For as.u_char, a u_char object giving a vector of Unicode characters.
For as.u_char_range, a u_char_range object giving a vector of
Unicode character ranges.
For as.u_char_seq, a u_char_seq object giving a vector of Unicode
character sequences.
x <- as.u_char_range(c("00AA..00AC", "01CC"))
x
## Corresponding Unicode character sequence object:
as.u_char_seq(x)
## Corresponding Unicode character object with all code points:
as.u_char(x)
## Inspect all Unicode characters in the range:
u_char_inspect(x)
## Turning R character strings into the respective Unicode character
## sequences:
as.u_char_seq(c("Austria", "Trantor"), "")
## which can then be subscripted "as usual", e.g.:
x <- as.u_char_seq(c("Austria", "Trantor"), "")[[1L]][c(3L, 5L)]
x
## To reassemble the character strings:
intToUtf8(x)