Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
x: object to be converted.
multiple: logical. Should the conversion be to a single character string or to multiple individual characters?
allow_surrogate_pairs: logical. Should interpretation of surrogate pairs be attempted? (See ‘Details’.) Only supported for multiple = FALSE.
These functions work in any locale, including on platforms that do not otherwise support multi-byte character sets.
Unicode defines a name and a number for each of the glyphs it
encompasses. The numbers are called code points; since RFC 3629
they run from 0 to 0x10FFFF (with about 12% being
assigned by version 10.0 of the Unicode standard).
intToUtf8 does not by default handle surrogate pairs: inputs in
the surrogate ranges are mapped to NA. They might occur if a
UTF-16 byte stream has been read as 2-byte integers (in the correct
byte order), in which case allow_surrogate_pairs = TRUE will
try to interpret them (with unmatched surrogate values still treated
as NA).
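The arithmetic behind a surrogate pair can be sketched as follows. This is only an illustration of the standard UTF-16 decoding formula, not a description of how intToUtf8 is implemented; combine_surrogates is a hypothetical helper that is not part of R.

```r
# Hypothetical helper (not part of base R): combine a high and a low
# UTF-16 surrogate into a single Unicode code point.
combine_surrogates <- function(hi, lo) {
  stopifnot(hi >= 0xD800, hi <= 0xDBFF,   # high (leading) surrogate range
            lo >= 0xDC00, lo <= 0xDFFF)   # low (trailing) surrogate range
  0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
}

cp <- combine_surrogates(0xD801, 0xDC37)
sprintf("U+%X", as.integer(cp))          # "U+10437"

# By default the surrogate values are invalid input ...
default_result <- intToUtf8(c(0xD801, 0xDC37))
is.na(default_result)                    # TRUE

# ... but with allow_surrogate_pairs = TRUE they are combined.
paired <- intToUtf8(c(0xD801, 0xDC37), allow_surrogate_pairs = TRUE)
utf8ToInt(paired) == cp                  # TRUE
```

The pair 0xD801, 0xDC37 is the same one used in the examples for U+10437.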
utf8ToInt converts a length-one character string encoded in
UTF-8 to an integer vector of Unicode code points.
intToUtf8 converts a numeric vector of Unicode code points
either (default) to a single character string or a character vector of
single characters. Non-integral numeric values are truncated to
integers. For output to a single character string 0 is
silently omitted: otherwise 0 is mapped to "". The
Encoding of a non-NA return value is declared as "UTF-8".
Invalid and NA inputs are mapped to NA output.
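The return values described above can be checked with a short sketch. The variable names are illustrative only, and the commented results assume a recent version of R.

```r
# Round trip: string -> code points -> string -> code points
cp <- utf8ToInt("bi\u00dfchen")
cp                                        # 98 105 223 99 104 101 110
identical(utf8ToInt(intToUtf8(cp)), cp)   # TRUE

# One string (default) versus a vector of single characters
one  <- intToUtf8(c(72, 105))                    # "Hi"
many <- intToUtf8(c(72, 105), multiple = TRUE)   # "H" "i"

# 0 is dropped for a single string, but becomes "" with multiple = TRUE
zero_one  <- intToUtf8(c(72, 0, 105))                    # "Hi"
zero_many <- intToUtf8(c(72, 0, 105), multiple = TRUE)   # "H" "" "i"

# Non-integral values are truncated, so 72.9 is treated as 72
truncated <- intToUtf8(72.9)              # "H"

# NA input gives NA output
na_out <- intToUtf8(NA)
is.na(na_out)                             # TRUE

# A non-ASCII result is marked as UTF-8
enc <- Encoding(intToUtf8(0x3B2))         # "UTF-8"
```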
Which code points are regarded as valid has changed over the lifetime
of UTF-8. Originally all 32-bit unsigned integers were potentially
valid and could be converted to up to 6 bytes in UTF-8. Since 2003 it
has been stated that there will never be valid code points larger than
0x10FFFF, and so valid UTF-8 encodings are never more than 4 bytes.
The code points in the surrogate-pair range 0xD800 to
0xDFFF are prohibited in UTF-8 and so are regarded as invalid
by utf8ToInt and by default by intToUtf8.
The position of ‘noncharacters’ (notably 0xFFFE and
0xFFFF) was clarified by ‘Corrigendum 9’ in 2013. These
are valid but will never be given an official interpretation. (In some
earlier versions of R
utf8ToInt treated them as invalid.)
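As a rough illustration of these validity rules, the sketch below counts the UTF-8 bytes produced for a few code points and checks some invalid inputs. The chosen code points and variable names are arbitrary, and the commented results assume a recent version of R.

```r
# Byte length of the UTF-8 encoding for code points of increasing size
cps <- c(0x41, 0x3B2, 0x20AC, 0x10437)
n_bytes <- vapply(cps,
                  function(cp) length(charToRaw(intToUtf8(cp))),
                  integer(1))
n_bytes                          # 1 2 3 4

# Code points above 0x10FFFF and lone surrogates are invalid
too_large <- intToUtf8(0x110000)
surrogate <- intToUtf8(0xD800)
is.na(c(too_large, surrogate))   # TRUE TRUE

# A noncharacter such as 0xFFFF is described above as valid,
# so this is not expected to be NA
nonchar <- intToUtf8(0xFFFF)
is.na(nonchar)
```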
https://tools.ietf.org/html/rfc3629, the current standard for UTF-8.
http://www.unicode.org/versions/corrigendum9.html for non-characters.
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta

utf8ToInt("bi\u00dfchen")
utf8ToInt("\xfa\xb4\xbf\xbf\x9f")

## A valid UTF-16 surrogate pair (for U+10437)
x <- c(0xD801, 0xDC37)
intToUtf8(x)
intToUtf8(x, TRUE)
(xx <- intToUtf8(x, , TRUE)) # will only display in some locales and fonts
charToRaw(xx)

## Not run:
## An example of how surrogate pairs might occur
x <- "\U10437"
charToRaw(x)
foo <- tempfile()
writeLines(x, file(foo, encoding = "UTF-16LE"))
## next two are OS-specific, but are mandated by POSIX
system(paste("od -x", foo)) # 2-byte units, correct on little-endian platform
system(paste("od -t x1", foo)) # single bytes as hex
y <- readBin(foo, "integer", 2, 2, FALSE, endian = "little")
sprintf("%X", y)
intToUtf8(y, , TRUE)
## End(Not run)