Read or set the declared encodings for a character vector.
Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)
x: A character vector.
value: A character vector of positive length.
Character strings in R can be declared to be encoded in
"latin1" or "UTF-8" or as "bytes". These
declarations can be read by
Encoding, which will return a
character vector of values "latin1", "UTF-8", "bytes" or
"unknown", or set, when value is
recycled as needed and other values are silently treated as
"unknown". ASCII strings will never be marked with a declared
encoding, since their representation is the same in all supported
encodings. Strings marked as
"bytes" are intended to be
non-ASCII strings which should be manipulated as bytes, and never
converted to a character encoding (so writing them to a text file is
supported only by
writeLines(useBytes = TRUE)).
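As a small illustration of reading and setting declarations, the sketch below marks a non-ASCII string, shows that an ASCII string stays unmarked, and shows that an unrecognised value falls back to "unknown". The variable names (s_marked, s_ascii, s_other, s_vec) are made up for this example.

```r
# A string holding the Latin-1 byte E7 (c with cedilla); names are illustrative.
s_marked <- "fa\xE7ile"
Encoding(s_marked) <- "latin1"
Encoding(s_marked)                # "latin1"

# ASCII strings are never marked, whatever value is supplied.
s_ascii <- "plain"
Encoding(s_ascii) <- "UTF-8"
Encoding(s_ascii)                 # "unknown"

# Values other than the recognised ones are silently treated as "unknown".
s_other <- "fa\xE7ile"
Encoding(s_other) <- "no-such-encoding"
Encoding(s_other)                 # "unknown"

# A length-one value is recycled over the whole vector.
s_vec <- c("fa\xE7ile", "na\xEFve")
Encoding(s_vec) <- "latin1"
Encoding(s_vec)                   # "latin1" "latin1"
```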
enc2native and enc2utf8 convert elements of character
vectors to the native encoding or UTF-8 respectively, taking any
marked encoding into account. They are primitive functions,
designed to do minimal copying.
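The conversion can be observed at the byte level. The following is a minimal sketch, assuming a string declared as "latin1"; the names x_l1 and x_u8 are illustrative only. The result of converting to the native encoding depends on the session locale, so no output is shown for that call.

```r
# Declare a Latin-1 string, then convert it to UTF-8.
x_l1 <- "fa\xE7ile"
Encoding(x_l1) <- "latin1"
x_u8 <- enc2utf8(x_l1)

Encoding(x_u8)        # "UTF-8"
charToRaw(x_l1)       # 66 61 e7 69 6c 65
charToRaw(x_u8)       # 66 61 c3 a7 69 6c 65

# Comparison takes the declared encodings into account.
x_l1 == x_u8          # TRUE

# Conversion to the native encoding; what comes back depends on the locale.
x_nat <- enc2native(x_u8)
```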
There are other ways for character strings to acquire a declared
encoding apart from explicitly setting it (and these have changed as
R has evolved). The parser marks strings containing \u or
\U escapes. Functions
scan, read.table, readLines, and
parse have an
encoding argument that is used to
declare encodings,
iconv declares encodings from its
to argument, and console input in suitable locales is also
declared.
intToUtf8 declares its output as
"UTF-8", and output text connections (see
textConnection) are marked if running in a
suitable locale. Under some circumstances (see its help page)
source(encoding=) will mark encodings of character
strings it outputs.
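Three of these routes are easy to check interactively. The sketch below is illustrative rather than exhaustive, and the names from_escape, from_int, src and from_iconv are assumptions of the example.

```r
# 1. The parser marks a string containing a \u escape.
from_escape <- "fa\u00e7ile"
Encoding(from_escape)     # "UTF-8"

# 2. intToUtf8 declares its output as UTF-8 (231 is U+00E7).
from_int <- intToUtf8(c(102, 97, 231))
Encoding(from_int)        # "UTF-8"

# 3. iconv declares the encoding named in its 'to' argument.
src <- "fa\xE7ile"
from_iconv <- iconv(src, from = "latin1", to = "UTF-8")
Encoding(from_iconv)      # "UTF-8"
```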
Most character manipulation functions will set the encoding on output
strings if it was declared on the corresponding input. These include
chartr, strsplit(useBytes = FALSE),
tolower and toupper as well as
sub(useBytes = FALSE) and
gsub(useBytes = FALSE). Note that such functions do not preserve the
encoding, but if they know the input encoding and that the string has
been successfully re-encoded (to the current encoding or UTF-8), they
mark the output.
substr does preserve the encoding, and
chartr, tolower and toupper
preserve UTF-8 encoding on systems with Unicode wide characters. With
their fixed and perl options, strsplit, sub and
gsub will give a marked UTF-8 result if
any of the inputs are UTF-8.
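A short sketch of this behaviour, assuming a UTF-8 marked input built with a \u escape (the names u_in, sub_part, sub_ascii and fixed_out are illustrative):

```r
u_in <- "fa\u00e7ile"                 # marked as UTF-8 by the parser

# substr keeps the declared encoding when the result is non-ASCII.
sub_part <- substr(u_in, 1, 3)
Encoding(sub_part)                    # "UTF-8"

# A pure ASCII result is never marked.
sub_ascii <- substr(u_in, 1, 2)
Encoding(sub_ascii)                   # "unknown"

# With fixed = TRUE, a UTF-8 input gives a marked UTF-8 result.
fixed_out <- gsub("f", "F", u_in, fixed = TRUE)
Encoding(fixed_out)                   # "UTF-8"
```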
paste and sprintf return elements marked
as bytes if any of the corresponding inputs is marked as bytes, and
otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.
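The precedence of "bytes" over "UTF-8" can be checked with a small sketch. The names p_utf8, p_bytes, pasted_utf8, pasted_bytes and formatted are assumptions of the example.

```r
p_utf8 <- "fa\u00e7ile"               # marked as UTF-8
p_bytes <- p_utf8
Encoding(p_bytes) <- "bytes"          # same bytes, now marked as bytes

# One UTF-8 input and one ASCII input: the result is marked as UTF-8.
pasted_utf8 <- paste(p_utf8, "x")
Encoding(pasted_utf8)                 # "UTF-8"

# Any input marked as bytes makes the result bytes.
pasted_bytes <- paste(p_utf8, p_bytes)
Encoding(pasted_bytes)                # "bytes"

# sprintf follows the same rule for a UTF-8 input.
formatted <- sprintf("%s!", p_utf8)
Encoding(formatted)                   # "UTF-8"
```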
match, pmatch, charmatch, duplicated and
unique all match in UTF-8
if any of the elements are marked as UTF-8.
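In practice this means that the same text stored under two different declared encodings is treated as equal. A minimal sketch, with illustrative names m_l1, m_u8 and both:

```r
# The same word, once declared as Latin-1 and once converted to UTF-8.
m_l1 <- "fa\xE7ile"
Encoding(m_l1) <- "latin1"
m_u8 <- iconv(m_l1, from = "latin1", to = "UTF-8")

both <- c(m_l1, m_u8)
match(m_l1, m_u8)         # 1
duplicated(both)          # FALSE TRUE
length(unique(both))      # 1
```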
There is some ambiguity as to what is meant by a ‘Latin-1’ locale, since some OSes (notably Windows) make use of character positions undefined (or used for control characters) in the ISO 8859-1 character set. How such characters are interpreted is system-dependent but as from R 3.5.0 they are if possible interpreted as per Windows codepage 1252 (which Microsoft calls ‘Windows Latin 1 (ANSI)’) when converting to e.g. UTF-8.
A character vector.
For enc2utf8, encodings are always marked; for
enc2native, they are marked in UTF-8 and Latin-1 locales.
## x is intended to be in latin1
x. <- x <- "fa\xE7ile"
Encoding(x.) # "unknown" (UTF-8 loc.) | "latin1" (8859-1/CP-1252 loc.) | ....
Encoding(x) <- "latin1"
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x., x, xx))
c(x, xx)
xb <- xx; Encoding(xb) <- "bytes"
xb # will be encoded in hex
cat("x = ", x, ", xx = ", xx, ", xb = ", xb, "\n", sep = "")
(Ex <- Encoding(c(x.,x,xx,xb)))
stopifnot(identical(Ex, c(Encoding(x.), Encoding(x),
                          Encoding(xx), Encoding(xb))))