Convert Character Vector between Encodings
This uses system facilities to convert a character vector between encodings: the ‘i’ stands for ‘internationalization’.
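x: A character vector, or an object to be converted to a character vector by as.character, or a list with NULL and raw elements as returned by iconv(toRaw = TRUE).
from: A character string describing the current encoding.
to: A character string describing the target encoding.
sub: character string. If not NA, it is used to replace any non-convertible bytes in the input. If "byte", the replacement is "<xx>" with the hex code of the byte.
mark: logical, for expert use. Should encodings be marked?
toRaw: logical. Should a list of raw vectors be returned rather than a character vector?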
The names of encodings and which ones are available are platform-dependent. All R platforms support "" (for the encoding of the current locale), "latin1" and "UTF-8". Generally case is ignored when specifying an encoding.
On most platforms iconvlist provides an alphabetical list of the supported encodings. On others, the information is on the man page for iconv(5) or elsewhere in the man pages (but beware that the system command iconv may not support the same set of encodings as the C functions R calls). Unfortunately, the names are rarely supported across all platforms.
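Because names vary by platform, it can help to check a name before relying on it. The following is a minimal sketch, not part of base R: the helper name encoding_available is hypothetical, and it assumes iconvlist() works on the current platform (it reports FALSE for every name otherwise).

```r
# Hypothetical helper (not part of base R): check whether an encoding
# name appears in iconvlist(), ignoring case.
encoding_available <- function(name) {
  # Fall back to an empty list if iconvlist() is unavailable here.
  known <- tryCatch(iconvlist(), error = function(e) character())
  toupper(name) %in% toupper(known)
}

encoding_available(c("UTF-8", "latin1", "no-such-encoding"))
```

A name missing from the list may still be accepted by iconv on some platforms, so a FALSE here is a hint rather than proof that a conversion will fail.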
Elements of x which cannot be converted (perhaps because they are invalid or because they cannot be represented in the target encoding) will be returned as NA unless sub is specified.
Most versions of iconv will allow transliteration by appending //TRANSLIT to the to encoding: see the examples.
Encoding "ASCII" is accepted, and on most systems "C" and "POSIX" are synonyms for ASCII.
Any encoding bits (see Encoding) on elements of x are ignored: they will always be translated as if from encoding from even if declared otherwise. enc2native and enc2utf8 provide alternatives which do take declared encodings into account.
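The difference between the declared encoding and the from argument can be illustrated with a short sketch. It reuses the Latin-1 string from the examples; the behaviour with a wrong from is platform-dependent, so that call is wrapped in try() and no particular result is claimed.

```r
x <- "fa\xE7ile"          # the bytes are Latin-1
Encoding(x) <- "latin1"   # declare the encoding

# enc2utf8() takes the declared encoding into account.
y <- enc2utf8(x)
charToRaw(y)

# iconv() looks only at 'from'; with the correct 'from' the result matches.
identical(iconv(x, "latin1", "UTF-8"), y)

# With a wrong 'from', the declaration on x does not rescue the conversion.
# The outcome depends on the iconv implementation.
try(iconv(x, "UTF-8", "latin1"))
```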
Note that implementations of iconv typically do not do much validity checking and will often mis-convert inputs which are invalid in encoding from.
If toRaw = FALSE (the default), the value is a character vector of the same length and the same attributes as x (after conversion to a character vector).
If mark = TRUE (the default) the elements of the result have a declared encoding if to is "latin1" or "UTF-8", or if to = "" and the current locale's encoding is detected as Latin-1 (or its superset CP1252 on Windows) or UTF-8.
If toRaw = TRUE, the value is a list of the same length and the same attributes as x whose elements are either NULL (if conversion fails) or a raw vector.
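As a rough illustration of the mark and toRaw arguments, the sketch below inspects the declared encoding of the result and the shape of the list returned with toRaw = TRUE. Whether the non-ASCII element fails to convert to ASCII depends on the iconv implementation, so the code only tests which elements are NULL rather than assuming an answer.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# mark = TRUE (the default): the UTF-8 result has a declared encoding.
Encoding(iconv(x, "latin1", "UTF-8"))
# mark = FALSE: the same conversion, but the result is left unmarked.
Encoding(iconv(x, "latin1", "UTF-8", mark = FALSE))

# toRaw = TRUE: one raw vector (or NULL on failure) per element of the input.
res <- iconv(c("abc", x), "latin1", "ASCII", toRaw = TRUE)
converted <- !vapply(res, is.null, logical(1))
converted
```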
For iconvlist(), a character vector (typically of a few hundred elements) of known encoding names.
There are three main implementations of iconv in use. Linux's C runtime glibc contains one. Several platforms supply GNU libiconv, including macOS, FreeBSD and Cygwin, in some cases with additional encodings. On Windows we use a version of Yukihiro Nakadaira's win_iconv, which is based on Windows' codepages. (We have added many encoding names for compatibility with other systems.) All three have iconvlist, ignore case in encoding names and support //TRANSLIT (but with different results, and for win_iconv currently a ‘best fit’ strategy is used except for to = "ASCII").
Most commercial Unixes contain an implementation of iconv but none we have encountered have supported the encoding names we need: the “R Installation and Administration Manual” recommends installing GNU libiconv on Solaris and AIX, for example.
There are other implementations, e.g. NetBSD has used one from the Citrus project (which does not support //TRANSLIT) and there is an older FreeBSD port (libiconv is usually used there): it has not been reported whether or not these work with R.
Note that you cannot rely on invalid inputs being detected, especially for to = "ASCII" where some implementations allow 8-bit characters and pass them through unchanged or with transliteration.
Some of the implementations have interesting extra encodings: for example GNU libiconv allows to = "C99" to use \uxxxx escapes for non-ASCII characters.
Byte Order Marks
most commonly known as ‘BOMs’.
Encodings using character units which are more than one byte in size can be written on a file in either big-endian or little-endian order: this applies most commonly to UCS-2, UTF-16 and UTF-32/UCS-4 encodings. Some systems will write the Unicode character U+FEFF at the beginning of a file in these encodings and perhaps also in UTF-8. In that usage the character is known as a BOM, and should be handled during input (see the ‘Encodings’ section under connection: re-encoded connections have some special handling of BOMs). The rest of this section applies when this has not been done so x starts with a BOM.
Implementations will generally interpret a BOM for from given as one of "UCS-2", "UTF-16" and "UTF-32". Implementations differ in how they treat BOMs in x in other from encodings: they may be discarded, returned as character U+FEFF or regarded as invalid.
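Since implementations differ, one option is to remove a BOM explicitly before conversion. The sketch below handles only the UTF-8 case (bytes EF BB BF) on a raw vector; the helper name strip_utf8_bom is hypothetical and not part of base R.

```r
# Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) from a
# raw vector, leaving any other input unchanged.
strip_utf8_bom <- function(bytes) {
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes[-(1:3)]
  } else {
    bytes
  }
}

bytes <- c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("abc"))
clean <- strip_utf8_bom(bytes)

# iconv() accepts a list of raw vectors as its input.
iconv(list(clean), "UTF-8", "latin1")
```

The same idea would extend to UTF-16 and UTF-32 BOMs, where the byte pattern also indicates the byte order, but that is left out of this sketch.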
The only reasonably portable name for the ISO 8859-15 encoding, commonly known as ‘Latin 9’, is "latin-9": some platforms support "latin9" but GNU libiconv does not.
"utf8" is converted to "UTF-8" for from and to (as from R 3.0.3) by iconv, but not for e.g. fileEncoding arguments. "macintosh" is the official (and most widely supported) name for ‘Mac Roman’.
## In principle, as not all systems have iconvlist
try(utils::head(iconvlist(), n = 50))

## Not run:
## convert from Latin-2 to UTF-8: two of the glibc iconv variants.
iconv(x, "ISO_8859-2", "UTF-8")
iconv(x, "LATIN2", "UTF-8")

## End(Not run)

## Both x below are in latin1 and will only display correctly in a
## locale that can represent and display latin1.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
xx

iconv(x, "latin1", "ASCII")          # NA
iconv(x, "latin1", "ASCII", "?")     # "fa?ile"
iconv(x, "latin1", "ASCII", "")      # "faile"
iconv(x, "latin1", "ASCII", "byte")  # "fa<e7>ile"

## Extracts from old R help files (they are nowadays in UTF-8)
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
try(iconv(x, "latin1", "ASCII//TRANSLIT"))  # platform-dependent
iconv(x, "latin1", "ASCII", sub = "byte")

## and for Windows' 'Unicode'
str(xx <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE))
iconv(xx, "UTF-16LE", "UTF-8")