This manual page explains how to deal with different character encodings in stringi. In particular you should note that:
Functions in stringi process each string internally in Unicode, which is a superset of all character representation schemes. Even if a string is given in the native encoding, i.e. your platform's default one, it will be converted to Unicode.
Most functions always return UTF-8 encoded strings, regardless of the input encoding.
"Hundreds of encodings have been developed over the years, each for small groups of languages and for special purposes. As a result, the interpretation of text, input, sorting, display, and storage depends on the knowledge of all the different types of character sets and their encodings. Programs have been written to handle either one single encoding at a time and switch between them, or to convert between external and internal encodings."
"Unicode provides a single character set that covers the major languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1 (the most widely used character sets) to make it easier for Unicode to be used in almost all applications and protocols" (see the ICU User Guide).
The Unicode Standard determines the way to map any possible character to a numeric value – a so-called code point. Such code points, however, have to be stored somehow in a computer's memory. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit integer (cf. the ICU FAQ).
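The relationship between code points and encoding forms can be sketched in base R (no stringi functions are used here, and the four sample characters are an arbitrary choice made for illustration):

```r
# Four sample characters: U+0061, U+0105, U+20AC and U+1F600
x <- intToUtf8(c(97L, 261L, 8364L, 128512L), multiple = TRUE)

# Code point of each character
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Number of 8-bit bytes each character occupies in UTF-8
utf8_bytes <- nchar(x, type = "bytes")

# Number of 16-bit code units needed in UTF-16:
# code points above U+FFFF require a surrogate pair
utf16_units <- ifelse(code_points > 0xFFFF, 2L, 1L)

data.frame(code_points, utf8_bytes, utf16_units)
```

In UTF-32 every one of these characters would take exactly one 32-bit unit.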
In most cases, Unicode is a superset of the characters supported by any given codepage.
The UTF-8 encoding is the most natural choice for representing Unicode characters in R. UTF-8 has ASCII as its subset (code points 1–127 are the same in both of them). Code points larger than 127 are represented by multi-byte sequences (from 2 to 4 bytes; not all sequences of bytes are valid UTF-8, cf. stri_enc_isutf8).
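A small sketch of what "not all sequences of bytes are valid UTF-8" means in practice. The byte values below are chosen for illustration only. The check is done with base R's validUTF8 and, if stringi happens to be installed, repeated with stri_enc_isutf8:

```r
# U+0105 is encoded in UTF-8 as the two bytes C4 85
valid     <- rawToChar(as.raw(c(0xC4, 0x85)))
# A lead byte with no continuation byte after it
truncated <- rawToChar(as.raw(0xC4))
# A continuation byte with no lead byte before it
stray     <- rawToChar(as.raw(0x85))

samples <- c(valid, truncated, stray)
base_check <- validUTF8(samples)   # TRUE FALSE FALSE

# Same check with stringi, guarded so the sketch runs without the package
if (requireNamespace("stringi", quietly = TRUE)) {
  stringi_check <- stringi::stri_enc_isutf8(samples)
}
```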
Most of the computations in stringi are performed internally using either the UTF-8 or the UTF-16 encoding (this depends on the type of service you request: some ICU services are designed to work only with UTF-16). Thanks to that choice, with stringi you get the same result on each platform, which is – unfortunately – not the case for base R's functions (it is known, for example, that performing a regular expression search under Linux on some texts may give you a different result from the one obtained under Windows). We really had portability in mind while developing our package!
We have observed that R correctly handles UTF-8 strings regardless of your platform's native encoding (see below). Therefore, we decided that most functions in stringi will output their results in UTF-8 – this speeds up computations on cascading calls to our functions: the strings do not have to be re-encoded each time.
Note that some Unicode characters may have an ambiguous representation. For example, “a with ogonek” (one code point) and “a”+“ogonek” (two code points) are semantically the same. stringi provides functions to normalize character sequences, see stri_enc_nfc for discussion. However, denormalized strings appear very rarely in typical string processing activities.
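The ambiguity can be seen directly in base R. This sketch only constructs the two representations and compares them; it does not call any normalization function:

```r
# "a with ogonek" as a single precomposed code point (U+0105)
precomposed <- intToUtf8(0x105L)
# "a" (U+0061) followed by COMBINING OGONEK (U+0328)
decomposed  <- intToUtf8(c(0x61L, 0x328L))

n_pre <- nchar(precomposed, type = "chars")   # 1 code point
n_dec <- nchar(decomposed,  type = "chars")   # 2 code points

# The strings look alike when printed, but they are not equal
same <- identical(precomposed, decomposed)    # FALSE
```

Normalizing both strings to the same form before comparing them removes this difference.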
You should keep in mind that data in memory are just bytes (small integer values) – an encoding is a way to represent characters with such numbers; it is a semantic "key" to understanding a given byte sequence. For example, in ISO-8859-2 (Central European), the value 177 represents Polish “a with ogonek”, and in ISO-8859-1 (Western European), the same value means the “plus-minus” sign. Thus, a character encoding is a translation scheme: we need to communicate with R somehow, relying on how it represents strings.
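The byte-177 example can be reproduced with base R's iconv. This is a sketch that assumes your platform's iconv knows both encoding names, which is typically the case:

```r
# A single byte with value 177 (0xB1); on its own it has no meaning
b <- rawToChar(as.raw(177))

# Interpret the same byte under two different encodings
as_latin2 <- iconv(b, from = "ISO-8859-2", to = "UTF-8")
as_latin1 <- iconv(b, from = "ISO-8859-1", to = "UTF-8")

utf8ToInt(as_latin2)  # 261, i.e. U+0105 LATIN SMALL LETTER A WITH OGONEK
utf8ToInt(as_latin1)  # 177, i.e. U+00B1 PLUS-MINUS SIGN
```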
Basically, R has a very simple encoding-marking mechanism, see Encoding. There is an implicit assumption that your platform's default (native) encoding is always an 8-bit one and it is a superset of ASCII – stringi checks that when your native encoding is being detected automatically on ICU initialization and each time when you change it manually by calling stri_enc_set.
Character strings in R (internally) can be declared to be in:
ASCII (here, strings consist only of byte codes not greater than 127);
"UTF-8";
"latin1", i.e. ISO-8859-1 (Western European).
Moreover, there are two other cases:
"bytes" – strings should be manipulated as bytes; encoding is not set;
"unknown" (quite misleading name: no explicit encoding mark) – strings are assumed to be in your platform's native (default) encoding.
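The marks listed above can be inspected with base R's Encoding function. In this sketch the sample byte values are arbitrary; note that pure ASCII strings are reported as "unknown" because they carry no explicit mark:

```r
ascii  <- "abc"
utf8   <- intToUtf8(261L)            # U+0105, returned marked as UTF-8
latin1 <- rawToChar(as.raw(0xB1))    # one byte, not yet marked
Encoding(latin1) <- "latin1"         # declare it to be ISO-8859-1
bytes  <- rawToChar(as.raw(0xFF))
Encoding(bytes) <- "bytes"           # declare it to be raw bytes

marks <- c(Encoding(ascii), Encoding(utf8), Encoding(latin1), Encoding(bytes))
marks  # "unknown" "UTF-8" "latin1" "bytes"
```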
Native strings often appear as a result of inputting a string from the keyboard or a file. This makes sense: your operating system works in some encoding and provides R with some data. Each time a stringi function encounters a native string, it assumes that the data should be translated from the default encoding, i.e. the one returned by stri_enc_get (the default encoding should only be changed if autodetection fails on stringi load).
Functions which allow "bytes" encoding markings are very rare in stringi, and were carefully selected. These are: stri_enc_toutf8 (with argument is_unknown_8bit=TRUE), stri_enc_toascii, and stri_encode.
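As a rough sketch of how a "bytes"-marked string might be passed to one of these functions: the input bytes below are arbitrary, the stringi call is guarded so the code also runs where the package is not installed, and the exact replacement characters produced may depend on the stringi version.

```r
# A string with one ASCII byte and one byte above 127, marked as "bytes"
x <- rawToChar(as.raw(c(0x61, 0xB1)))
Encoding(x) <- "bytes"

if (requireNamespace("stringi", quietly = TRUE)) {
  # With is_unknown_8bit = TRUE, bytes above 127 are not interpreted
  # in any particular encoding; they are substituted instead
  y <- stringi::stri_enc_toutf8(x, is_unknown_8bit = TRUE)
  y_is_valid <- validUTF8(y)
}
```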
Apart from automatic conversion from the native encoding, you may re-encode a string manually, for example when you load it from a file saved on a different platform. Call stri_enc_list for the list of encodings supported by ICU. Note that converter names are case-insensitive and ICU tries to normalize the encoding specifiers. Leading zeroes are ignored in sequences of digits (if further digits follow), and all non-alphanumeric characters are ignored. Thus the strings "UTF-8", "utf_8", "u*Tf08" and "Utf 8" are equivalent.
The stri_encode function allows you to convert between any given encodings (in some cases you will obtain "bytes"-marked strings, or even lists of raw vectors, i.e. for UTF-16). There are also some useful, more specialized functions, like stri_enc_toutf32 (converts a character vector to a list of integers, where one code point is exactly one numeric value) or stri_enc_toascii (substitutes all non-ASCII bytes with the SUBSTITUTE CHARACTER, which plays a role similar to R's NA value).
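The conversion functions mentioned above might be used as in the sketch below. The sample string is arbitrary, and the stringi calls are guarded so the code also runs where the package is not installed. The to_raw argument of stri_encode is assumed to be available, as it is in current stringi releases.

```r
x <- intToUtf8(c(97L, 261L))   # "a" followed by U+0105
base_cps <- utf8ToInt(x)       # 97 261, the base R view of the code points

if (requireNamespace("stringi", quietly = TRUE)) {
  # One integer vector of code points per input string
  cps <- stringi::stri_enc_toutf32(x)

  # Non-ASCII content is substituted
  ascii <- stringi::stri_enc_toascii(x)

  # UTF-16 output cannot be stored in an R string, so ask for raw vectors
  utf16 <- stringi::stri_encode(x, from = "UTF-8", to = "UTF-16BE",
                                to_raw = TRUE)

  # Encoding names are matched loosely: both names below denote UTF-8
  roundtrip <- stringi::stri_encode(x, from = "utf_8", to = "u*Tf08")
}
```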
There are also some routines for automated encoding detection, see e.g. stri_enc_detect (for ICU-provided facilities) or stri_enc_detect2 for our own, locale-sensitive solution.
Given a text file, one has to know how to interpret (encode) raw data in order to obtain meaningful information.
Encoding detection is always an imprecise operation and needs a considerable amount of data. However, in the case of some encodings (like UTF-8, ASCII, or UTF-32) a “false positive” byte sequence is quite rare (statistically).
Check out stri_enc_detect and stri_enc_detect2 (among others) for useful functions from this category.
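A minimal sketch of encoding detection with stri_enc_detect, assuming stringi is installed (the call is guarded otherwise). The Polish sample sentence is an arbitrary choice and is repeated to give the detector more data. The guesses and their confidence values depend on the ICU version, so no particular output is claimed here.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  # Sample text containing several Polish letters, repeated 20 times
  sentence <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144."
  text <- paste(rep(sentence, 20), collapse = " ")

  # Re-encode to ISO-8859-2 and keep the result as a raw vector,
  # imitating bytes read from a file of unknown encoding
  raw_latin2 <- stringi::stri_encode(text, from = "UTF-8",
                                     to = "ISO-8859-2", to_raw = TRUE)[[1]]

  # Ask ICU for its guesses; one element is returned per input
  guesses <- stringi::stri_enc_detect(raw_latin2)
  print(guesses[[1]])
}
```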
Unicode Basics – ICU User Guide, http://userguide.icu-project.org/unicode
Conversion – ICU User Guide, http://userguide.icu-project.org/conversion
Converters – ICU User Guide, http://userguide.icu-project.org/conversion/converters (technical details)
UTF-8, UTF-16, UTF-32 & BOM – ICU FAQ, http://www.unicode.org/faq/utf_bom.html
Other encoding_conversion: stri_conv, stri_encode; stri_enc_fromutf32; stri_enc_toascii; stri_enc_toutf32; stri_enc_toutf8
Other encoding_detection: stri_enc_detect2; stri_enc_detect; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8
Other encoding_management: stri_enc_get, stri_enc_set; stri_enc_info; stri_enc_list
Other encoding_normalization: stri_enc_isnfc, stri_enc_isnfd, stri_enc_isnfkc, stri_enc_isnfkc_casefold, stri_enc_isnfkd, stri_enc_nfc, stri_enc_nfd, stri_enc_nfkc, stri_enc_nfkc_casefold, stri_enc_nfkd
Other stringi_general_topics: stringi-arguments; stringi-locale; stringi-package; stringi-search-charclass; stringi-search-fixed; stringi-search-regex; stringi-search