Converts character strings with declared marked encodings to UTF-8 strings.
stri_enc_toutf8(str, is_unknown_8bit = FALSE, validate = FALSE)
str: a character vector to be converted
is_unknown_8bit: a single logical value, see Details
validate: a single logical value (can be NA), see Details
If is_unknown_8bit is set to FALSE (the default), then R encoding marks are used. Bytes-marked strings will cause the function to fail.
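A minimal sketch of the default mode follows. It assumes the stringi package is installed; the byte values and the variable names x, y, b and res are illustrative only. A string explicitly marked as latin1 is converted according to that mark, while a bytes-marked copy of it is rejected.

```r
library(stringi)

# Byte 0xE9 ("é" in Latin-1) followed by ASCII "t", declared as latin1.
x <- rawToChar(as.raw(c(0xE9, 0x74)))
Encoding(x) <- "latin1"

y <- stri_enc_toutf8(x)   # defaults: is_unknown_8bit = FALSE, validate = FALSE
Encoding(y)               # the result carries the UTF-8 mark
charToRaw(y)              # c3 a9 74: "é" now occupies two bytes

# The same bytes marked as "bytes" make the call fail in this mode.
b <- x
Encoding(b) <- "bytes"
res <- tryCatch(stri_enc_toutf8(b), error = function(e) "error")
res
```

Catching the error with tryCatch keeps the sketch runnable from top to bottom; in real code the failure is a hint to pass is_unknown_8bit = TRUE or to fix the encoding mark first.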
If a string is in UTF-8 and has a byte order mark (BOM), then the BOM will be silently removed from the output string.
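The BOM handling can be checked with a hand-built input. This is a sketch under the assumption that stringi is installed; bom_abc and out are hypothetical names. The input is the three-byte UTF-8 BOM (EF BB BF) followed by "abc".

```r
library(stringi)

# UTF-8 byte order mark (EF BB BF) followed by the ASCII text "abc".
bom_abc <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF, 0x61, 0x62, 0x63)))
Encoding(bom_abc) <- "UTF-8"

out <- stri_enc_toutf8(bom_abc)

length(charToRaw(bom_abc))  # 6 bytes on input
charToRaw(out)              # per the text above, the BOM bytes should be absent here
```

Comparing raw bytes rather than printed strings is deliberate: a BOM is invisible in most terminals, so printing alone would not reveal whether it is still present.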
If the default encoding is UTF-8, then strings marked with native are, for efficiency reasons, returned as-is, i.e., with unchanged markings.
A similar behavior is observed when calling enc2utf8.
For is_unknown_8bit=TRUE, if a string is declared to be neither in ASCII nor in UTF-8, then all byte codes > 127 are replaced with the Unicode REPLACEMENT CHARACTER (\Ufffd). Note that the REPLACEMENT CHARACTER may be interpreted as the Unicode missing value for single characters. Here, a bytes-marked string is assumed to use an 8-bit encoding that extends the ASCII map.
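The following sketch illustrates is_unknown_8bit = TRUE on inputs that are declared neither ASCII nor UTF-8. It assumes stringi is installed; the variable names and byte values are made up for the illustration. The single high byte is expected to become U+FFFD, whose UTF-8 encoding is EF BF BD, while the ASCII bytes pass through.

```r
library(stringi)

# "a", the high byte 0xE9, "b"; once marked latin1, once marked bytes.
x <- rawToChar(as.raw(c(0x61, 0xE9, 0x62)))
Encoding(x) <- "latin1"
b <- x
Encoding(b) <- "bytes"

y  <- stri_enc_toutf8(x, is_unknown_8bit = TRUE)
yb <- stri_enc_toutf8(b, is_unknown_8bit = TRUE)

charToRaw(y)    # 61 ef bf bd 62: the high byte became U+FFFD
charToRaw(yb)   # same bytes; the bytes-marked input is accepted in this mode
```

Note that this mode discards information: the latin1 mark is ignored, so the original "é" cannot be recovered from the output. It suits inputs whose 8-bit encoding is genuinely unknown.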
Moreover, setting validate to TRUE or NA validates the resulting UTF-8 byte stream in both cases. With validate=TRUE, any incorrect byte sequences are replaced with the REPLACEMENT CHARACTER; this option may be used to fix an invalid UTF-8 byte sequence. With validate=NA, a bogus string is replaced with a missing value.
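The two validation modes can be compared on a deliberately malformed input. This is a sketch assuming stringi is installed; bad, fixed and dropped are hypothetical names. The byte 0xFF can never occur in well-formed UTF-8, so a string containing it and marked as UTF-8 is invalid.

```r
library(stringi)

# "a", the invalid byte 0xFF, "b", wrongly declared as UTF-8.
bad <- rawToChar(as.raw(c(0x61, 0xFF, 0x62)))
Encoding(bad) <- "UTF-8"

# validate = TRUE repairs the string by substituting U+FFFD (EF BF BD).
fixed <- stri_enc_toutf8(bad, validate = TRUE)
charToRaw(fixed)   # 61 ef bf bd 62

# validate = NA gives up on the whole string instead.
# suppressWarnings() is used in case a diagnostic warning is emitted.
dropped <- suppressWarnings(stri_enc_toutf8(bad, validate = NA))
is.na(dropped)     # TRUE
```

Which mode to prefer depends on the downstream code: TRUE keeps as much text as possible, whereas NA makes damaged records easy to find with is.na().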
Returns a character vector.
Marek Gagolewski and other contributors
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi: 10.18637/jss.v103.i02