stri_encode: Convert Strings Between Given Encodings

Description

These functions convert a character vector between encodings.

Usage

stri_encode(str, from = NULL, to = NULL, to_raw = FALSE)

stri_conv(str, from = NULL, to = NULL, to_raw = FALSE)

Arguments

str

character vector, a raw vector, or a list of raw vectors to be converted

from

input encoding: NULL or "" for default encoding or internal encoding marks usage (see Details); otherwise, a single string with encoding name, see stri_enc_list

to

target encoding: NULL or "" for default encoding (see stri_enc_get), or a single string with encoding name

to_raw

single logical value; indicates whether a list of raw vectors shall be returned rather than a character vector

Details

These two functions aim to replace R's iconv: not only are they slightly faster, but they also work in the same manner on all platforms. stri_conv is an alias for stri_encode.

Please refer to stri_enc_list for the list of supported encodings and to stringi-encoding for a general discussion.

If from is missing, "", or NULL and str is a character vector, then the input strings' encoding marks are used (just like in almost all stringi functions; bytes marks are disallowed). In other words, each input string is converted from ASCII, UTF-8, or the current default encoding, see stri_enc_get. If from names an encoding, the internal encoding marks are overridden by that encoding. If, on the other hand, str is a list of raw vectors and from is not given, the input is assumed to be in the current default encoding.
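The difference between relying on encoding marks and overriding them can be sketched as follows. This is a minimal illustration, not taken from the package's own examples; the sample word and the variable names are made up for the purpose.

```r
library(stringi)

# "\xe9" is the single byte for e-acute in ISO-8859-1 (latin1).
x <- "caf\xe9"
Encoding(x) <- "latin1"   # mark the string so that the mark can be honoured

# from = NULL: the encoding mark on x determines the source encoding.
y1 <- stri_encode(x, from = NULL, to = "UTF-8")

# from given explicitly: the mark is overridden, the bytes are read as ISO-8859-1.
y2 <- stri_encode(x, from = "ISO-8859-1", to = "UTF-8")

# A list of raw vectors carries no marks, so from is stated explicitly here.
y3 <- stri_encode(list(as.raw(c(0x63, 0x61, 0x66, 0xe9))),
                  from = "ISO-8859-1", to = "UTF-8")

print(charToRaw(y1))   # the UTF-8 bytes of the converted string
print(Encoding(y1))    # the encoding mark on the result
```

All three calls should yield the same UTF-8 string, since in each case the input bytes end up being interpreted as ISO-8859-1.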

For to_raw=FALSE, the output strings always have marked encodings according to the target converter used (as specified by to) and the current default Encoding (ASCII, latin1, UTF-8, native, or bytes in all other cases).

Note that problems may occur when to is set to, e.g., UTF-16 or UTF-32, as the output strings may contain embedded NULs. In such cases use to_raw=TRUE and consider specifying a byte order marker (BOM) for portability reasons (e.g., set UTF-16 or UTF-32, which automatically adds BOMs).
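A small sketch of the to_raw=TRUE route for UTF-16 output follows. The byte order chosen by the plain "UTF-16" converter is not asserted here, only that a BOM is present; the variable names are illustrative.

```r
library(stringi)

# UTF-16BE writes no BOM; "A" (U+0041) becomes the two bytes 00 41,
# the first of which is a NUL that a character string could not hold.
be <- stri_encode("A", from = "UTF-8", to = "UTF-16BE", to_raw = TRUE)[[1]]
print(be)

# Plain "UTF-16" prepends a BOM, so the result is two bytes longer.
with_bom <- stri_encode("A", from = "UTF-8", to = "UTF-16", to_raw = TRUE)[[1]]
print(with_bom)

# The raw vector converts back; the BOM tells the converter the byte order.
back <- stri_encode(list(with_bom), from = "UTF-16", to = "UTF-8")
print(back)
```

Because the result is a list of raw vectors, it can be written to a file with writeBin without the embedded NULs being a problem.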

Note that stri_encode(as.raw(data), "8bitencodingname") is a wise substitute for rawToChar.
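As an illustration of this substitution, the sketch below decodes a raw vector holding ISO-8859-1 bytes. The byte values and the explicit to = "UTF-8" are choices made for this example only.

```r
library(stringi)

# The bytes of "naive" with i-diaeresis (0xef) in ISO-8859-1.
bytes <- as.raw(c(0x6e, 0x61, 0xef, 0x76, 0x65))

# rawToChar(bytes) would return the bytes unconverted and unmarked;
# stri_encode decodes them from the stated 8-bit encoding instead.
s <- stri_encode(bytes, from = "ISO-8859-1", to = "UTF-8")

print(s)
print(stri_length(s))   # number of code points, not bytes
```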

Currently, if an incorrect code point is found on input, it is replaced by the default (for that target encoding) substitute character and a warning is generated.
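One way to observe this behaviour is to feed in a byte sequence that is not well-formed in the declared source encoding and collect any warnings. This is only a sketch: the exact warning text and the substitute character depend on the converter, so neither is shown as expected output.

```r
library(stringi)

# 0xff can never occur in well-formed UTF-8.
bad <- as.raw(c(0x61, 0xff, 0x62))

msgs <- character(0)
res <- withCallingHandlers(
  stri_encode(bad, from = "UTF-8", to = "UTF-8"),
  warning = function(w) {
    # Record the warning message and keep going.
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)   # warnings raised during conversion, if any
print(res)    # result with the offending byte substituted
```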

Value

If to_raw is FALSE, a character vector with encoded strings (and sensible encoding marks) is returned. Otherwise, a list of raw vectors is returned.

References

Conversion – ICU User Guide, http://userguide.icu-project.org/conversion

Converters – ICU User Guide, http://userguide.icu-project.org/conversion/converters (technical details)

See Also

Other encoding_conversion: stri_enc_fromutf32; stri_enc_toascii; stri_enc_toutf32; stri_enc_toutf8; stringi-encoding


stringi documentation built on May 2, 2019, 4:54 p.m.