stri_enc_detect2: [DRAFT API] Detect Locale-Sensitive Character Encoding

Description

This function attempts to detect the character encoding of a text whose language is known.

[THIS IS AN EXPERIMENTAL FUNCTION]

Usage

stri_enc_detect2(str, locale = NULL)

Arguments

str

character vector, a raw vector, or a list of raw vectors

locale

NULL or "" for the default locale, NA to check only the UTF-* family, or a single string with a locale identifier.
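The three forms of the locale argument can be tried as in the minimal sketch below. It assumes that the installed stringi release exports stri_enc_detect2 (this is a draft API and may be absent from other releases). The sample text and the "pl_PL" locale are chosen only for illustration.

```r
library(stringi)

# Sample input (an assumption for illustration): Polish text as windows-1250 bytes.
txt   <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- stri_encode(txt, from = "UTF-8", to = "windows-1250", to_raw = TRUE)[[1]]

res_default <- stri_enc_detect2(bytes)                    # locale = NULL: default locale
res_utf     <- stri_enc_detect2(bytes, locale = NA)       # only the UTF-* family is checked
res_polish  <- stri_enc_detect2(bytes, locale = "pl_PL")  # explicit locale identifier
```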

Details

Vectorized over str.

First, the text is checked for being valid UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8, or ASCII. As in stri_enc_detect, this check is loosely based on ICU's i18n/csrucode.cpp, although it is implemented independently.
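The individual UTF-* and ASCII checks can be explored with the predicate functions listed under See Also. The sketch below is not the internal routine of stri_enc_detect2. It only applies those related predicates to an illustrative sample string encoded in several ways.

```r
library(stringi)

txt <- "za\u017c\u00f3\u0142\u0107"   # sample non-ASCII text (illustrative)

# Encode the same text in several ways; to_raw = TRUE yields a list of raw vectors.
utf8    <- stri_encode(txt, from = "UTF-8", to = "UTF-8",    to_raw = TRUE)
utf16le <- stri_encode(txt, from = "UTF-8", to = "UTF-16LE", to_raw = TRUE)
utf32be <- stri_encode(txt, from = "UTF-8", to = "UTF-32BE", to_raw = TRUE)

stri_enc_isascii(utf8)       # FALSE: the text contains non-ASCII bytes
stri_enc_isutf8(utf8)        # TRUE: the bytes form valid UTF-8
stri_enc_isutf16le(utf16le)  # check the UTF-16LE bytes
stri_enc_isutf32be(utf32be)  # check the UTF-32BE bytes
```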

If locale is not NA and the above check fails, the function counts occurrences of language-specific code points (data provided by the ICU library), converted to every 8-bit encoding that fully covers the indicated language. The encoding with the greatest total number of byte hits is selected.
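The byte-hit idea can be sketched as follows. This is a simplified illustration, not the function's actual implementation: the set of Polish characters, the two candidate encodings, and the helper count_hits are all hypothetical stand-ins for the data that the real function takes from ICU.

```r
library(stringi)

# Hypothetical sample of language-specific characters (Polish diacritics).
polish_chars <- "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
# Hypothetical candidate list of 8-bit encodings that cover Polish.
candidates <- c("ISO-8859-2", "windows-1250")

# Count how many input bytes match the language-specific characters
# when those characters are converted to the given encoding.
count_hits <- function(bytes, enc) {
  marker <- stri_encode(polish_chars, from = "UTF-8", to = enc, to_raw = TRUE)[[1]]
  sum(bytes %in% marker)
}

# Test input: Polish text stored as windows-1250 bytes.
txt   <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- stri_encode(txt, from = "UTF-8", to = "windows-1250", to_raw = TRUE)[[1]]

hits <- vapply(candidates, function(enc) count_hits(bytes, enc), integer(1))
hits                     # byte hits per candidate encoding
names(which.max(hits))   # the candidate with the most hits
```

Here the two encodings differ in the byte values of only a few letters, so the hit counts differ only slightly. This shows why longer input gives a more reliable guess.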

Because the guess is obtained statistically, it is necessarily imprecise. [This is DRAFT API - still does not work as expected.] Detection works best if you supply at least a few hundred bytes of character data in a single language.

If you have no initial guess about the language and encoding, try stri_enc_detect, which uses ICU facilities. Empirically, however, stri_enc_detect2 works better than the ICU-based function when UTF-* text is provided. Test it yourself.
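A side-by-side comparison on your own data might look like the sketch below. It assumes a stringi release that exports both functions. The sample text and the "pl_PL" locale are illustrative, and no particular outcome is implied.

```r
library(stringi)

# UTF-8 sample text (illustrative).
txt <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

icu_guess   <- stri_enc_detect(txt)                     # ICU-based detector
draft_guess <- stri_enc_detect2(txt, locale = "pl_PL")  # locale-sensitive draft API

# Inspect the guesses for the first (and only) input string.
str(icu_guess[[1]])
str(draft_guess[[1]])
```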

Value

Just like stri_enc_detect, this function returns a list of length equal to the length of str. Each list element is a list with the following three named components:

The guesses are ordered by nonincreasing confidence.
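The shape of the return value can be inspected as in this minimal sketch, which passes a list of two raw vectors. It assumes that stri_enc_detect2 is available in the installed stringi release. The inputs are illustrative only.

```r
library(stringi)

# Two illustrative inputs: plain ASCII bytes and Polish text as ISO-8859-2 bytes.
inputs <- list(
  charToRaw("plain ASCII text"),
  stri_encode("g\u0119\u015bl\u0105", from = "UTF-8", to = "ISO-8859-2",
              to_raw = TRUE)[[1]]
)

res <- stri_enc_detect2(inputs, locale = "pl_PL")

length(res) == length(inputs)   # one list element per input
str(res[[2]])                   # inspect the named components of one result
```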

See Also

Other encoding_detection: stri_enc_detect; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8; stringi-encoding

Other locale_sensitive: stri_cmp, stri_compare; stri_count_fixed; stri_detect_fixed; stri_locate_all_fixed, stri_locate_first_fixed, stri_locate_last_fixed; stri_opts_collator; stri_order, stri_sort; stri_replace_all_fixed, stri_replace_first_fixed, stri_replace_last_fixed; stri_split_fixed; stri_trans_tolower, stri_trans_totitle, stri_trans_toupper; stringi-locale; stringi-search-fixed


stringi documentation built on May 2, 2019, 4:54 p.m.