This function tries to detect the character encoding of a text when the language of the text is known.
[THIS IS AN EXPERIMENTAL FUNCTION]
stri_enc_detect2(str, locale = NULL)
str | character vector, a raw vector, or a list of
locale |
Vectorized over str.
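As an illustration of the call shape only, here is a minimal sketch. It assumes that `locale` accepts a locale identifier string such as "pl_PL" (the argument description is missing above, so this is an assumption), and it guards the call because this experimental function may not be present in the installed version of stringi. The sample phrase is an illustrative input, not data from the package.

```r
# Polish phrase containing all nine diacritic letters, re-encoded to
# windows-1250 to mimic input whose encoding is unknown.
txt   <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- iconv(txt, from = "UTF-8", to = "CP1250", toRaw = TRUE)[[1]]

res <- NULL
# Guard: the function is experimental and may be absent from stringi.
if (requireNamespace("stringi", quietly = TRUE) &&
    "stri_enc_detect2" %in% getNamespaceExports("stringi")) {
  # Assumption of this sketch: a locale identifier string is accepted.
  res <- stringi::stri_enc_detect2(bytes, locale = "pl_PL")
}
```

If the guard fails, `res` stays NULL and nothing is detected; the sketch makes no claim about what the function returns for this input.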
First, the text is checked to see whether it is valid UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 (as in stri_enc_detect, this is loosely based on ICU's i18n/csrucode.cpp, although we do it in our own way) or ASCII.
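The idea of this first stage can be sketched in base R for the ASCII and UTF-8 cases only. This is not the package's implementation: the helper names are hypothetical, the UTF-16 and UTF-32 checks are left out, and testing ASCII before UTF-8 is a choice made here so that pure ASCII input is labelled as such.

```r
# TRUE if every byte is below 0x80.
is_ascii_bytes <- function(bytes) all(as.integer(bytes) < 128L)

# TRUE if the bytes form valid UTF-8. Embedded NUL bytes are treated as
# invalid here because rawToChar() cannot represent them.
is_utf8_bytes <- function(bytes) {
  if (any(bytes == as.raw(0))) return(FALSE)
  validUTF8(rawToChar(bytes))
}

# Hypothetical first-stage check: returns a label, or NA if neither matches.
first_stage <- function(bytes) {
  if (is_ascii_bytes(bytes)) {
    "ASCII"
  } else if (is_utf8_bytes(bytes)) {
    "UTF-8"
  } else {
    NA_character_
  }
}

first_stage(charToRaw("abc"))        # plain ASCII bytes
first_stage(as.raw(c(0xc5, 0xbc)))   # U+017C encoded as UTF-8
first_stage(as.raw(c(0xbf, 0xf3)))   # stray continuation byte: not UTF-8
```

For real use, the stri_enc_isascii, stri_enc_isutf8 and related functions listed under See Also are the intended tools.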
If locale is not NA and the above fails, the text is checked for the number of occurrences of language-specific code points (data provided by the ICU library) converted to all possible 8-bit encodings that fully cover the indicated language. The encoding with the greatest total number of byte hits is selected.
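The byte-hit idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the character sample (`pl_chars`) and the candidate list are hypothetical stand-ins for the data the function takes from ICU, the helper names are invented, and the encoding names "ISO-8859-2" and "CP1250" are assumed to be supported by the platform's iconv.

```r
# Hypothetical sample of language-specific characters (Polish diacritics).
pl_chars <- c("\u0105", "\u0107", "\u0119", "\u0142", "\u0144",
              "\u00f3", "\u015b", "\u017a", "\u017c")

# Hypothetical candidate 8-bit encodings that cover these characters.
candidates <- c("ISO-8859-2", "CP1250")

# Count input bytes that match the language-specific characters
# once those characters are converted to encoding `enc`.
count_hits <- function(bytes, enc) {
  marks <- unlist(iconv(pl_chars, from = "UTF-8", to = enc, toRaw = TRUE))
  sum(as.integer(bytes) %in% as.integer(marks))
}

# Pick the candidate with the greatest number of byte hits.
guess_8bit <- function(bytes) {
  hits <- vapply(candidates, count_hits, integer(1), bytes = bytes)
  names(hits)[which.max(hits)]
}

# Sample input: a Polish phrase stored as windows-1250 bytes.
txt   <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- iconv(txt, from = "UTF-8", to = "CP1250", toRaw = TRUE)[[1]]
guess_8bit(bytes)
```

The two candidates place six of the nine letters at the same byte values and differ on the other three, so short inputs give close counts. That is consistent with the advice to supply at least a few hundred bytes of text.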
The guess is of course imprecise, as it is obtained using statistics [This is DRAFT API - still does not work as expected]. Because of this, detection works best if you supply at least a few hundred bytes of character data in a single language.
If you have no initial guess about the language and encoding, try stri_enc_detect (which uses ICU facilities). However, it turns out (empirically) that stri_enc_detect2 works better than the ICU-based function if UTF-* text is provided. Test it yourself.
Just like stri_enc_detect, this function returns a list of length equal to the length of str.
Each list element is a list with the following three named components:
Encoding – string; guessed encodings; NA on failure (iff encodings is empty),
Language – always NA,
Confidence – numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.
The guesses are ordered w.r.t. nonincreasing confidence.
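To show how such a result might be consumed, the sketch below uses a hand-built stand-in with the three components described above. Every value in it is hypothetical (including the encoding names, the confidence figures, and the length of the Language component); it is not output from the function.

```r
# Hand-built stand-in for a result of length 1 (all values hypothetical).
res <- list(list(
  Encoding   = c("windows-1250", "ISO-8859-2"),
  Language   = c(NA_character_, NA_character_),
  Confidence = c(0.9, 0.6)
))

# Guesses are ordered by nonincreasing confidence, so the first entry of
# each element is the preferred guess.
top_encoding   <- vapply(res, function(el) el$Encoding[1], character(1))
top_confidence <- vapply(res, function(el) el$Confidence[1], numeric(1))
```

Because the first entry may be NA on failure, code that consumes the result should check `is.na(top_encoding)` before using it.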
Other encoding_detection: stri_enc_detect; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8; stringi-encoding
Other locale_sensitive: stri_cmp, stri_compare; stri_count_fixed; stri_detect_fixed; stri_locate_all_fixed, stri_locate_first_fixed, stri_locate_last_fixed; stri_opts_collator; stri_order, stri_sort; stri_replace_all_fixed, stri_replace_first_fixed, stri_replace_last_fixed; stri_split_fixed; stri_trans_tolower, stri_trans_totitle, stri_trans_toupper; stringi-locale; stringi-search-fixed