stri_enc_detect: [DRAFT API] Detect Character Set and Language
In stringi: THE string processing package for R


This function uses the ICU engine to determine the character set, or encoding, of character data in an unknown format.

stri_enc_detect(str, filter_angle_brackets = FALSE)

`str`	character vector, a raw vector, or a list of `raw` vectors
`filter_angle_brackets`	logical; if filtering is enabled, text within angle brackets ("<" and ">") is removed before detection, which strips most HTML or XML markup.

Vectorized over str and filter_angle_brackets.
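As a minimal sketch of this vectorization, the call below passes a list of two raw vectors together with a two-element filter_angle_brackets. The sample strings are made up for illustration, and the guesses returned will depend on the ICU version in use.

```r
library(stringi)

# Hypothetical sample inputs, supplied as a list of raw vectors
inputs <- list(
  charToRaw("The quick brown fox jumps over the lazy dog, again and again."),
  charToRaw("<p>The quick brown fox jumps over the lazy dog.</p>")
)

# One filter setting per input: keep markup for the first, drop it for the second
guesses <- stri_enc_detect(inputs, filter_angle_brackets = c(FALSE, TRUE))

length(guesses)      # one list element per input
names(guesses[[1]])  # includes Encoding, Language and Confidence
```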

This is, at best, an imprecise operation based on statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that is mostly in a single language. However, because the detection only looks at a limited amount of the input byte data, some of the returned charsets may fail to handle all of the input data. Note that in some cases, the language can be determined along with the encoding.

Several different techniques are used for character set detection. For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also checked against a list of frequently used characters in that encoding. For single-byte encodings, the data is checked against a list of the most commonly occurring three-letter groups for each language that can be written using that encoding.
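To make the idea of three-letter groups concrete, here is a toy sketch in base R that counts the trigrams in a string. The helper count_trigrams is hypothetical and is not how ICU implements its detector, which works on bytes and compares them against built-in per-language tables.

```r
# Toy illustration only: count the three-letter groups (trigrams) in a string.
# count_trigrams is a made-up helper, not part of stringi or ICU.
count_trigrams <- function(text) {
  text <- tolower(text)
  n <- nchar(text)
  if (n < 3L) return(table(character(0)))
  starts <- seq_len(n - 2L)
  trigrams <- substring(text, starts, starts + 2L)
  # Most frequent trigrams first
  sort(table(trigrams), decreasing = TRUE)
}

tri <- count_trigrams("the theme of the thesis")
head(tri, 3)  # "the" is the most frequent trigram here
```

A detector built on this idea would compare such counts with reference frequencies for each candidate language and encoding, then score the match.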

The detection process can be configured to optionally ignore HTML or XML style markup (using ICU's internal facilities), which can interfere with the detection process by changing the statistics.
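The effect of ignoring markup can be approximated in base R as shown below. The regular expression and the helper strip_angle_brackets are assumptions made for illustration; ICU applies its own internal filter, which is selected in stringi by setting filter_angle_brackets = TRUE.

```r
# Rough stand-in for markup filtering: drop everything between "<" and ">".
# This regular expression is an illustrative assumption, not ICU's code.
strip_angle_brackets <- function(x) gsub("<[^>]*>", "", x)

page <- "<p>Hello <b>world</b></p>"
strip_angle_brackets(page)  # tag names no longer take part in the statistics
```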

This function should most often be used for byte-marked input strings, especially after loading them from text files and before the main conversion with stri_encode. The input encoding is of course not taken into account here, even if marked.
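A sketch of that workflow follows, under stated assumptions: the temporary file and its Polish sample text exist only to keep the example self-contained, and taking the first guess is just one possible policy. The top guess may be wrong, so the converted text should still be inspected.

```r
library(stringi)

# Hypothetical setup: write ISO-8859-2 bytes to a temporary file.
# In practice the file of unknown encoding would already exist.
path <- tempfile(fileext = ".txt")
original <- paste(
  "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.",
  "Pchn\u0105\u0107 w t\u0119 \u0142\u00f3d\u017a je\u017ca lub o\u015bm skrzy\u0144 fig."
)
writeBin(stri_encode(original, to = "ISO-8859-2", to_raw = TRUE)[[1]], path)

# Read the raw bytes and hand them to the detector as they are
bytes <- readBin(path, "raw", file.info(path)$size)
guess <- stri_enc_detect(bytes)[[1]]
best  <- guess$Encoding[1]

# Convert to UTF-8 only if the detector produced a guess
if (!is.na(best)) {
  text <- stri_encode(bytes, from = best, to = "UTF-8")
}
unlink(path)
```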

The following table shows all the encodings that can be detected:

Character_Set	Languages
UTF-8	--
UTF-16BE	--
UTF-16LE	--
UTF-32BE	--
UTF-32LE	--
Shift_JIS	Japanese
ISO-2022-JP	Japanese
ISO-2022-CN	Simplified Chinese
ISO-2022-KR	Korean
GB18030	Chinese
Big5	Traditional Chinese
EUC-JP	Japanese
EUC-KR	Korean
ISO-8859-1	Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish
ISO-8859-2	Czech, Hungarian, Polish, Romanian
ISO-8859-5	Russian
ISO-8859-6	Arabic
ISO-8859-7	Greek
ISO-8859-8	Hebrew
ISO-8859-9	Turkish
windows-1250	Czech, Hungarian, Polish, Romanian
windows-1251	Russian
windows-1252	Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish
windows-1253	Greek
windows-1254	Turkish
windows-1255	Hebrew
windows-1256	Arabic
KOI8-R	Russian
IBM420	Arabic
IBM424	Hebrew

If you have an initial guess about the language and encoding, try stri_enc_detect2.

Returns a list of length equal to the length of str. Each list element is a list with the following three named vectors representing all guesses:

Encoding – string; guessed encodings; NA on failure,
Language – string; guessed languages; NA if the language could not be determined (e.g. in case of UTF-8),
Confidence – numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.

The guesses are ordered by nonincreasing confidence.
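Because of this ordering, the first entry of each vector is the most confident guess. The sketch below collects the top guess for every input into a data frame. The sample strings and the column names of top are illustrative choices, and the actual guesses depend on the ICU version.

```r
library(stringi)

# Hypothetical sample inputs
inputs <- c(
  "Hello, this is plain ASCII text.",
  "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
)
res <- stri_enc_detect(inputs)

# Take the first (most confident) guess from each result
top <- data.frame(
  encoding   = vapply(res, function(g) as.character(g$Encoding[1]), character(1)),
  language   = vapply(res, function(g) as.character(g$Language[1]), character(1)),
  confidence = vapply(res, function(g) as.numeric(g$Confidence[1]), numeric(1)),
  stringsAsFactors = FALSE
)
top
```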

Character Set Detection – ICU User Guide, http://userguide.icu-project.org/conversion/detection

Other encoding_detection: stri_enc_detect2; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8; stringi-encoding

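## Not run: 
library(stringi)
# read up to the first 1024 bytes of the file as a single string
f <- rawToChar(readBin("test.txt", "raw", 1024))
# list the candidate encodings, most confident guess first
stri_enc_detect(f)

## End(Not run)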

stringi documentation built on May 2, 2019, 4:54 p.m.
