detect_file_enc: File encoding detection

Description

This function tries to detect the character encoding of files.

Usage

detect_file_enc(x)

Arguments

x

Character vector, containing file names or paths.

Value

A character vector of the same length as x, containing the guessed iconv-compatible encoding names.
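
As a rough illustration of the argument and return value described above, the sketch below writes the same text to two temporary files in different encodings and passes both paths in a single call. The file names, the sample text, and the `requireNamespace()` guard are assumptions made for this illustration; the guessed names depend on the detector and are not shown here.

```r
# Sketch only: the detection step runs only if the uchardet package is installed.
# Files are written to a temporary directory that R removes at session end.
tmp_dir <- tempfile("enc_demo_")
dir.create(tmp_dir)

# Russian sample text built from escapes, repeated to give the detector more input
text_utf8 <- strrep("\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c. ", 5)

# Hypothetical file names, one per encoding
paths <- file.path(tmp_dir, c("utf8.txt", "koi8r.txt"))

# Write the raw bytes of the text in UTF-8 and in KOI8-R
writeBin(charToRaw(enc2utf8(text_utf8)), paths[1])
writeBin(iconv(text_utf8, from = "UTF-8", to = "KOI8-R", toRaw = TRUE)[[1]], paths[2])

if (requireNamespace("uchardet", quietly = TRUE)) {
  # One guess per path, returned in the same order as the input
  enc <- uchardet::detect_file_enc(paths)
  print(enc)
}
```

Passing all paths at once keeps each guess aligned with its file, so the result can be stored next to the paths, for example in a data frame.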

Examples

# detect character vector with ASCII strings
ascii <- "I can eat glass and it doesn't hurt me."
detect_str_enc(ascii)

# detect character vector with UTF-8 strings
utf8 <- "\u4e0b\u5348\u597d"
print(utf8)
detect_str_enc(utf8)

# function to read ASCII or UTF-8 files
read_file <- function(x) readChar(x, file.size(x))
# path to examples
ex_path <- system.file("examples", package = "uchardet")

# Russian text
ru_utf8 <- read_file(file.path(ex_path, "ru.txt"))
print(ru_utf8)
detect_str_enc(iconv(ru_utf8, "utf8", "ibm866"))
detect_str_enc(iconv(ru_utf8, "utf8", "koi8-r"))
detect_str_enc(iconv(ru_utf8, "utf8", "cp1251"))

# Chinese text
zh_utf8 <- read_file(file.path(ex_path, "zh.txt"))
print(zh_utf8)
detect_str_enc(iconv(zh_utf8, "utf8", "big5"))
detect_str_enc(iconv(zh_utf8, "utf8", "gb18030"))

# Korean text
ko_utf8 <- read_file(file.path(ex_path, "ko.txt"))
print(ko_utf8)
detect_str_enc(iconv(ko_utf8, "utf8", "uhc"))
detect_str_enc(iconv(ko_utf8, "utf8", "iso-2022-kr"))

Example output

[1] "ASCII"
[1] "下午好"
[1] "UTF-8"
[1] "Я могу есть стекло, оно мне не вредит.\n"
[1] "IBM866"
[1] "KOI8-R"
[1] "WINDOWS-1251"
[1] "我能吞下玻璃而不傷身體。\n"
[1] "BIG5"
[1] "GB18030"
[1] "나는 유리를 먹을 수 있어요. 그래도 아프지 않아요\n"
[1] "UHC"
[1] "ISO-2022-KR"
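
Because the Value section describes the returned names as iconv-compatible, a name such as "KOI8-R" from the output above can be handed back to `iconv()` to recover UTF-8 text. The following is a minimal sketch using base R only. The encoding name is hard-coded as a stand-in for a detection result, and whether a given converter is available depends on the platform's iconv implementation.

```r
# Base R only: no detector is called in this sketch.
# "I can eat glass" in Russian, built from escapes so the script stays ASCII-only
ru_utf8 <- "\u042f \u043c\u043e\u0433\u0443 \u0435\u0441\u0442\u044c \u0441\u0442\u0435\u043a\u043b\u043e"

# Simulate legacy input by converting the UTF-8 text to KOI8-R
ru_koi8 <- iconv(ru_utf8, from = "UTF-8", to = "KOI8-R")

# Assumption: stands in for the name a detector reported for ru_koi8
detected <- "KOI8-R"

# Convert back to UTF-8 using the reported name
ru_back <- iconv(ru_koi8, from = detected, to = "UTF-8")

# Check that the round trip preserved the text
identical(ru_back, ru_utf8)
```

If `iconv()` returns `NA`, the reported name is either not supported on the platform or does not match the actual bytes, so it is worth checking the result before using it.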

uchardet documentation built on Sept. 2, 2020, 9:07 a.m.