encoding: Guess and repair faulty character encoding.

Description

These functions help you respond to web pages that declare incorrect encodings. Use guess_encoding to work out what the real encoding is (and then supply that to the encoding argument of read_html), or use repair_encoding to fix character vectors after the fact.

Usage

guess_encoding(x)

repair_encoding(x, from = NULL)

Arguments

x

A character vector.

from

The encoding that the string is actually in. If NULL,

stringi

These functions are wrappers around tools from the fantastic stringi package, so make sure you have it installed.
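To give a feel for the underlying tools, here is a minimal sketch that calls stringi directly. It assumes the relevant functions are stri_enc_detect (detection) and stri_encode (conversion); which stringi calls rvest uses internally is not stated on this page, and the byte vector is made up for illustration.

```r
library(stringi)

# Illustrative input: the bytes of "Émigré" as stored in ISO-8859-1 (Latin-1).
latin1_bytes <- as.raw(c(0xC9, 0x6D, 0x69, 0x67, 0x72, 0xE9))

# Detection: stri_enc_detect() returns a list with one data frame of
# candidate encodings per input (columns Encoding, Language, Confidence).
guesses <- stri_enc_detect(latin1_bytes)[[1]]
print(guesses)

# Conversion: interpret the bytes as ISO-8859-1 and re-encode them as UTF-8.
fixed <- stri_encode(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")
print(fixed)
```

With an input this short, the detector's guesses are unreliable, so the conversion step names the source encoding explicitly rather than trusting the top guess.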

Examples

library(rvest)

# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)
x %>% html_nodes("p") %>% html_text()

guess_encoding(x)
# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") %>% html_nodes("p") %>% html_text()
read_html(path, encoding = "ISO-8859-2") %>% html_nodes("p") %>% html_text()

Example output

Loading required package: xml2
[1] "<c9>migr<U+00E9> cause c<U+00E9>l<U+00E8>bre d<U+00E9>j<U+00E0> vu."
    encoding language confidence
1 ISO-8859-1       fr       0.31
2 ISO-8859-2       ro       0.22
3    GB18030       zh       0.10
4       Big5       zh       0.10
5 ISO-8859-9       tr       0.06
6 IBM424_rtl       he       0.01
7 IBM424_ltr       he       0.01
[1] "<U+00C9>migr<U+00E9> cause c<U+00E9>l<U+00E8>bre d<U+00E9>j<U+00E0> vu."
[1] "<U+00C9>migr<U+00E9> cause c<U+00E9>l<U+010D>bre d<U+00E9>j<U+0155> vu."
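The guesses in the table above are listed by decreasing confidence, so the top candidate can also be picked programmatically and passed to read_html. The following is a rough sketch only: it assumes guess_encoding returns a data frame with the encoding and confidence columns shown in the example output, and the variable names guesses, best and fixed are introduced here for illustration.

```r
library(rvest)

path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)

# Candidate encodings with their confidence scores.
guesses <- guess_encoding(x)

# Take the candidate with the highest confidence.
best <- guesses$encoding[which.max(guesses$confidence)]

# Re-read the file using that encoding and extract the paragraph text.
fixed <- read_html(path, encoding = best) %>% html_nodes("p") %>% html_text()
print(fixed)
```

A low top score (0.31 in the output above) means the guess deserves a visual check; as the two ISO-8859 variants show, a plausible but wrong encoding still decodes without error.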

rvest documentation built on May 29, 2017, 10:46 a.m.