Guess and repair faulty character encoding.

Description

These functions help you respond to web pages that declare incorrect encodings. You can use guess_encoding to figure out what the real encoding is (and then supply that to the encoding argument of read_html), or use repair_encoding to fix character vectors after the fact.

Usage

guess_encoding(x)

repair_encoding(x, from = NULL)

Arguments

x

A character vector.

from

The encoding that the string is actually in. If NULL, guess_encoding will be used.

stringi

These functions are wrappers around tools from the fantastic stringi package, so you'll need to make sure you have it installed.
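To give a feel for the kind of stringi tooling involved, here is a minimal sketch that detects and converts an encoding with stringi directly. This page does not say which stringi functions the wrappers call, so the use of stri_enc_detect and stri_encode here is an assumption, and the sample bytes are made up for illustration.

```r
library(stringi)

# Hypothetical sample: the word "café" stored as ISO-8859-1 bytes
bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))

# Ranked encoding guesses for the bytes; very short inputs like this one
# give unreliable guesses, so treat the result as a hint only
guesses <- stri_enc_detect(bytes)

# Convert from a known source encoding to UTF-8
fixed <- stri_encode(bytes, from = "ISO-8859-1", to = "UTF-8")
print(fixed)
```

Detection is statistical, so it works best on longer text such as a whole page, and the top guess should still be checked by eye.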

Examples

library(rvest)

# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")

# Parse using the encoding the page declares, which is wrong for this file
x <- read_html(path)
x %>% html_nodes("p") %>% html_text()

# Ask for candidate encodings
guess_encoding(x)

# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") %>% html_nodes("p") %>% html_text()
read_html(path, encoding = "ISO-8859-2") %>% html_nodes("p") %>% html_text()
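Why can two encodings both be "valid" while only one is correct? The following sketch illustrates this with base R's iconv rather than rvest. The sample bytes are a hypothetical illustration, not the contents of bad-encoding.html. Every byte is defined in both ISO-8859-1 and ISO-8859-2, so both conversions succeed, but they produce different text.

```r
# Hypothetical sample: the Polish word "zając" stored as ISO-8859-2 bytes
bytes <- as.raw(c(0x7a, 0x61, 0x6a, 0xb1, 0x63))

# Both conversions succeed without error, because byte 0xb1 is a defined
# character in each encoding
as_latin1 <- iconv(list(bytes), from = "ISO-8859-1", to = "UTF-8")
as_latin2 <- iconv(list(bytes), from = "ISO-8859-2", to = "UTF-8")

# Byte 0xb1 becomes U+00B1 (plus-minus sign) under ISO-8859-1
# and U+0105 (a with ogonek) under ISO-8859-2
print(as_latin1)
print(as_latin2)
```

Neither conversion can tell you which reading is right, which is why the output has to be inspected by someone who knows what the text should say.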