Partly inspired by a Stack Overflow question, and partly by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.
It relies on a locally included version of libtidy and works on macOS, Linux & Windows.
It also incorporates an htmlwidget to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.
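As a rough sketch of the two widgets mentioned above, the following builds both from a small inline XML document. The document itself is made up for illustration, and the widget objects are assigned to variables so the code also runs in a non-interactive session; print them in an interactive session (e.g. RStudio) to render them.

```r
library(htmltidy)
library(xml2)

# A tiny XML document, invented purely for this example
doc <- read_xml(
  "<catalog>
     <book id='b1'><title>First Book</title></book>
     <book id='b2'><title>Second Book</title></book>
   </catalog>"
)

# Syntax-highlighted source view with an XPath filter box for testing queries
viewer <- xml_view(doc, add_filter = TRUE)

# The same view with an XPath query applied up front
titles_only <- xml_view(doc, apply_xpath = "//book/title")

# Collapsible tree view of the same document
tree <- xml_tree_view(doc)
```

Typing `viewer`, `titles_only` or `tree` at the console should open the corresponding widget in the viewer pane or browser.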
library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)
library(purrr)
This is really "un-tidy" content:
res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))
Let's see what tidy_html() does to it.
It can handle the response object directly:
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
But, you'll probably mostly use it on HTML you've identified as gnarly, where you already have the HTML text content handy:
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
NOTE: you could also have passed a url() connection directly:
cat(tidy_html(url("https://rud.is/test/untidy.html"), list(TidyDocType="html5", TidyWrapLen=200)))
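If you'd rather try the string-in, string-out path without hitting the network, here's a minimal sketch. The untidy fragment is a hypothetical stand-in (it is not the content of the test page above); the options are the same ones used in the calls above.

```r
library(htmltidy)

# Hypothetical untidy fragment: unclosed <p> and <b>, no doctype, no <title>
untidy <- "<p>First paragraph<p>Second with <b>bold text"

# Character input comes back as tidied character output
tidied <- tidy_html(untidy, list(TidyDocType="html5", TidyWrapLen=200))

cat(tidied)
```

The result should be a complete document with the missing closing tags filled in.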
You'll see that this differs substantially from the mangling libxml2 does (via read_html()):
pg <- read_html("https://rud.is/test/untidy.html")
cat(toString(pg))
It can also deal with "raw" and parsed objects:
tidy_html(content(res, as="raw"))
tidy_html(content(res, as="text", encoding="UTF-8"))
tidy_html(content(res, as="parsed", encoding="UTF-8"))

tidy_html(suppressWarnings(htmlParse("https://rud.is/test/untidy.html")))
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
## <html xmlns="http://www.w3.org/1999/xhtml">
## <head>
## <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
## <title></title>
## </head>
## <body>
## <p>https://rud.is/test/untidy.html</p>
## </body>
## </html>
And, it can show the markup errors:
invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis <b>
## line 1 column 138 - Warning: missing </span> before <div>
## line 1 column 68 - Warning: missing </b> before <div>
## line 1 column 164 - Warning: inserting implicit <span>
## line 1 column 164 - Warning: missing </span>
## line 1 column 159 - Warning: missing </div>
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: <span> anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!
opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)

txt <- "<html>
  <head>
    <style>
      p { color: red; }
    </style>
  <body>
    <!-- ===== body ====== -->
    <p>Test</p>
  </body>
  <!--Default Zone
  -->
  <!--Default Zone End-->
</html>"

cat(tidy_html(txt, options=opts))
But, you're probably better off running it on plain HTML source.
Since it's C/C++-backed, it's pretty fast:
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
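A single system.time() call can be noisy, so here's a hedged sketch of timing several runs and taking the median. It uses a synthetic document of roughly 200 KB as a stand-in for the book above (the repeated chunk and the run count are arbitrary choices for illustration), so it works offline; the numbers will vary by machine.

```r
library(htmltidy)

# Synthetic untidy input: the same unclosed-<p> chunk repeated many times
chunk <- "<p>Unclosed paragraph with a <b>bold</b> word<br>and a line break\n"
doc <- paste(rep(chunk, 3200), collapse = "")
nchar(doc)

# Time ten runs and keep only the elapsed (wall-clock) seconds of each
timings <- vapply(
  1:10,
  function(i) system.time(tidy_html(doc))[["elapsed"]],
  numeric(1)
)

# The median is less sensitive to a slow first run than the mean
median(timings)
```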
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.