hrbrpkghelpr::global_opts()
hrbrpkghelpr::stinking_badges()
hrbrpkghelpr::yank_title_and_description()

Partly inspired by this SO question, and partly by the great deal of cruddy HTML out there that needs fixing before it can be scraped properly.

It relies on a locally included version of libtidy and works on macOS, Linux & Windows.

It also incorporates an htmlwidget to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.
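As a rough sketch of how the two widgets might be called: the function names `xml_view()` and `xml_tree_view()` and the `add_filter` / `apply_xpath` arguments are recalled from the package documentation rather than stated above, so treat them as assumptions and check `?xml_view` before relying on them. The XML document is made up for illustration.

```r
library(htmltidy)
library(xml2)

# A small, made-up XML document to explore
doc <- read_xml(
  "<catalog><book id='b1'><title>One</title></book><book id='b2'><title>Two</title></book></catalog>"
)

# Syntax-highlighted view with an XPath filter box (assumed argument: add_filter)
w_filter <- xml_view(doc, add_filter = TRUE)

# Same view with an XPath query applied up front (assumed argument: apply_xpath)
w_xpath <- xml_view(doc, apply_xpath = ".//title")

# Tree view of the same document
w_tree <- xml_tree_view(doc)
```

The widgets are only assigned here; printing one (for example `w_tree`) at the console or in an R Markdown chunk renders it.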

What's Inside The Tin

hrbrpkghelpr::describe_ingredients()

Installation

hrbrpkghelpr::install_block()

Usage

library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)
library(purrr)

This is really "un-tidy" content:

res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))

Let's see what tidy_html() does to it.

It can handle the response object directly:

cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))

But you'll probably mostly use it on HTML you've already identified as gnarly and whose text content you already have handy:

cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
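If you want to try this without a network request, the same call works on a string you build yourself. The fragment below is a made-up example (not the content of the test page), and the option names simply mirror the ones used above.

```r
library(htmltidy)

# Hypothetical fragment: unclosed <b> and <p>, no doctype, no <title>
messy <- "<p>Some <b>bold text<p>Another paragraph"

# Same options as the examples above
tidied <- tidy_html(messy, list(TidyDocType = "html5", TidyWrapLen = 200))

# Inspect the repaired document
cat(tidied)
```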

NOTE: you could also just have done:

cat(tidy_html(url("https://rud.is/test/untidy.html"), 
              list(TidyDocType="html5", TidyWrapLen=200)))

You'll see that this differs substantially from the mangling libxml2 does (via read_html()):

pg <- read_html("https://rud.is/test/untidy.html")
cat(toString(pg))

It can also deal with "raw" and parsed objects:

tidy_html(content(res, as="raw"))

tidy_html(content(res, as="text", encoding="UTF-8"))

tidy_html(content(res, as="parsed", encoding="UTF-8"))
tidy_html(suppressWarnings(htmlParse("https://rud.is/test/untidy.html")))
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
## <html xmlns="http://www.w3.org/1999/xhtml">
## <head>
## <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
## <title></title>
## </head>
## <body>
## <p>https://rud.is/test/untidy.html</p>
## </body>
## </html>
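A quick way to check what each input type gives back is to tidy one small document in several forms and look at `class()` of the results. This is only a sketch: the document is made up, and the expectation that the result type follows the input type is my reading of the package documentation, not something shown above.

```r
library(htmltidy)
library(xml2)

# One made-up, slightly broken document in three forms
txt    <- "<html><body><p>Hello <i>world</body></html>"
as_raw <- charToRaw(txt)     # raw bytes
as_xml <- read_html(txt)     # parsed xml2 document

from_txt <- tidy_html(txt)
from_raw <- tidy_html(as_raw)
from_xml <- tidy_html(as_xml)

# Compare what came back for each kind of input
class(from_txt)
class(from_raw)
class(from_xml)
```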

It can also show the markup errors:

invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis <b>
## line 1 column 138 - Warning: missing </span> before <div>
## line 1 column 68 - Warning: missing </b> before <div>
## line 1 column 164 - Warning: inserting implicit <span>
## line 1 column 164 - Warning: missing </span>
## line 1 column 159 - Warning: missing </div>
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: <span> anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!

Testing Options

opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)

txt <- "<html>
<head>
      <style>
        p { color: red; }
      </style>
    <body>
          <!-- ===== body ====== -->
         <p>Test</p>

    </body>
        <!--Default Zone
        -->
        <!--Default Zone End-->
</html>"

cat(tidy_html(txt, options=opts))
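To see what a single option contributes, one approach is to tidy the same input twice and compare the results. The sketch below does that for `TidyHideComments` on a made-up snippet; it assumes comments are kept when the option is not set.

```r
library(htmltidy)

# Made-up snippet containing one HTML comment
snippet <- "<html><head><title>t</title></head><body><!-- a comment --><p>Test</p></body></html>"

# Tidy once without the option and once with it switched on
with_comments    <- tidy_html(snippet, list(TidyDocType = "html5"))
without_comments <- tidy_html(snippet, list(TidyDocType = "html5", TidyHideComments = TRUE))

# Is a comment marker still present in each result?
grepl("<!--", with_comments, fixed = TRUE)
grepl("<!--", without_comments, fixed = TRUE)
```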

But, you're probably better off running it on plain HTML source.

Since it's C/C++-backed, it's pretty fast:

book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
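A single `system.time()` call can be noisy, so if you want a steadier number you could time several runs and take the median. The sketch below uses a synthetic document built locally (made-up content, so it needs no network access); your timings will differ from the figure quoted above.

```r
library(htmltidy)

# Build a synthetic, deliberately sloppy document (unclosed <td> and <tr>)
rows <- paste0("<tr><td>", seq_len(5000), "<td>row", collapse = "\n")
big  <- paste0("<html><body><table>", rows, "</table></body></html>")
nchar(big)

# Time ten runs and summarise elapsed time in milliseconds
elapsed_ms <- replicate(10, system.time(tidy_html(big))[["elapsed"]]) * 1000
median(elapsed_ms)
```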

htmltidy Metrics

cloc::cloc_pkg_md()

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



hrbrmstr/htmltidy documentation built on Aug. 16, 2022, 4:39 p.m.