tidyHTML: Tidy HTML content

tidyHTML    R Documentation

Tidy HTML content

Description

This function processes an HTML document and tidies its malformed nodes so that they are legitimate HTML, i.e. with end tags (</li>, </p>) and attribute values enclosed in quotes. It also corrects the HTML in various other ways.

The resulting document has a more correct structure, which makes it more straightforward to process, for example with the XML parsing facilities.

This uses libtidy from http://tidy.sourceforge.net

Usage

tidyHTML(doc, asXHTML = FALSE, 
         asText = inherits(doc, "AsIs") ||
                    (!file.exists(doc) && length(grep("\\<", doc))),
         size = nchar(doc)*1.2, withErrors = FALSE)

Arguments

doc

the name of the file containing the HTML document or the contents of the HTML itself.

asXHTML

a logical value controlling whether the result is output as XHTML.

asText

a logical value indicating whether the value of doc is the HTML content itself (TRUE) or the name of a file (FALSE).

size

an integer scalar giving a guess of the size of the resulting tidied document. The default is 1.2 times the number of characters in doc.

withErrors

a logical value controlling whether a string describing the errors in the input document is also returned.
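
The defaults for asText and size in the Usage signature can be tried out on their own in base R. The following is a minimal sketch that copies those two default expressions into standalone helpers; guess_as_text() and guess_size() are hypothetical names used only for illustration and are not part of RTidyHTML.

```r
# Hypothetical helpers mirroring the default expressions from Usage.
guess_as_text <- function(doc) {
  inherits(doc, "AsIs") ||
    (!file.exists(doc) && length(grep("\\<", doc)))
}

guess_size <- function(doc) nchar(doc) * 1.2

html <- "<p>hi</p>"

# Write the same content to a temporary file so that a real path exists.
path <- tempfile(fileext = ".html")
writeLines(html, path)

guess_as_text(I(html))  # wrapped in I(): treated as HTML content
guess_as_text(html)     # not an existing file: treated as HTML content
guess_as_text(path)     # an existing file: treated as a file name
guess_size(html)        # 1.2 times the number of characters
```

Wrapping the content in I() is the most explicit way to mark a string as HTML rather than a file name, since it bypasses the file-existence check entirely.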

Value

If withErrors is TRUE, a list with two elements named doc and errors, both of which are scalar strings.

If withErrors is FALSE, a character string containing the tidied document's contents.
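
Because the shape of the result depends on withErrors, calling code may want to normalize it. The sketch below assumes only the two return shapes described above; extract_tidied() is a hypothetical helper, and the values passed to it are mock placeholders standing in for real tidyHTML() results.

```r
# Hypothetical helper: always return a list with doc and errors.
extract_tidied <- function(res) {
  if (is.list(res)) {
    # Shape described for withErrors = TRUE
    list(doc = res$doc, errors = res$errors)
  } else {
    # Shape described for withErrors = FALSE: a single character string
    list(doc = res, errors = NA_character_)
  }
}

# Mock values, not actual output from libtidy.
with_errors    <- list(doc = "<p>hi</p>", errors = "placeholder diagnostics")
without_errors <- "<p>hi</p>"

extract_tidied(with_errors)$errors
extract_tidied(without_errors)$doc
```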

Author(s)

Duncan Temple Lang

References

http://tidy.sourceforge.net

See Also

htmlParse

Examples

 doc = system.file("testData", "foo.html", package = "RTidyHTML")
 tidyHTML(doc)

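 # readLines() returns one element per line; collapse them into a single
 # string, since the default asText and size expressions assume a scalar.
 txt = paste(readLines(url("http://www.omegahat.org")), collapse = "\n")
 tidyHTML(txt)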

omegahat/RTidyHTML documentation built on Nov. 29, 2023, 12:42 a.m.