tidyHTML | R Documentation |
This function processes an HTML document and tidies malformed markup so that it is legitimate HTML, i.e. with closing tags (</li>, </p>) and attribute values enclosed in quotes. It also corrects the HTML in various other ways.
The resulting document has a more correct structure, which, for example, makes processing it with the XML parsing facilities more straightforward.
This uses libtidy from http://tidy.sourceforge.net
tidyHTML(doc, asXHTML = FALSE,
         asText = inherits(doc, "AsIs") ||
                    (!file.exists(doc) && length(grep("\\<", doc))),
         size = nchar(doc)*1.2, withErrors = FALSE)
doc |
the name of the file containing the HTML document or the contents of the HTML itself. |
asXHTML |
a logical value controlling whether the result is output as XHTML. |
asText |
a logical value indicating whether the value of doc is the HTML content itself (TRUE) or the name of a file (FALSE). |
size |
an integer scalar giving an estimate of the size of the tidied document; the default is 1.2 times the number of characters in doc. |
withErrors |
a logical value controlling whether a string describing the errors in the input document is also returned. |
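The default for asText can be hard to read in the usage line, so here is a minimal base R sketch of the same expression. The helper name default_as_text is hypothetical and is not part of the package; it only copies the default shown in the usage. One thing worth noting, as far as R's default regular expression engine is concerned: "\\<" is a start-of-word anchor rather than a literal "<", so a path to a file that does not exist is also treated as text.

```r
# Hypothetical helper: re-evaluates the documented default for asText.
default_as_text <- function(doc) {
  inherits(doc, "AsIs") ||
    (!file.exists(doc) && length(grep("\\<", doc)))
}

# A string wrapped in I() has class "AsIs", so it is always treated as text.
as_is <- default_as_text(I("<p>hello"))

# A string that is not an existing file and contains a word start is text.
plain_text <- default_as_text("<ul><li>one<li>two</ul>")

# An existing file is treated as a file name, not as text.
tmp <- tempfile(fileext = ".html")
writeLines("<p>hello", tmp)
existing_file <- default_as_text(tmp)
unlink(tmp)

# A path that does not exist falls through to the text branch.
missing_file <- default_as_text(tempfile(fileext = ".html"))

print(c(as_is = as_is, plain_text = plain_text,
        existing_file = existing_file, missing_file = missing_file))
```

Passing asText explicitly, or wrapping literal HTML in I(), avoids relying on this guess.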
If withErrors is TRUE, a list with two elements named doc and errors, both of which are scalar strings.
If withErrors is FALSE, a character string containing the tidied document's contents.
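As an illustration of the two return shapes, the sketch below requests the error report and then passes the tidied text to htmlParse. It assumes that both the RTidyHTML and XML packages are installed and that tidyHTML is exported from RTidyHTML; the sample HTML string is made up for the example, and the code is skipped when either package is missing.

```r
# Sketch only: runs when RTidyHTML and XML are both available.
if (requireNamespace("RTidyHTML", quietly = TRUE) &&
    requireNamespace("XML", quietly = TRUE)) {
  # I() marks the string as "AsIs", so the asText default treats it as content.
  html <- I("<html><body><ul><li>one<li>two</ul><p class=note>open</body></html>")

  # With withErrors = TRUE the result is a list with elements doc and errors.
  res <- RTidyHTML::tidyHTML(html, withErrors = TRUE)
  cat(res$errors, "\n")

  # The tidied markup is a single string, so parse it as text.
  tree <- XML::htmlParse(res$doc, asText = TRUE)
  print(tree)
}
```

With withErrors = FALSE the same call would return only the tidied string, which can be passed to htmlParse directly.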
Duncan Temple Lang
htmlParse
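# Tidy an HTML file shipped with the package.
doc = system.file("testData", "foo.html", package = "RTidyHTML")
tidyHTML(doc)
# Tidy HTML text: collapse the lines into one string so that doc is a scalar.
txt = paste(readLines(url("http://www.omegahat.org")), collapse = "\n")
tidyHTML(txt)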