knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(xmlr)
Xmlr is a document object model of XML written in base-R. It is similar in functionality to the XML package but does not have any dependencies or require compilation. Xmlr allows you to read, write and manipulate XML.
I had problems on one of my machines to install the XML package (some gcc issue) so needed an alternative. As I have been thinking about doing something more comprehensive with Reference Classes I decided to create a pure base-R DOM model with some simple ways to do input and output to strings and files. As it turned out to be useful to me, I thought it might be for others as well, so I decided to open source and publish it.
There are 2 ways to create an XML tree, by parsing XML text or by creating it programmatically.
Xmlr supports two ways of parsing text: either from a character string or from a xml file. To parse a string you do something like:
doc <- parse.xmlstring(" <table xmlns='http://www.w3.org/TR/html4/'> <tr> <td class='fruit'>Apples</td> <td class='fruit'>Bananas</td> </tr> </table>")
...and similarly to parse a XML file you do:
doc <- parse.xmlfile("pom.xml")
To create the following xml
<table xmlns='http://www.w3.org/TR/html4/'> <tr> <td class='fruit'>Apples</td> <td class='fruit'>Bananas</td> </tr> </table>
...you could do something like this:
doc <- Document$new() root <- Element$new("table") root$setAttribute("xmlns", "http://www.w3.org/TR/html4/") doc$setRootElement(root) root$addContent( Element$new("tr") $addContent( Element$new("td")$setAttribute("class", "fruit")$setText("Apples") ) $addContent( Element$new("td")$setAttribute("class", "fruit")$setText("Bananas") ) ) # print it out just to show the content print(doc)
As XML is a hierarchical model it is very well suited for an Object based implementation. The top-level class is the Document class. It contains a reference to the root element that you get through the getRootElement()
method.
Elements contain attributes and child content (more elements or a Text node). To get the text of the second td element (tag) you would something like this:
doc$getRootElement()$getChild("tr")$getChildren()[[2]]$getText()
This is useful when you have a more complex structure, and you want to create a data.frame of a part of the xml. e.g.
groceriesDf <- NULL for ( child in doc$getRootElement()$getChild("tr")$getChildren() ) { row <- list() row[["class"]] <- child$getAttribute("class") row[["item"]] <- child$getText() if (is.null(groceriesDf)) { groceriesDf <- data.frame(row, stringsAsFactors = FALSE) } else { groceriesDf <- rbind(groceriesDf, row) } } groceriesDf
You can also move upwards in the tree by using the getParent() function which exists for all Content (Element and Text nodes).
As you saw in the example above the default print will render the object model as an XML string. Print is available for each element and will print all the content in and below this element e.g.
# print the contents of the tr element root$getChild("tr")
The xmlr object model is very simple, you basically only need to care about the Document and the Element classes. In general all Objects has a toString() function that returns the textual XML representation of the object. as.character and as.vector are also specified for each class allowing you to do stuff like:
paste("The root element is", root)
Which is the same as doing paste("The root element is", root$toString())
As mentioned before the most useful thing to do with the Document object is to set or get the rootElement. In a future version you might also be interested in processing instructions etc. but those are not supported yet.
An Element has three "interests", name, attributes and child nodes. text is handled in a special way (there is a Text class) since the form <foo>Som text here</foo>
is very common.
This set the name of the Element (the tag) e.g. to create the element e <- Element$new("foo")
or use the setName(name) function:
e <- Element$new() e$setName("foo")
To get the Element name you use the function getName()
The following functions are available for attributes:
child$getAttribute("class")
child$getAttributes()[["class"]]
or use in a loop or apply etc.root$setAttribute("xmlns", "http://www.w3.org/TR/html4/")
. Content can be either other Element(s) or Text.
For the xml
<foo> <Bar note='Some text'</Bar> <Baz note='More stuff'</Baz> </foo>
...you can do something like this:
e <- Element$new("foo")$addContent( Element$new("Bar")$setAttribute("note", "Some text") )$addContent( Element$new("Baz")$setAttribute("note", "More stuff") ) e
To retrieve content there are several ways:
As mentioned above Text is treated a bit special.
You typically use the setText(text) function to set text content e.g. to create <car>Volvo</car>
you would do:
e <- Element$new("car")$setText("Volvo")
However, for more complex cases it is possible to mix text and elements using addContent(). E.g. for the xml
<cars>Volvo<value sek='200000'></value></cars>
You can create it by creating and adding a Text node explicitly rather than using the setText function.
xml <- "<car>Volvo<value sek='200000'></value></car>" e <- Element$new("car") e$addContent(Text$new("Volvo")) e$addContent(Element$new("value")$setAttribute("sek", "200000")) stopifnot(e$toString() == xml)
To check if there is any text defined for this element you can use the function hasText()
Namespaces is handled in a very simplistic way in xmlr. You declare the namespace as an attribute, and you maintain the correct prefix "manually" by prefixing the element name e.g:
parent <- Element$new("parent") parent <- Element$new("parent")$setAttribute("xmlns:env", "http://some.namespace.com") parent$addContent(Element$new("env:child")) parent
There is a simplistic way to produce a dataframe from a XML tree: xmlrToDataFrame(element) it can be used (similar to the xmlToDataFrame in the classic XML package) as follows:
xml <- " <groceries> <item type='fruit' number='4'>Apples</item> <item type='fruit' number='2'>Bananas</item> <item type='vegetables'number='6'>Tomatoes</item> </groceries> " doc <- parse.xmlstring(xml) xmldf <- xmlrToDataFrame(doc$getRootElement()) xmldf
If you have a more complex xml structure you need to build your data frame "by hand" e.g. in a similar fashion to what was outlined in the "Navigating the tree" section.
There are some general purpose classes and functions that might be useful if you want to extend or customize xmlr:
isRc(Element$new("Hello"), "Element") isRc(Text$new("Hello"), "Element")
Stack This is a general purpose linked stack with the expected functions i.e.
Parser A general purpose, SAX like, parser that you could create your own builder for. A builder needs to have the following functions (See DomBuilder for an example):
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.