xmlr

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(xmlr)

Xmlr is a document object model of XML written in base-R. It is similar in functionality to the XML package but does not have any dependencies or require compilation. Xmlr allows you to read, write and manipulate XML.

Why xmlr?

I had problems on one of my machines to install the XML package (some gcc issue) so needed an alternative. As I have been thinking about doing something more comprehensive with Reference Classes I decided to create a pure base-R DOM model with some simple ways to do input and output to strings and files. As it turned out to be useful to me, I thought it might be for others as well, so I decided to open source and publish it.

Creating the XML tree

There are 2 ways to create an XML tree, by parsing XML text or by creating it programmatically.

Parsing

Xmlr supports two ways of parsing text: either from a character string or from a xml file. To parse a string you do something like:

doc <- parse.xmlstring("
<table xmlns='http://www.w3.org/TR/html4/'>
    <tr>
        <td class='fruit'>Apples</td>
        <td class='fruit'>Bananas</td>
    </tr>
</table>")

...and similarly to parse a XML file you do:

doc <- parse.xmlfile("pom.xml")

Programmatic creation

To create the following xml

<table xmlns='http://www.w3.org/TR/html4/'>
    <tr>
        <td class='fruit'>Apples</td>
        <td class='fruit'>Bananas</td>
    </tr>
</table>

...you could do something like this:

doc <- Document$new()
root <- Element$new("table")
root$setAttribute("xmlns", "http://www.w3.org/TR/html4/")
doc$setRootElement(root)

root$addContent(
Element$new("tr")
  $addContent(
    Element$new("td")$setAttribute("class", "fruit")$setText("Apples")
  )
  $addContent(
    Element$new("td")$setAttribute("class", "fruit")$setText("Bananas")
  )
)
# print it out just to show the content
print(doc)

Navigating the tree

As XML is a hierarchical model it is very well suited for an Object based implementation. The top-level class is the Document class. It contains a reference to the root element that you get through the getRootElement() method.

Elements contain attributes and child content (more elements or a Text node). To get the text of the second td element (tag) you would something like this:

doc$getRootElement()$getChild("tr")$getChildren()[[2]]$getText()

This is useful when you have a more complex structure, and you want to create a data.frame of a part of the xml. e.g.

groceriesDf <- NULL
for ( child in doc$getRootElement()$getChild("tr")$getChildren() ) {
  row <- list()
  row[["class"]] <- child$getAttribute("class")
  row[["item"]] <- child$getText()
  if (is.null(groceriesDf)) {
    groceriesDf <- data.frame(row, stringsAsFactors = FALSE)
  } else {
    groceriesDf <- rbind(groceriesDf, row)
  }
}
groceriesDf

You can also move upwards in the tree by using the getParent() function which exists for all Content (Element and Text nodes).

Outputting

As you saw in the example above the default print will render the object model as an XML string. Print is available for each element and will print all the content in and below this element e.g.

# print the contents of the tr element 
root$getChild("tr")

The object model

The xmlr object model is very simple, you basically only need to care about the Document and the Element classes. In general all Objects has a toString() function that returns the textual XML representation of the object. as.character and as.vector are also specified for each class allowing you to do stuff like:

paste("The root element is", root)

Which is the same as doing paste("The root element is", root$toString())

Document

As mentioned before the most useful thing to do with the Document object is to set or get the rootElement. In a future version you might also be interested in processing instructions etc. but those are not supported yet.

Element

An Element has three "interests", name, attributes and child nodes. text is handled in a special way (there is a Text class) since the form <foo>Som text here</foo> is very common.

Name

This set the name of the Element (the tag) e.g. to create the element you can either do e <- Element$new("foo") or use the setName(name) function:

e <- Element$new()
e$setName("foo")

To get the Element name you use the function getName()

Attributes

The following functions are available for attributes:

Content

Content can be either other Element(s) or Text.

For the xml

<foo>
  <Bar note='Some text'</Bar>
  <Baz note='More stuff'</Baz>
</foo>

...you can do something like this:

e <- Element$new("foo")$addContent(
  Element$new("Bar")$setAttribute("note", "Some text")
)$addContent(
  Element$new("Baz")$setAttribute("note", "More stuff")
)
e

To retrieve content there are several ways:

Text

As mentioned above Text is treated a bit special. You typically use the setText(text) function to set text content e.g. to create <car>Volvo</car> you would do:

e <- Element$new("car")$setText("Volvo")

However, for more complex cases it is possible to mix text and elements using addContent(). E.g. for the xml <cars>Volvo<value sek='200000'></value></cars>

You can create it by creating and adding a Text node explicitly rather than using the setText function.

  xml <- "<car>Volvo<value sek='200000'></value></car>"
  e <- Element$new("car")
  e$addContent(Text$new("Volvo"))
  e$addContent(Element$new("value")$setAttribute("sek", "200000"))
  stopifnot(e$toString() == xml)

To check if there is any text defined for this element you can use the function hasText()

Namespaces

Namespaces is handled in a very simplistic way in xmlr. You declare the namespace as an attribute, and you maintain the correct prefix "manually" by prefixing the element name e.g:

parent <- Element$new("parent")
parent <- Element$new("parent")$setAttribute("xmlns:env", "http://some.namespace.com")
parent$addContent(Element$new("env:child"))
parent

Other useful stuff

There is a simplistic way to produce a dataframe from a XML tree: xmlrToDataFrame(element) it can be used (similar to the xmlToDataFrame in the classic XML package) as follows:

xml <- "
<groceries>
  <item type='fruit' number='4'>Apples</item>
  <item type='fruit' number='2'>Bananas</item>
  <item type='vegetables'number='6'>Tomatoes</item>
</groceries>
"
doc <- parse.xmlstring(xml)
xmldf <- xmlrToDataFrame(doc$getRootElement())
xmldf

If you have a more complex xml structure you need to build your data frame "by hand" e.g. in a similar fashion to what was outlined in the "Navigating the tree" section.

Misc

There are some general purpose classes and functions that might be useful if you want to extend or customize xmlr:

isRc(Element$new("Hello"), "Element")
isRc(Text$new("Hello"), "Element")


Try the xmlr package in your browser

Any scripts or data that you put into this service are public.

xmlr documentation built on July 2, 2020, 2:42 a.m.