internal-projects/07_xml-wrangling.md

Ways we have parsed XML

Jenny Bryan r format(Sys.time(), '%d %B, %Y') Internal notes on some of the ways we have taken XML from a feed and turned it into a data.frame. I'm sick of rediscovering this stuff and don't really know how to effectively search back in time in a git repo.

suppressPackageStartupMessages(library("dplyr"))
library("XML")
library("xml2")

Example data

The examples below all use a bit of XML that's a simplified snippet of what we get back from the spreadsheets feed. It's in the file spreadsheets_feed.xml.

Get out of XML fast

Early on we struggled to not shoot ourselves in the foot with XML and the ease with which a user can create a memory leak. We got skittish, threw up our hands, turned the XML into a list, and did stuff in R. For posterity's sake, here's an example of this sort of workflow.

In hindsight, this is not a great idea. It actually leads to more data wrangling, believe it or not, and is slow. That does not matter for the spreadsheets feed but does matter for the cell feed.

## turn XML into a list, using a function from the XML package, and proceed
ssf <- XML::xmlParse("spreadsheets_feed.xml")
ssf_list <- xmlToList(ssf)

## each node named 'entry' = info on an individual sheet
entries <- ssf_list[grep("entry", names(ssf_list))]

## helper function to grab all the links for an entry and turn into a one-row
## data_frame with 3 link variables
wrangle_links <- function(x) {
  links <- x[(grep("link", names(x)))]
  links <- do.call(cbind, unname(links))
  links <- setNames(links["href", , drop = TRUE],
                    links["rel", , drop = TRUE])
  names(links) <-
    gsub("http://schemas.google.com/spreadsheets/2006#worksheetsfeed",
         "ws_feed", names(links))
  as_data_frame(as.list(links))
}

## create data_frame of links for all sheets
links <- plyr::ldply(entries, wrangle_links, .id = NULL)

## create data_frame of info for all sheets
rlist_sheet_df <- dplyr::data_frame(
  sheet_title = plyr::laply(entries, function(x) x$title$text),
  sheet_key = plyr::laply(entries, `[[`, "id") %>% basename,
  owner = plyr::laply(entries, function(x) x$author$name),
  ws_feed = links$ws_feed,
  alternate = links$alternate,
  self = links$self
)
rlist_sheet_df %>% glimpse
## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...
#rlist_sheet_df %>% View

Use the XML package

As we got more experience with XML and XPath, in particular, we got much better at querying XML. Here's how I would do the above with the XML package today.

Main lesson: XPath is very powerful. Extremely helpful for reaching down into nooks and crannies of the XML and grabbing a specific thing (usually value or attribute) based on specific criteria (e.g. node name, attribute presence and value). If you always isolate a node set using XPath, then immediately grab the value or attribute you need, you can ignore the scary memory management stuff.

## leave as XML and use XML package
ssf <- XML::xmlParse("spreadsheets_feed.xml")

## namespace fiddliness
rig_namespace <- function(xml_node) {
  ns <- xml_node %>% xmlNamespaceDefinitions(simplify = TRUE)
  names(ns)[1] <- "feed"
  ns
}
ns <- rig_namespace(ssf)

XML_sheet_df <- dplyr::data_frame(
  sheet_title = xpathSApply(ssf, "//feed:entry//feed:title", xmlValue,
                            namespaces = ns),
  sheet_key = xpathSApply(ssf, "//feed:entry//feed:id", xmlValue,
                          namespaces = ns) %>% basename,
  owner = xpathSApply(ssf, "//feed:entry//feed:author/feed:name", xmlValue,
                      namespaces = ns),
  ws_feed =
    xpathSApply(ssf,
                "//feed:entry//feed:link[contains(@rel,'2006#worksheetsfeed')]",
                xmlGetAttr, "href", namespaces = ns),
  alternate = xpathSApply(ssf, "//feed:entry//feed:link[@rel='alternate']",
                          xmlGetAttr, "href", namespaces = ns),
  self = xpathSApply(ssf, "//feed:entry//feed:link[@rel='self']",
                          xmlGetAttr, "href", namespaces = ns)
)
XML_sheet_df %>% glimpse
## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...

Note that we are getting exactly the same result as above.

identical(rlist_sheet_df, XML_sheet_df)
## [1] TRUE

Some links on the XML package. There is a lot of documentation, in some sense, but it's not easy to find and navigate.

Use the xml2 package

We've been following development of the xml2 package with great interest, with the plan to switch over and return to proper XML parsing (vs. converting to an R list). Here's how we do the above using xml2.

## leave as XML and use xml2 package
ssf <- read_xml("spreadsheets_feed.xml")
ns <- xml_ns_rename(xml_ns(ssf), d1 = "feed")

xml2_sheet_df <- dplyr::data_frame(
  sheet_title =
    ssf %>% xml_find_all("//feed:entry//feed:title", ns) %>% xml_text,
  sheet_key =
    ssf %>% xml_find_all("//feed:entry//feed:id", ns) %>% xml_text %>%
    basename,
  owner =
    ssf %>% xml_find_all("//feed:entry//feed:author/feed:name", ns) %>%
    xml_text,
  ws_feed =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[contains(@rel, '2006#worksheetsfeed')]", ns) %>%
    xml_attr("href"),
  alternate =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[@rel='alternate']", ns) %>%
    xml_attr("href"),
  self =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[@rel='self']", ns) %>%
    xml_attr("href")
)
xml2_sheet_df %>% glimpse
## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...

Note that we are getting exactly the same result as above.

identical(xml2_sheet_df, XML_sheet_df)
## [1] TRUE

Some more links I want to park somewhere

title: "07_xml-wrangling.R" author: "jenny" date: "Tue Apr 28 14:03:35 2015"



jennybc/googlesheets documentation built on Feb. 8, 2022, 11:48 p.m.