googlesheets: Manage Google Spreadsheets from R

Ways we have parsed XML

Jenny Bryan r format(Sys.time(), '%d %B, %Y') Internal notes on some of the ways we have taken XML from a feed and turned it into a data.frame. I'm sick of rediscovering this stuff and don't really know how to effectively search back in time in a git repo.

suppressPackageStartupMessages(library("dplyr"))
library("XML")
library("xml2")

The examples below all use a bit of XML that's a simplified snippet of what we get back from the spreadsheets feed. It's in the file spreadsheets_feed.xml.

Early on we struggled to not shoot ourselves in the foot with XML and the ease with which a user can create a memory leak. We got skittish, threw up our hands, turned the XML into a list, and did stuff in R. For posterity's sake, here's an example of this sort of workflow.

In hindsight, this is not a great idea. It actually leads to more data wrangling, believe it or not, and is slow. That does not matter for the spreadsheets feed but does matter for the cell feed.

## turn XML into a list, using a function from the XML package, and proceed
ssf <- XML::xmlParse("spreadsheets_feed.xml")
ssf_list <- xmlToList(ssf)

## each node named 'entry' = info on an individual sheet
entries <- ssf_list[grep("entry", names(ssf_list))]

## helper function to grab all the links for an entry and turn into a one-row
## data_frame with 3 link variables
wrangle_links <- function(x) {
  links <- x[(grep("link", names(x)))]
  links <- do.call(cbind, unname(links))
  links <- setNames(links["href", , drop = TRUE],
                    links["rel", , drop = TRUE])
  names(links) <-
    gsub("http://schemas.google.com/spreadsheets/2006#worksheetsfeed",
         "ws_feed", names(links))
  as_data_frame(as.list(links))
}

## create data_frame of links for all sheets
links <- plyr::ldply(entries, wrangle_links, .id = NULL)

## create data_frame of info for all sheets
rlist_sheet_df <- dplyr::data_frame(
  sheet_title = plyr::laply(entries, function(x) x$title$text),
  sheet_key = plyr::laply(entries, `[[`, "id") %>% basename,
  owner = plyr::laply(entries, function(x) x$author$name),
  ws_feed = links$ws_feed,
  alternate = links$alternate,
  self = links$self
)
rlist_sheet_df %>% glimpse

## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...

#rlist_sheet_df %>% View

As we got more experience with XML and XPath, in particular, we got much better at querying XML. Here's how I would do the above with the XML package today.

Main lesson: XPath is very powerful. Extremely helpful for reaching down into nooks and crannies of the XML and grabbing a specific thing (usually value or attribute) based on specific criteria (e.g. node name, attribute presence and value). If you always isolate a node set using XPath, then immediately grab the value or attribute you need, you can ignore the scary memory management stuff.

## leave as XML and use XML package
ssf <- XML::xmlParse("spreadsheets_feed.xml")

## namespace fiddliness
rig_namespace <- function(xml_node) {
  ns <- xml_node %>% xmlNamespaceDefinitions(simplify = TRUE)
  names(ns)[1] <- "feed"
  ns
}
ns <- rig_namespace(ssf)

XML_sheet_df <- dplyr::data_frame(
  sheet_title = xpathSApply(ssf, "//feed:entry//feed:title", xmlValue,
                            namespaces = ns),
  sheet_key = xpathSApply(ssf, "//feed:entry//feed:id", xmlValue,
                          namespaces = ns) %>% basename,
  owner = xpathSApply(ssf, "//feed:entry//feed:author/feed:name", xmlValue,
                      namespaces = ns),
  ws_feed =
    xpathSApply(ssf,
                "//feed:entry//feed:link[contains(@rel,'2006#worksheetsfeed')]",
                xmlGetAttr, "href", namespaces = ns),
  alternate = xpathSApply(ssf, "//feed:entry//feed:link[@rel='alternate']",
                          xmlGetAttr, "href", namespaces = ns),
  self = xpathSApply(ssf, "//feed:entry//feed:link[@rel='self']",
                          xmlGetAttr, "href", namespaces = ns)
)
XML_sheet_df %>% glimpse

## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...

Note that we are getting exactly the same result as above.

identical(rlist_sheet_df, XML_sheet_df)

## [1] TRUE

Some links on the XML package. There is a lot of documentation, in some sense, but it's not easy to find and navigate.

CRAN homepage
homepage for the package, as indicated on CRAN

We've been following development of the xml2 package with great interest, with the plan to switch over and return to proper XML parsing (vs. converting to an R list). Here's how we do the above using xml2.

## leave as XML and use xml2 package
ssf <- read_xml("spreadsheets_feed.xml")
ns <- xml_ns_rename(xml_ns(ssf), d1 = "feed")

xml2_sheet_df <- dplyr::data_frame(
  sheet_title =
    ssf %>% xml_find_all("//feed:entry//feed:title", ns) %>% xml_text,
  sheet_key =
    ssf %>% xml_find_all("//feed:entry//feed:id", ns) %>% xml_text %>%
    basename,
  owner =
    ssf %>% xml_find_all("//feed:entry//feed:author/feed:name", ns) %>%
    xml_text,
  ws_feed =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[contains(@rel, '2006#worksheetsfeed')]", ns) %>%
    xml_attr("href"),
  alternate =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[@rel='alternate']", ns) %>%
    xml_attr("href"),
  self =
    ssf %>% xml_find_all(
      "//feed:entry//feed:link[@rel='self']", ns) %>%
    xml_attr("href")
)
xml2_sheet_df %>% glimpse

## Observations: 2
## Variables:
## $ sheet_title (chr) "Ari's Anchor Text Scraper", "#rhizo15 #tw"
## $ sheet_key   (chr) "tQKSYVRwBXssUfYEaMdt-aw", "1oBQNnsMY8Qkuui6BAE8Tn...
## $ owner       (chr) "anahmani", "m.hawksey"
## $ ws_feed     (chr) "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate   (chr) "https://spreadsheets.google.com/ccc?key=0Av8m6X4c...
## $ self        (chr) "https://spreadsheets.google.com/feeds/spreadsheet...

Note that we are getting exactly the same result as above.

identical(xml2_sheet_df, XML_sheet_df)

## [1] TRUE

Some more links I want to park somewhere

I got some help from Hadley on XML-related wranging in this issue thread from xml2
Gaston Sanchez has a nice series of slides on dealing with data from the web.
- http://gastonsanchez.com/work/webdata/
- basics of XML and HTML -- PDF slides
- parsing XML and HTML content -- PDF slides

title: "07_xml-wrangling.R" author: "jenny" date: "Tue Apr 28 14:03:35 2015"

jennybc/googlesheets documentation built on Feb. 8, 2022, 11:48 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com