suppressPackageStartupMessages(library("dplyr")) library("xml2") library("stringr") ## I want to use unexported functions here devtools::load_all(pkg = "~/research/googlesheets/") #library("googlesheets")
if(!file.exists(".httr-oauth")) { ## look for .httr-oauth in pwd (assuming pwd is googlesheets) or one level up ## (assuming pwd is googlesheets/data-for-demo pwd <- getwd() one_up <- pwd %>% dirname() HTTR_OAUTH <- c(one_up, pwd) %>% file.path(".httr-oauth") HTTR_OAUTH <- HTTR_OAUTH[HTTR_OAUTH %>% file.exists()] if(length(HTTR_OAUTH) > 0) { HTTR_OAUTH <- HTTR_OAUTH[1] file.copy(from = HTTR_OAUTH, to = ".httr-oauth", overwrite = TRUE) } }
We need better documentation of what we can get our hands on via the different feeds. Here we go.
entry
nodes are where it's atThe most useful info from the spreadsheet feed consists of the entry
nodes, one per sheet.
the_url <- "https://spreadsheets.google.com/feeds/spreadsheets/private/full" req <- gsheets_GET(the_url) ns <- xml_ns_rename(xml_ns(req$content), d1 = "feed") entries <- req$content %>% xml_find_all(".//feed:entry", ns) %>% xml_path() length(entries) req$content %>% xml_find_first(entries[1])
The entry
nodes have same structure for each sheet, which we explore via the first entry = sheet. What is all this stuff?
id
is a URL.updated
is date-time of last update (not clear exactly what that means)category
seems utterly useless to me.title/text
and content/text
both give the sheet's title. Below we confirm they are redundant.author/name
and author/email
are self-explanatory.Is the info in title/text
identical to that in content/text
?
title_stuff <- data_frame(text = req$content %>% xml_find_all(".//feed:entry//feed:title", ns) %>% xml_text(), content = req$content %>% xml_find_all(".//feed:entry//feed:content", ns) %>% xml_text()) title_stuff with(title_stuff, identical(text, content))
YES. At least for this set of sheets.
Let's set the names for entries
to the sheet titles.
names(entries) <- title_stuff$text
Each entry
node has an id
element containing a URL plus 3 additional nodes named link
. I gather all 4 into a tbl_df
for systematic exploration
jfun <- function(x) { # gymnastics required for one sheet's worth of links x <- req$content %>% xml_find_first(x) links <- x %>% xml_find_all("feed:link", ns) %>% lapply(xml_attrs) %>% lapply(as.list) %>% lapply(as_data_frame) %>% bind_rows() %>% mutate(source = "content/entry/link") links %>% rbind(data.frame(rel = NA, type = NA, href = x %>% xml_find_first("feed:id", ns) %>% xml_text(), source = "content/entry/id")) %>% mutate(feed = "ss", sheet_title = x %>% xml_find_first("feed:title", ns) %>% xml_text()) %>% select(sheet_title, feed, source, href, rel, type) } links <- entries %>% lapply(jfun) %>% bind_rows()
id
links the same?links %>% filter(rel == "self" | source == "content/entry/id") %>% group_by(sheet_title) %>% summarize(query = n_distinct(href) == 1) %>% `[[`("query") %>% all
YES, they are exactly the same, at least for these sheets.
links %>% filter(rel == "self") %>% `[[`("href") %>% str_split_fixed("//*", n = 7)
Here's what I see:
https://spreadsheets.google.com/feeds/spreadsheets/private/full/KEY
I happen to know that the worksheets feed is the link with attribute "rel"
equal to http://schemas.google.com/spreadsheets/2006#worksheetsfeed
.
links %>% filter(str_detect(rel, "2006#worksheetsfeed")) %>% `[[`("href") %>% str_split_fixed("//*", n = 7)
Here's what I see:
https://spreadsheets.google.com/feeds/worksheets/KEY/VISIBILITY/FOO
where VISIBILITY
always equals private
when URL comes from the spreadsheets feed and FOO
is values
when user has only read permission and full
when user
is also allowed to write.
Note: I arranged the rows here for clarity.
links %>% filter(rel == "alternate") %>% arrange(href) %>% `[[`("href") %>% str_split_fixed("//*", n = 6)
We've got more variety here, due to old sheets vs new. Note that, in addition to the obvious difference in domain and overall URL structure, the old sheets present an alternative key in the "alternate" link (verified explicitly below).
Here's what I see:
new Sheets: https://docs.google.com/spreadsheets/d/KEY/edit old Sheets: "https://spreadsheets.google.com/ccc?key=ALT_KEY"
id = "self" link: https://spreadsheets.google.com/feeds/spreadsheets/VISIBILITY/PROJECTION/KEY VISIBILITY = {private (always in spreadsheets feed?), public} PROJECTION = {full (default), basic} worksheets feed: https://spreadsheets.google.com/feeds/worksheets/KEY/VISIBILITY/PERMISSION VISIBILITY = {private (always in spreadsheets feed?), public} PERMISSION = {full (read and write), values (read only)} "alternate" link, new sheets: https://docs.google.com/spreadsheets/d/KEY/edit "alternate" link, old sheets: https://spreadsheets.google.com/ccc?key=ALT_KEY
Create new variables to hold info about whether a sheet is old or new and whether current user is allowed to read only or to read and write.
version_df <- links %>% filter(rel == "alternate") %>% transmute(sheet_title, version = ifelse(grepl("^https://docs.google.com/spreadsheets/d", href), "new", "old")) perm_df <- links %>% filter(str_detect(rel, "2006#worksheetsfeed")) %>% transmute(sheet_title, perm = ifelse(grepl("values", href), "r", "rw")) links <- links %>% left_join(version_df) %>% left_join(perm_df)
Store the keys in these links as a variable.
links <- links %>% mutate(link_key = extract_key_from_url(href))
Hypothesis: all link keys are uniform for a new sheet ("self" = id
agrees with worksheets feed agrees with "alternate").
links %>% filter(version == "new") %>% group_by(sheet_title) %>% summarize(query = n_distinct(link_key) == 1) %>% `[[`("query") %>% all
Hypothesis: The "self" and worksheets feed keys agree for an old sheet but differ from the "alternate" key.
links %>% filter(rel %>% str_detect("2006#worksheetsfeed|self|alternate")) %>% group_by(sheet_title) %>% summarize(query = n_distinct(link_key), version = first(version)) %>% group_by(version) %>% summarize(min = min(query), max = max(query))
This "alternate" key -- only defined for old sheets and only available through the "alternate" link found in the spreadsheets feed -- is ultimately useful to us for any operations that require the Drive API. Empirically, I note it can also be extracted from the URL seen in the browser when visiting such a sheet.
Formalize this notion of the (default) key versus the "alternate" key, which is only defined for old sheets and is damned hard to get.
alt_keys <- links %>% filter(rel == "alternate") %>% group_by(sheet_title) %>% transmute(alt_key = ifelse(version == "new", NA_character_, link_key)) sheet_keys <- links %>% filter(rel == "self") %>% group_by(sheet_title) %>% transmute(sheet_key = link_key) links <- links %>% left_join(alt_keys) %>% left_join(sheet_keys)
Note the keys shown below are truncated! Wanted to fit more variables and show that alt_key
is NA
for new Sheets and that alt_key
!= sheet_key
for old Sheets.
links %>% glimpse links %>% filter(source == "content/entry/id") %>% #arrange(version, perm, sheet_title) %>% mutate(sheet_title = substr(sheet_title, 1, 15), sheet_key = substr(sheet_key, 1, 15), alt_key = substr(alt_key, 1, 15)) %>% select(sheet_title, perm, version, sheet_key, alt_key)
Now we turn to the worksheets feed. Hand-picked 4 example sheets: all possible combinations of new vs old sheets and sheets for which I do and do not have write permission.
example_sheets <- c("unitables2010final copy", "ari copy", "WI15 ARCHY 499", "^Gapminder$") examples <- example_sheets %>% gs_ls() %>% arrange(version, perm) examples %>% select(sheet_title, version, perm)
Get the worksheets feed for each example sheet. Use sheet names to name the resulting list. Get overview of all the feeds and the first one as an example.
req_list <- examples$ws_feed %>% lapply(gsheets_GET) #names(req_list) <- substr(examples$sheet_title, 1, 12) names(req_list) <- examples$sheet_title req_list %>% str(max.level = 1) req_list[[1]] %>% str(max.level = 1) ns_ws <- xml_ns_rename(xml_ns(req_list[[1]]$content), d1 = "feed")
A worksheet feed request returns 9 components:
url
is the URL of the worksheets feed itself (this is true by definition; it's an httr
thing)status_code
, date
, times
are semi-self-explanatory and/or off-topicheaders
+ all_headers
, cookies
, request
call for some inspection (below)content
is, of course, where it's really at (next subsection)I have executed and inspected the below but it's not very interesting, nor is it related to our inventory of links. Remove chunk option eval = FALSE
if you want to bring it back.
req_list %>% lapply(`[[`, "cookies") req_list %>% lapply(`[[`, "headers") req_list %>% lapply(`[[`, "all_headers") req_list %>% lapply(`[[`, "request")
It is convenient to create a named list holding just the content.
content <- req_list %>% lapply(`[[`,"content") content %>% str(max.level = 1) #xml2::xml_structure(content[[1]]) content[[1]] content %>% lapply(xml_children) %>% lapply(length)
Interesting! There is variability in the number of nodes. What varies?
f <- . %>% xml_children %>% xml_name possible_nodes <- content %>% lapply(f) %>% unlist() %>% unique() g <- . %>% xml_children %>% xml_name %>% factor(levels = possible_nodes) %>% table
#knitr::kable(sapply(content, g)) knitr::kable(sapply(content, g), format = "html", table.attr = "style='width:30%;'")
The variation is in the multiplicity of link
and entry
elements. We pursue that below, but let's inspect the more boring components before we move on. We can predict what some of this stuff is based on what we saw in the spreadsheets feed. I'm also going to check if the info here agrees with the spreadsheets feed.
f <- function(x, xpath) xml_find_first(x, xpath, ns_ws) %>% xml_text() wsf_stuff <- data_frame(title = sapply(content, f, "feed:title"), updated = sapply(content, f, "feed:updated"), author = sapply(content, f, "feed:author//feed:name"), email = sapply(content, f, "feed:author//feed:email"), totalResults = sapply(content, f, "openSearch:totalResults"), startIndex = sapply(content, f, "openSearch:startIndex"))
knitr::kable(wsf_stuff %>% select(title, updated, author, email), format = "html", table.attr = "style='width:80%;'") #knitr::kable(wsf_stuff %>% select(title, totalResults, startIndex))
So does updated
from a sheet's worksheets feed match updated
from a sheet's entry in the spreadsheets feed?
date_stuff <- data_frame(sheet_title = examples$sheet_title, ssf_up = examples$updated %>% as.POSIXct(format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"), wsf_header_last_mod = sapply(req_list, function(x) x$headers$`last-modified`) %>% httr::parse_http_date(), wsf_up = wsf_stuff$updated %>% as.POSIXct(format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"), wsf_header_date = sapply(req_list, function(x) x$headers$date) %>% httr::parse_http_date())
knitr::kable(date_stuff, format = "html", table.attr = "style='width:80%;'")
Strictly "by eye" and for these examples only, I see this:
updated
is the same as the last-modified
field of the headerdate
field of the header refers to the date-time of the GET
request to the worksheets feedupdated
from the spreadsheets feed is usually equal to updated
from the worksheets feed, but not always (see ari copy
for a slight difference)Let's compare author
name between the spreadsheets and worksheets feed.
author_stuff <- data_frame(sheet_title = examples$sheet_title, ssf_author = examples$author, wsf_author_name = wsf_stuff$author)
knitr::kable(author_stuff, format = "html", table.attr = "style='width:80%;'")
They agree ... well, except for the case. Weird.
Now let's look at totalResults
and startIndex
. I already imagine that totalResults
refers to the number of worksheets and will therefore compare it to the number of entry
nodes.
more_stuff <- data_frame(sheet_title = examples$sheet_title, n_entries = content %>% lapply(xml_find_all,"feed:entry", ns_ws) %>% sapply(length), wsf_totalResults = wsf_stuff$totalResults, wsf_startIndex = wsf_stuff$startIndex)
knitr::kable(more_stuff, format = "html", table.attr = "style='width:80%;'")
YES totalResults
appears to be the number of entry
elements, which is the number of worksheets or tabs. startIndex
remains a useless mystery. From some unrelated fiddling, I think it's a feature common to various Google APIs? I wonder if it only becomes meaningful in a paginated context?
For each example sheet, we grab the id
and the link
nodes and prepare as we did all the links from the spreadsheets feed.
jfun <- function(x) { # gymnastics required for one sheet's worth of links links <- x %>% xml_find_all("feed:link", ns_ws) %>% lapply(xml_attrs) %>% lapply(as.list) %>% lapply(as_data_frame) %>% bind_rows() %>% mutate(source = "content/entry/link") links %>% rbind(data.frame(rel = NA, type = NA, href = x %>% xml_find_first("feed:id", ns_ws) %>% xml_text(), source = "content/entry/id")) %>% mutate(feed = "ws", sheet_title = x %>% xml_find_first("feed:title", ns_ws) %>% xml_text()) %>% select(sheet_title, feed, source, href, rel, type) } wsf_links <- content %>% lapply(jfun) %>% bind_rows() wsf_links_table <- wsf_links %>% count(rel, sheet_title) %>% tidyr::spread(sheet_title, n)
knitr::kable(wsf_links_table, format = "html", table.attr = "style='width:30%;'")
So we always have links with these rel
attributes: "alternate", "self", and "http://schemas.google.com/g/2005#feed". We also have a link with rel
attribute "http://schemas.google.com/g/2005#post" for all but unitables2010final copy
, which is an old sheet for which we only have read access. Note: the link with rel = NA
here is the id
link from the top-level of the worksheets feed.
What relationship do these links have to those from the top-level of the worksheets feed and from the spreadsheets feed?
identical(wsf_links$href[wsf_links$rel %in% "self"], examples$ws_feed) all.equal(req_list %>% sapply(`[[`, "url"), examples$ws_feed, check.names = FALSE)
The "self" link in the worksheets feed gives the URL of the sheet's worksheets feed (the link labelled as "http://schemas.google.com/spreadsheets/2006#worksheetsfeed" in the spreadsheets feed), which is also the url
component of the worksheets feed. Confused yet?
wsf_links$href[wsf_links$rel %in% "alternate"] == examples$alternate
It gets worse! The "alternate" link in the worksheets feed is the same as the "alternate" link in the spreadsheet feed __for new sheets_. For old sheets, these two links have the same structure but the sheet key differs. The "alternate" link from the spreadsheets feed contains what I call the alternate key, which is needed for the Google Drive API, whereas the "alternate" link from the worksheets feed uses what I just call the sheet's key. Which is what we use when talking to the Sheets API.
I can find no references in the package's current code to the worksheet feed links labelled as "http://schemas.google.com/g/2005#feed" and "http://schemas.google.com/g/2005#post", so I'm not going to delve into them.
the "worksheets feed" can be found in ... url in worksheets feed = id inside the *content* of the worksheets feed = link named "self" inside the *content* of the worksheets feed = link named "http://schemas.google.com/spreadsheets/2006#worksheetsfeed" inside the corresponding entry inside the content of the spreadsheets feed https://spreadsheets.google.com/feeds/worksheets/KEY/VISIBILITY/PERMISSION VISIBILITY = {private (default), public} PERMISSION = {full (read and write), values (read only)} the "alternate" link can be found in ... link with `rel` attribute "alternate" in the corresponding `entry` inside the content of the spreadsheets feed link with `rel` attribute "alternate" inside the content of the worksheets feed for new sheets, the "alternate" link looks like this in both places: https://docs.google.com/spreadsheets/d/KEY/edit for old sheets, the "alternate" link looks like this: https://spreadsheets.google.com/ccc?key=SOME_SORT_OF_KEY where SOME_SORT_OF_KEY = ALT_KEY in the spreadsheets feed and SOME_SORT_OF_KEY = KEY in the worksheets feed
Add some info from the spreadsheets feed to the worksheets feed links. Then row bind into one large table of links.
wsf_links <- wsf_links %>% left_join(links %>% filter(source == "content/entry/id") %>% select(sheet_title, version, perm, link_key, alt_key, sheet_key), by = "sheet_title") links <- bind_rows(links %>% filter(sheet_title %in% examples$sheet_title), wsf_links) %>% arrange(version, perm, sheet_title, feed, source, rel)
Explore.
links %>% group_by(sheet_title) %>% summarise(n = n(), ss = sum(feed == "ss"), ws = sum(feed == "ws"), ndist = n_distinct(href), version = version[1], perm = perm[1])
Among these examples, there are only 3 distinct URLs (new sheets) or 4 (old sheets). What are they?
plyr::dlply(links, ~ sheet_title + href, function(x) { x %>% select(feed, source, rel, version, perm) })
I'm struggling to make this presentable in this report. For now, just reporting what I see in these results.
For a new sheet, the three URLs are:
the "alternate" link, found in content/entry/link of ss and ws feeds: https://docs.google.com/spreadsheets/d/KEY/edit this link is never really used for anything the spreadsheets link: https://spreadsheets.google.com/feeds/spreadsheets/private/full/KEY this link is never really used for anything the worksheets feed: https://spreadsheets.google.com/feeds/worksheets/KEY/private/full this link is critical and is stored redundantly in several places feed source rel ss content/entry/link http://...2006#worksheetsfeed ws content/entry/id <NA> ws content/entry/link http://schemas.google.com/g/2005#feed ws content/entry/link http://schemas.google.com/g/2005#post ws content/entry/link self
For an old sheet, there are four URLs instead of three, because the "alternate" links in the spreadsheets and worksheets feed contain different keys:
two "alternate" links, found in content/entry/link of ss and ws feeds: spreadsheets feed: https://spreadsheets.google.com/ccc?key=ALT_KEY worksheets feed: https://spreadsheets.google.com/ccc?key=KEY we use the alternate link from the spreadsheets feed to get ALT_KEY the spreadsheets link: https://spreadsheets.google.com/feeds/spreadsheets/private/full/KEY this link is never really used for anything the worksheets feed: https://spreadsheets.google.com/feeds/worksheets/KEY/private/full this link is critical and is stored redundantly in several places (there is no 2005#post link for a real-only old sheet) feed source rel ss content/entry/link http://...2006#worksheetsfeed ws content/entry/id <NA> ws content/entry/link http://schemas.google.com/g/2005#feed ws content/entry/link http://schemas.google.com/g/2005#post ws content/entry/link self
The entry
components correspond to worksheets within the sheet. As we did with content
, we make a list with one component per spreadsheet, each containing another list of the sheet's entry
elements.
ws_entries <- content %>% lapply(xml_find_all, "feed:entry", ns_ws) ws_entries %>% lapply(length) f <- . %>% xml_children %>% xml_name possible_nodes <- ws_entries %>% lapply(f) %>% unlist() %>% unique() g <- . %>% xml_children %>% xml_name %>% factor(levels = possible_nodes) %>% table
knitr::kable(sapply(ws_entries, g), format = "html", table.attr = "style='width:30%;'")
The links are our main interest. But first let's look at the titles, colCounts, and rowCounts.
ws_entries %>% lapply(xml_find_all, "feed:title", ns_ws) %>% lapply(xml_text) ws_entries %>% lapply(xml_find_all, "gs:rowCount", ns_ws) %>% lapply(xml_text) ws_entries %>% lapply(xml_find_all, "gs:colCount", ns_ws) %>% lapply(xml_text)
Let's look at the links now. I'm going to work with the links for the first worksheet from each spreadsheet and count on the rest to have similar structure.
ws_links <- ws_entries %>% lapply(`[`, 1) %>% lapply(xml_find_all, "feed:link", ns_ws) ws_links %>% sapply(length) examples %>% select(sheet_title, perm, version)
The number of links per worksheet is maximized for a "read and write" new sheet: 6 links per worksheet. There are 5 links per worksheet in the cases of a "read only" new sheet and a "read and write" old sheet. There are only 4 links per worksheet for a "read only" old sheet.
jfun <- function(x) { # gymnastics required for one sheet's worth of links links <- x %>% xml_find_all("feed:entry", ns_ws) %>% `[`(1) %>% xml_find_all("feed:link", ns_ws) %>% lapply(xml_attrs) %>% lapply(as.list) %>% lapply(as_data_frame) %>% bind_rows() links %>% mutate(sheet_title = x %>% xml_find_first("feed:title", ns_ws) %>% xml_text()) %>% select(sheet_title, href, rel, type) } one_ws_links <- content %>% lapply(jfun) %>% bind_rows() %>% mutate(rel = rel %>% basename) one_ws_links_table <- one_ws_links %>% count(rel, sheet_title) %>% tidyr::spread(sheet_title, n)
knitr::kable(one_ws_links_table, format = "html", table.attr = "style='width:30%;'")
The old sheets (ari copy
and unitables2010final copy
) are lacking the 2006#exportcsv, a fact we know all too well. And the "read only" sheets are missing the "edit" link, which stands to reason.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.