can_decrypt <- gargle::secret_has_key("GOOGLESHEETS4_KEY")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE,
  purl = can_decrypt,
  eval = can_decrypt
)
message("No token available. Code chunks will not be evaluated.")
library(googlesheets4)

Basic Sheet reading is shown in the Get started article. Here we show how to target a specific (work)sheet or cell range, how to deal with column types, and how to get detailed cell data.

Auth

As a regular, interactive user, you can just let googlesheets4 prompt you for anything it needs re: auth.

Since this article is compiled noninteractively on a server, we have arranged for googlesheets4 to use a service account token (not shown).

# happens in .onLoad() when IN_PKGDOWN, but need this for local dev/preview
googlesheets4:::gs4_auth_docs(drive = TRUE) 

# attempt to reduce quota exhaustion problems
if (identical(Sys.getenv("IN_PKGDOWN"), "true")) Sys.sleep(30)

read_sheet() and range_read() are synonyms

The main "read" function of the googlesheets4 package goes by two names, because we want it to make sense in two contexts:

read_sheet() and range_read() are synonyms and you can use either one. Throughout this article, we're going to use range_read().

Note: The first release of googlesheets used a sheets_ prefix everywhere, so we had sheets_read(). It still works, but it's deprecated and will go away rather swiftly.

Specify the range and column types

Here we read from the "mini-gap" and "deaths" example Sheets to show some of the different ways to specify (work)sheet and cell ranges.

range_read(gs4_example("mini-gap"), sheet = 2)

range_read(gs4_example("mini-gap"), sheet = "Oceania", n_max = 3)

range_read(gs4_example("deaths"), skip = 4, n_max = 10)

The example below shows the use of range to specify both the (work)sheet and an A1-style cell range.

It also demonstrates how col_types gives control of column types, similar to how col_types works in readr and readxl. Note that currently there is only support for the "shortcode" style of column specification and we plan to align better with readr's capabilities in a future release.

range_read(
  gs4_example("deaths"), range = "other!A5:F15", col_types = "?ci??D"
)
# set deaths_hyperlink to a fallback value, so that this article doesn't
# error when can_decrypt (and therefore document-level eval) is FALSE
# this makes the inline code work, no matter what
deaths_hyperlink <- "[here](https://docs.google.com/spreadsheets/d/1VTJjWoP1nshbyxmL9JqXgdVsimaYty21LGxxs018H2Y)"
deaths_url <- rlang::with_interactive(value = FALSE,
  gs4_browse(gs4_example("deaths"))
)
deaths_hyperlink <- glue::glue("[here]({deaths_url})")

If you looked at the "deaths" spreadsheet in the browser (it's r deaths_hyperlink), you know that it has some of the typical features of real world spreadsheets: the main data rectangle has prose intended for human-consumption before and after it. That's why we have to specify the range when we read from it.

We've designated the data rectangles as named ranges, which provides a very slick way to read them -- definitely less brittle and mysterious than approaches like range = "other!A5:F15" or skip = 4, n_max = 10. A named range can be passed via the range = argument:

gs4_example("deaths") %>% 
  range_read(range = "arts_data")

The named ranges, if any exist, are part of the information returned by gs4_get().

Detailed cell data

range_read_cells() returns a data frame with one row per cell and it gives access to raw cell data sent by the Sheets API.

(df <- range_read_cells(gs4_example("deaths"), range = "E5:E7"))
df$cell[[3]]

Specify cell_data = "full", discard_empty = FALSE to get even more data if you, for example, need access to cell formulas or formatting.

spread_sheet() converts data in the "one row per cell" form into the data frame you get from range_read(), which involves reshaping and column typing.

df %>% spread_sheet(col_types = "D")
## is same as ...
range_read(gs4_example("deaths"), range = "E5:E7", col_types ="D")

When speed matters

If your Sheet is so large that the speed of range_read() is causing problems, consider range_speedread(). It uses a special URL that allows a Sheet to be read as comma-separated values (CSV). Access via this URL doesn't use the Sheets API (although range_speedread() still makes an API call to retrieve Sheet metadata). As an example, on a Sheet with around 57,000 rows and 25 columns (over 1.4 million cells), range_speedread() takes ~5 seconds, whereas range_read() takes closer to 3 minutes. Why wouldn't we always take the faster option?!? Because the speed difference is imperceptible for many Sheets and there are some downsides (described later).

range_speedread() has much the same interface as range_read().

gs4_example("gapminder") %>% 
  range_speedread(sheet = "Oceania", n_max = 3)

The output above reveals that, under the hood, range_speedread() calls an external function for CSV parsing (namely, readr::read_csv()). An important consequence is that all arguments around column type specification are passed along to the CSV-parsing function. Here is a demo using readr-style column specification:

gs4_example("deaths") %>% 
  range_speedread(
    range = "other!A5:F15",
    col_types = readr::cols(
      Age = readr::col_integer(),
      `Date of birth` = readr::col_date("%m/%d/%Y"),
      `Date of death` = readr::col_date("%m/%d/%Y")
    )
  )

Compare that to how we would read the same data with range_read():

gs4_example("deaths") %>% 
  range_read(range = "other_data", col_types = "??i?DD")

This example highlights two important differences:

What's the speed difference for something like the Africa sheet in the "gapminder" example Sheet? (around 625 rows x 6 columns, or 3700 cells)

system.time(
  gs4_example("gapminder") %>% range_speedread(sheet = "Africa")
)
system.time(
  gs4_example("gapminder") %>% range_read(sheet = "Africa")
)

The modest difference above shows that the speed difference is unlikely to be a gamechanger in many settings.

Summary of how to think about range_speedread() vs range_read():



jennybc/googlesheets2 documentation built on Dec. 10, 2023, 12:56 a.m.