Spreadsheets

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(readxl)
library(writexl)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"


Introduction

This tutorial covers Chapter 20: Spreadsheets from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to get data from Excel spreadsheets using read_excel() from the readxl package and Google sheets using read_sheet() from the googlesheets4 package.

Reading/Writing Excel

readxl and writexl are the two key packages for working with Excel spreadsheets.

Exercise 1

Use the library() command to load the readxl package.


library(readxl)

Data Organization in Spreadsheets” by Karl Broman and Kara Woo provides lots of great advice for organizing your data using spreadsheets.

Exercise 2

Look up the help page for the package by typing, at the Console, ? followed by `readxl-package`. Copy/paste the Description field from the help page.

The reason that you need backticks around readxl-package is because the Console has trouble with - in the name of the file.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Almost all packages have a help page which can be accessed via packagename-package. Note that you need to use backticks (` `) around this phrase after the ? because otherwise the Console will try to interpret the dash as a minus sign, thereby sending you to the wrong page.

Exercise 3

Run read_excel() with "data/students.xlsx" as the argument to path. Do this, and the following questions, in the exercise code block, not at the Console.


read_excel(path = "...")

read_excel() functions similarly to read_csv() and the other data import functions you have seen before. The readxl package also includes read_xls() (for files in xls format) and read_xlsx() (for files in xlsx format). read_excel() can read files with both xls and xlsx format. It guesses the file type based on the input.

Exercise 4

The column names are a bit of a mess. You don't ever want spaces in column names, and it is always a good idea to use the same naming style in terms of capitalization and punctuation.

Run the above command again, but add c("student_id", "full_name", "favourite_food", "meal_plan", "age") as the argument to col_names.


read_excel(
  path = "data/students.xlsx",
  col_names = ...
  )

This worked, but not completely. The variable names are correct, but what was previously the header row now shows up as the first row in the resulting tibble.

Exercise 5

Run the above command again, but add the skip argument with a value of 1 because we want to skip the first row, which contains the column names, from the Excel file.


read_excel(
  path = "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), 
  skip = ...
  )

In the favourite_food column, one of the observations is "N/A", which stands for “not available” but it’s currently not recognized as an NA (note the contrast between this N/A and the age of the fourth student in the list).

Exercise 6

Rerun the command, adding the na argument with a value of c("", "N/A").


read_excel(
  path = "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1, 
  na = ...
  )

You can specify which character strings should be recognized as NAs with the na argument. By default, only "" (empty string, or, in the case of reading from a spreadsheet, an empty cell or a cell with the formula =NA()) is recognized as an NA.

Exercise 7

Do you see anything else wrong with the data? Check the age column. It should be a numeric variable: either <dbl> or <int>. Instead, it is a character, which is caused by one of the values being "five".

To fix this, add the col_types argument with a value of c("numeric", "text", "text", "text", "numeric").


read_excel(
  path = "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1, 
  na = c("", "N/A"), 
  col_types = ...)

Note that the acceptable values for col_types are "skip", "guess", "logical", "numeric", "date", "text" and "list".

This code "worked" in that it ran but there are two problems: first, the warning message which informs us that one of the values in column five is not numeric and, second, the value "five" is transformed into NA.

Exercise 8

Change the last value in the value we passed to col_types to "text".


read_excel(
  path = "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1, 
  na = c("", "N/A"), 
  col_types = c("numeric", "text", "text", "text", "..."))

This removes the warning message. Never ignore warning messages. They can signal danger. Understand them, and then fix them. In this case, we read everything in as character to the age column, a common trick for dealing with weird data. We could now use mutate() and other dplyr tricks to deal with the problem directly.

Exercise 9

An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets, called worksheets. The excel_sheets() function reports all the sheets in a spreadsheet. (The term "workbooks" is often applied to speadsheet files with multiple worksheets.)

Run excel_sheets() with the path argument set to "data/penguins.xlsx".


excel_sheets(path = "...")

Once we know the name of the sheet, we can use the sheet argument to read_excel().

Exercise 10

Run read_excel() on "data/penguins.xlsx" with the sheet argument set to "Torgersen Island".


read_excel(path = ..., sheet = ...)

Note how so many variable types are <chr> even though the underlying values are clearly numeric. The cause is the "NA" values which we find in, for example, row 4. This causes read_excel() to "guess" that the columns are character instead of numeric.

Exercise 11

The problem is that the default value of the na argument is "". Rerun the code but with na set to "NA".


read_excel(
  path = "data/penguins.xlsx", 
  sheet = "Torgersen Island", 
  na = "...")

There is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. If you’re going to do so, we recommend making a copy of the Excel file to open and browse interactively while leaving the original data file untouched and reading into R from the untouched file. This will ensure you don’t accidentally overwrite anything in the spreadsheet while inspecting it.

Exercise 12

Many R packages come with data files for our use. The authors generally provide functions for accessing that data. One common approach returns the path to where that data is installed on your system.

Run readxl_example() with "deaths.xlsx" as its first argument.


readxl_example("...")

Part of the path should include readxl/extdata. The extdata directory is the recommended location for such files.

Exercise 13

Pipe the result of the previous command to read_excel().


readxl_example("deaths.xls") |> 
  read_...

The top three rows and the bottom four rows are not part of the data frame. It’s possible to eliminate these extraneous rows using the skip and n_max arguments, but we recommend using cell ranges.

Exercise 14

In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.

In the previous pipe, add the argument range (with a value of "A5:F15") to the call to read_excel().


readxl_example("deaths.xls") |> 
  read_excel(range = "")

The underlying data in Excel spreadsheets is more complex. A cell can be one of four things:

When working with spreadsheet data, it’s important to keep in mind that the underlying data can be very different than what you see in the cell.

Exercise 15

To write a tibble to Excel we use write_xlsx() from the writexl package. Run the code below. Examine the resulting tibble. Then, add a pipe to write_xlsx() with the path argument set to "data/bake-sale.xlsx".

tibble(
  item     = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10, 5, 8)
)

tibble(
  item     = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10, 5, 8)
) |> 
  write_xlsx(path = "...")

If the code doesn't run, don't worry about it. If you open the file, you will see that column names are included and bolded. These can be turned off by setting the col_names and format_headers arguments to FALSE.

Google sheets

This section will also focus on spreadsheets, but this time you’ll be loading data from a Google Sheet using read_sheet() from the googlesheets4 package.

Exercise 1

Use the library() command to load the googlesheets4 package.


library(googlesheets4)

googlesheets4 uses v4 of the Google Sheets API v4 to provide an R interface to Google Sheets, hence the name.

Exercise 2

Look up the help page for the package by issuing this command: help(package = "googlesheets4"). Click on the link to the read_sheet() function. Copy/paste the Description below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

You can create a brand new sheet with gs4_create() or write to an existing sheet with sheet_write() and friends.

Exercise 3

Working with Google sheets, both those you control and those controlled by others, is often tricky. Google wants to assure that you are authorized to do what you are trying to do. The most important trick to know is that running the function gs4_deauth() often just makes things work. Run that command now.


gs4_deauth()

If you don’t need to access private Sheets, use gs4_deauth() to indicate there is no need for a token. This puts googlesheets4 into a de-authorized mode.

Exercise 4

The URL for the Google sheet which we will be working with is "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w". First, view the sheet by clicking here. Second, pipe this URL to read_sheet().


"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> 
  read_sheet()

Data in the wild is always messy. Note the inconsistency of the variable names, the ignored "N/A" and that AGE is a list variable.

Exercise 5

Starting with the previous code, add to read_sheet() the col_names argument with a value of c("student_id", "full_name", "favourite_food", "meal_plan", "age") and the skip argument with a value of 1.


"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> 
  read_sheet(
    col_names = "...",
    skip = ...
  )

Instead of modifying the column (aka variable) names by hand, we could use the clean_names() function from the janitor package.

Exercise 6

To solve the NA and list variable problem, we make two changes to the call to read_sheet(). Add the na argument with a value of c("", "N/A") and the col_types argument with a value of "dcccc".


"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> 
  read_sheet(
    col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
    skip = 1,
    na = ...,
    ... = "dcccc"
  )

Note that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”. The data is still not perfect since we need to deal with the "five" value in age. But we leave that as an exercise for the reader.

Exercise 7

Google Sheets can have more than one sheet. We first define a URL for the penguins Google Sheet. Run sheet_names() on penguins_url.

penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"

sheet_names(penguins_url)

This Google sheet has the same structure as the Excel workbook we worked with earlier. Keep in mind that the nomenclature differs between Google and Excel, with the former being more confusing. The term "sheet" in Google can refer both to the overall workbook as well as to the individual pages within that workbook.

Exercise 8

It’s also possible to read individual sheets from Google Sheets. We have defined penguins_url in the environment. You can now call read_sheet() with the first argument (which is named ss) being penguins_url and the second argument sheet having a value "Torgersen Island".


read_sheet(ss = penguins_url, sheet = "Torgersen Island")

You can use write_sheet() to both create new sheets and to add new (work)sheets within a currently existing sheet. We won't practice that here because authentication issues are sometimes tricky.

Summary

This tutorial covered Chapter 20: Spreadsheets from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You have learned how to get data from Excel spreadsheets using read_excel() from the readxl package and Google sheets using read_sheet() from the googlesheets4 package.

Read “Data Organization in Spreadsheets” by Karl Broman and Kara Woo for great advice about organizing your data using spreadsheets.




Try the r4ds.tutorials package in your browser

Any scripts or data that you put into this service are public.

r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.