library(learnr) library(tutorial.helpers) library(tidyverse) library(readxl) library(writexl) knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 60, tutorial.storage = "local") penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
This tutorial covers Chapter 20: Spreadsheets from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to get data from Excel spreadsheets using read_excel()
from the readxl package and Google sheets using read_sheet()
from the googlesheets4 package.
readxl and writexl are the two key packages for working with Excel spreadsheets.
Use the library()
command to load the readxl package.
library(readxl)
“Data Organization in Spreadsheets” by Karl Broman and Kara Woo provides lots of great advice for organizing your data using spreadsheets.
Look up the help page for the package by typing, at the Console, ?
followed by `readxl-package
`. Copy/paste the Description field from the help page.
The reason that you need backticks around readxl-package
is because the Console has trouble with -
in the name of the file.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Almost all packages have a help page which can be accessed via packagename-package
. Note that you need to use backticks (` `) around this phrase after the ?
because otherwise the Console will try to interpret the dash as a minus sign, thereby sending you to the wrong page.
Run read_excel()
with "data/students.xlsx"
as the argument to path
. Do this, and the following questions, in the exercise code block, not at the Console.
read_excel(path = "...")
read_excel()
functions similarly to read_csv()
and the other data import functions you have seen before. The readxl package also includes read_xls()
(for files in xls format) and read_xlsx()
(for files in xlsx format). read_excel()
can read files with both xls and xlsx format. It guesses the file type based on the input.
The column names are a bit of a mess. You don't ever want spaces in column names, and it is always a good idea to use the same naming style in terms of capitalization and punctuation.
Run the above command again, but add c("student_id", "full_name", "favourite_food", "meal_plan", "age")
as the argument to col_names
.
read_excel( path = "data/students.xlsx", col_names = ... )
This worked, but not completely. The variable names are correct, but what was previously the header row now shows up as the first row in the resulting tibble.
Run the above command again, but add the skip
argument with a value of 1
because we want to skip the first row, which contains the column names, from the Excel file.
read_excel( path = "data/students.xlsx", col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = ... )
In the favourite_food
column, one of the observations is "N/A", which stands for “not available” but it’s currently not recognized as an NA
(note the contrast between this N/A and the age of the fourth student in the list).
Rerun the command, adding the na
argument with a value of c("", "N/A")
.
read_excel( path = "data/students.xlsx", col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = 1, na = ... )
You can specify which character strings should be recognized as NA
s with the na
argument. By default, only "" (empty string, or, in the case of reading from a spreadsheet, an empty cell or a cell with the formula =NA()
) is recognized as an NA
.
Do you see anything else wrong with the data? Check the age
column. It should be a numeric variable: either <dbl>
or <int>
. Instead, it is a character, which is caused by one of the values being "five".
To fix this, add the col_types
argument with a value of c("numeric", "text", "text", "text", "numeric")
.
read_excel( path = "data/students.xlsx", col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = 1, na = c("", "N/A"), col_types = ...)
Note that the acceptable values for col_types
are "skip", "guess", "logical", "numeric", "date", "text" and "list".
This code "worked" in that it ran but there are two problems: first, the warning message which informs us that one of the values in column five is not numeric and, second, the value "five" is transformed into NA
.
Change the last value in the value we passed to col_types
to "text".
read_excel( path = "data/students.xlsx", col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = 1, na = c("", "N/A"), col_types = c("numeric", "text", "text", "text", "..."))
This removes the warning message. Never ignore warning messages. They can signal danger. Understand them, and then fix them. In this case, we read everything in as character to the age
column, a common trick for dealing with weird data. We could now use mutate()
and other dplyr tricks to deal with the problem directly.
An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets, called worksheets. The excel_sheets()
function reports all the sheets in a spreadsheet. (The term "workbooks" is often applied to speadsheet files with multiple worksheets.)
Run excel_sheets()
with the path
argument set to "data/penguins.xlsx"
.
excel_sheets(path = "...")
Once we know the name of the sheet, we can use the sheet
argument to read_excel()
.
Run read_excel()
on "data/penguins.xlsx"
with the sheet
argument set to "Torgersen Island"
.
read_excel(path = ..., sheet = ...)
Note how so many variable types are <chr>
even though the underlying values are clearly numeric. The cause is the "NA" values which we find in, for example, row 4. This causes read_excel()
to "guess" that the columns are character instead of numeric.
The problem is that the default value of the na
argument is ""
. Rerun the code but with na
set to "NA"
.
read_excel( path = "data/penguins.xlsx", sheet = "Torgersen Island", na = "...")
There is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. If you’re going to do so, we recommend making a copy of the Excel file to open and browse interactively while leaving the original data file untouched and reading into R from the untouched file. This will ensure you don’t accidentally overwrite anything in the spreadsheet while inspecting it.
Many R packages come with data files for our use. The authors generally provide functions for accessing that data. One common approach returns the path to where that data is installed on your system.
Run readxl_example()
with "deaths.xlsx"
as its first argument.
readxl_example("...")
Part of the path should include readxl/extdata
. The extdata
directory is the recommended location for such files.
Pipe the result of the previous command to read_excel()
.
readxl_example("deaths.xls") |> read_...
The top three rows and the bottom four rows are not part of the data frame. It’s possible to eliminate these extraneous rows using the skip
and n_max
arguments, but we recommend using cell ranges.
In Excel, the top left cell is A1
. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1
, C1
, etc. And as you move down a column, the number in the cell label increases, i.e. A2
, A3
, etc.
In the previous pipe, add the argument range
(with a value of "A5:F15"
) to the call to read_excel()
.
readxl_example("deaths.xls") |> read_excel(range = "")
The underlying data in Excel spreadsheets is more complex. A cell can be one of four things:
A boolean, like TRUE, FALSE, or NA.
A number, like “10” or “10.5”.
A datetime, which can also include time like “11/1/21” or “11/1/21 3:00 PM”.
A text string, like “ten”.
When working with spreadsheet data, it’s important to keep in mind that the underlying data can be very different than what you see in the cell.
To write a tibble to Excel we use write_xlsx()
from the writexl package. Run the code below. Examine the resulting tibble. Then, add a pipe to write_xlsx()
with the path
argument set to "data/bake-sale.xlsx"
.
tibble( item = factor(c("brownie", "cupcake", "cookie")), quantity = c(10, 5, 8) )
tibble( item = factor(c("brownie", "cupcake", "cookie")), quantity = c(10, 5, 8) ) |> write_xlsx(path = "...")
If the code doesn't run, don't worry about it. If you open the file, you will see that column names are included and bolded. These can be turned off by setting the col_names
and format_headers
arguments to FALSE
.
This section will also focus on spreadsheets, but this time you’ll be loading data from a Google Sheet using read_sheet()
from the googlesheets4 package.
Use the library()
command to load the googlesheets4 package.
library(googlesheets4)
googlesheets4 uses v4 of the Google Sheets API v4 to provide an R interface to Google Sheets, hence the name.
Look up the help page for the package by issuing this command: help(package = "googlesheets4")
. Click on the link to the read_sheet()
function. Copy/paste the Description below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
You can create a brand new sheet with gs4_create()
or write to an existing sheet with sheet_write()
and friends.
Working with Google sheets, both those you control and those controlled by others, is often tricky. Google wants to assure that you are authorized to do what you are trying to do. The most important trick to know is that running the function gs4_deauth()
often just makes things work. Run that command now.
gs4_deauth()
If you don’t need to access private Sheets, use gs4_deauth()
to indicate there is no need for a token. This puts googlesheets4 into a de-authorized mode.
The URL for the Google sheet which we will be working with is "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
. First, view the sheet by clicking here. Second, pipe this URL to read_sheet()
.
"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> read_sheet()
Data in the wild is always messy. Note the inconsistency of the variable names, the ignored "N/A" and that AGE
is a list variable.
Starting with the previous code, add to read_sheet()
the col_names
argument with a value of c("student_id", "full_name", "favourite_food", "meal_plan", "age")
and the skip
argument with a value of 1
.
"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> read_sheet( col_names = "...", skip = ... )
Instead of modifying the column (aka variable) names by hand, we could use the clean_names()
function from the janitor package.
To solve the NA and list variable problem, we make two changes to the call to read_sheet()
. Add the na
argument with a value of c("", "N/A")
and the col_types
argument with a value of "dcccc"
.
"https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" |> read_sheet( col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = 1, na = ..., ... = "dcccc" )
Note that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”. The data is still not perfect since we need to deal with the "five" value in age
. But we leave that as an exercise for the reader.
Google Sheets can have more than one sheet. We first define a URL for the penguins Google Sheet. Run sheet_names()
on penguins_url
.
penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
sheet_names(penguins_url)
This Google sheet has the same structure as the Excel workbook we worked with earlier. Keep in mind that the nomenclature differs between Google and Excel, with the former being more confusing. The term "sheet" in Google can refer both to the overall workbook as well as to the individual pages within that workbook.
It’s also possible to read individual sheets from Google Sheets. We have defined penguins_url
in the environment. You can now call read_sheet()
with the first argument (which is named ss
) being penguins_url
and the second argument sheet
having a value "Torgersen Island"
.
read_sheet(ss = penguins_url, sheet = "Torgersen Island")
You can use write_sheet()
to both create new sheets and to add new (work)sheets within a currently existing sheet. We won't practice that here because authentication issues are sometimes tricky.
This tutorial covered Chapter 20: Spreadsheets from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You have learned how to get data from Excel spreadsheets using read_excel()
from the readxl package and Google sheets using read_sheet()
from the googlesheets4 package.
Read “Data Organization in Spreadsheets” by Karl Broman and Kara Woo for great advice about organizing your data using spreadsheets.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.