library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(readxl)
library(duckdb)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df2 <- tibble(
  grp = sample(1:2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df_miss <- tibble(
  a = rnorm_na(5, 1),
  b = rnorm_na(5, 1),
  c = rnorm_na(5, 2),
  d = rnorm(5)
)

expand_dates <- function(df) {
  df |>
    mutate(across(where(is.Date), list(year = year, month = month, day = mday)))
}

df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |>
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n()
    )
}

long <- df |>
  pivot_longer(a:d) |>
  group_by(name) |>
  summarize(median = median(value), mean = mean(value))

df_paired <- tibble(
  a_val = rnorm(10), a_wts = runif(10),
  b_val = rnorm(10), b_wts = runif(10),
  c_val = rnorm(10), c_wts = runif(10),
  d_val = rnorm(10), d_wts = runif(10)
)

gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
gapminder_1967 <- readxl::read_excel("data/gapminder/1967.xlsx")
gapminder_1972 <- readxl::read_excel("data/gapminder/1972.xlsx")
gapminder_1977 <- readxl::read_excel("data/gapminder/1977.xlsx")
gapminder_1982 <- readxl::read_excel("data/gapminder/1982.xlsx")
gapminder_1987 <- readxl::read_excel("data/gapminder/1987.xlsx")
gapminder_1992 <- readxl::read_excel("data/gapminder/1992.xlsx")
gapminder_1997 <- readxl::read_excel("data/gapminder/1997.xlsx")
gapminder_2002 <- readxl::read_excel("data/gapminder/2002.xlsx")
gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")

paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)

# Saving Multiple Outputs
template <- readxl::read_excel(paths[[1]]) |>
  mutate(year = 1952)

# Not sure why we need to specify purrr. See chapter 27 for discussion.
files <- paths |>
  set_names(basename) |>
  purrr::map(readxl::read_excel)

failed <- map_vec(files, is.null)

# DK: Had to comment out the creation of this object since it was causing a
# weird error during Github checks. Let's look at it next time. Why does this
# code work with Run Document but not with R CMD check?
# gapminder <- paths |>
#   set_names(basename) |>
#   map(readxl::read_excel) |>
#   list_rbind(names_to = "year") |>
#   mutate(year = parse_number(year))

process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}

df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full),
    n_miss = map_int(df, \(x) sum(is.na(x)))
  )
}

by_clarity <- diamonds |>
  group_nest(clarity) |>
  mutate(path = str_glue("diamonds-{clarity}.csv"))

carat_histogram <- function(df) {
  ggplot(df, aes(x = carat)) +
    geom_histogram(binwidth = 0.1)
}
This tutorial covers Chapter 26: Iteration from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We use the purrr package to learn tools for iteration, repeatedly performing the same action on different objects. Key functions include across()
, map()
, and user-created anonymous functions.
In this section, we will learn some more general tools, often called functional programming, because they are built around functions that take other functions as inputs.
Load the tidyverse package with the library()
command.
library(...)
library(tidyverse)
Recall that we have already learnt a number of tools that perform the same action for multiple "things." For example, facet_wrap()
and facet_grid()
draw a plot for each subset of the data.
Hit "Run Code" to create the df
tibble.
df <- tibble( a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) )
Other examples of functions which perform the same action for multiple "things" are group_by()
plus summarize()
, which compute summary statistics for each subset.
Type df
and hit "Run Code".
df
df
Our last examples of functions which perform the same action for multiple "things" are unnest_wider()
and unnest_longer()
which create new rows and columns for each element of a list-column.
Pipe df
to summarize()
.
df |> ...()
df |> summarize()
Note that this creates a tibble with one row and zero columns, because you have not told summarize() to create any new variables.
Add n = n()
to the call to summarize()
.
df |> summarize(n = ...)
df |> summarize(n = n())
Recall that this creates a new column named n
which records the number of rows in df
.
Add the median of a
and median of b
med_a = median(a)
and med_b = median(b)
, to the call to summarize()
. Remember the arguments to summarize must be separated by commas.
df |> summarize(n = n(), ... = median(a), med_b = ...(b))
df |> summarize(n = n(), med_a = median(a), med_b = median(b))
Now you have a tibble with 3 columns: n
, med_a
, and med_b
.
Copy and paste the previous code, then add med_c = median(c)
and med_d = median(d)
to the call to summarize()
.
df |> summarize(n = n(), ... = median(a), med_b = median(b), med_c = ..., med_d = median(...))
df |> summarize(n = n(), med_a = median(a), med_b = median(b), med_c = median(c), med_d = median(d))
You should see a tibble with five columns: n
, med_a
, med_b
, med_c
, and med_d
.
The above process is repetitive --- we simply copy-pasted med_a = median(a)
4 times and then changed the variable name. This violates our rule of thumb to never copy and paste more than twice.
This would get even more tedious if you had tens or even hundreds of columns! The solution? across()
is here to make our lives easier.
Start a new pipe by sending df
to summarize()
. Add n = n()
to summarize()
.
df |> ...(n = n())
df |> summarize(n = n())
The across()
function is the key trick.
Add across(a:d, median)
to the call to summarize()
.
df |> summarize(n = n(), across(..., median))
df |> summarize(n = n(), across(a:d, median))
across()
makes it easy to apply the same transformation to multiple columns, allowing you to use select()
semantics inside "data-masking" functions like summarize()
and mutate()
. Note how you get the variable names a
through d
"for free," without having to specify them as when using mutate()
.
We will discuss 3 important arguments of across()
in the following sections. The first argument is .cols
, used to select columns.
Hit "Run Code".
df2 <- tibble( grp = sample(1:2, 10, replace = TRUE), a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) )
sample(1:2, 10, replace = TRUE)
produces a vector of length 10, containing 1s and 2s at random.
Pipe df2
into summarize()
.
df2 |> summarize()
df2 |> summarize()
Remember that when it comes to grouping, we always prefer .by
instead of group_by()
where possible.
Add .by = grp
to the call to summarize()
.
df2 |> summarize(.by = grp)
df2 |> summarize(.by = grp)
Because .cols
uses the same specifications as select()
, you can use functions like starts_with()
and ends_with()
to select columns based on their name.
Copy the previous code. Add across(everything(), median)
to the call to summarize()
. You should have a comma between arguments to summarize.
... |> summarize(.by = ..., across(everything(), ...))
df2 |> summarize(.by = grp, across(everything(), median))
Grouping columns (grp
here) are not included in across()
, because they're automatically reserved by summarize()
. There are 2 selection techniques that are particularly useful for across()
: everything()
and where()
.
everything()
selects every (non-grouping) column.
where()
is a bit more complicated. It allows you to select columns based on their type:
where(is.numeric)
selects all numeric columns.
where(is.character)
selects all string columns.
where(is.Date)
selects all date columns.
where(is.POSIXct)
selects all date-time columns.
where(is.logical)
selects all logical columns.
You can combine these with Boolean algebra. For example:
!where(is.numeric)
selects all non-numeric columns
starts_with ("a") & where (is.logical)
selects all logical columns whose name starts with "a".
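As a minimal sketch using the df2 tibble from above, these helpers combine directly inside across():

df2 |>
  summarize(across(where(is.numeric) & starts_with("a"), median))

Here where(is.numeric) & starts_with("a") picks out only the numeric columns whose names start with "a".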
We will now explore .fns
, the second argument of across()
. This argument defines how each column will be transformed.
Hit "Run Code".
rnorm_na <- function(n, n_na, mean = 0, sd = 1) { sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na))) } df_miss <- tibble( a = rnorm_na(5, 1), b = rnorm_na(5, 1), c = rnorm_na(5, 2), d = rnorm(5) ) df_miss
We created and then used the rnorm_na()
function in order to more easily create a tibble with variables with lots of missing values.
Pipe df_miss
into summarize()
df_miss |> ...()
df_miss |> summarize()
It's important to note that we're passing the function (via the .fns argument)
to across()
, so across()
can call it; we're not calling it ourselves. That means the function name should never be followed by ()
. If you forget, you'll get an error.
Add across(a:d, median)
and n = n()
to the call to summarize()
.
df_miss |> summarize(across(..., ...), n = n())
df_miss |> summarize(across(a:d, median), n = n())
Several of the results are NA because median()
propagates the missing values in df_miss
. If we want to ignore missing values, then we need to use the na.rm = TRUE
argument to median()
.
Replace median
with function(x) median(x, na.rm = TRUE)
.
... |> summarize(...(a:d, function(x) median(x, na.rm = ...)), n = n())
df_miss |> summarize(across(a:d, function(x) median(x, na.rm = TRUE)), n = n())
Instead of calling median()
directly, we created a new (anonymous) function that calls median()
with the desired arguments.
Use the previous code, but replace function
with \
df_miss |> ...(across(a:d, \(x) median(x, na.rm = TRUE)), n = ...())
df_miss |> summarize(across(a:d, \(x) median(x, na.rm = TRUE)), n = n())
Using the full word function
when creating temporary (anonymous) functions is somewhat verbose, hence the option to replace function
with \
.
Replace \(x) median(x, na.rm = TRUE)
with list()
.
df_miss |> summarize(...(a:d, ...), n = n())
df_miss |> summarize(across(a:d, list()), n = n())
list()
does not do anything yet. Instead, it provides a location in which you can supply multiple functions by using a named list.
Add the argument med = \(x) median(x, na.rm = TRUE)
to the call inside of list()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n())
R applies the function in list()
to every variable in the first argument to across()
. But, because the argument to .fns
--- the second argument to across
--- is a list, R creates new variable names by combining each variable name with the name of the function, which is med
, not median
.
Add the argument n_miss = \(x) sum(is.na(x))
to the call inside of list()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
There is a lot going on here. We have four variables: a
through d
. We have two functions: med
and n_miss
. The magic of across()
is that it applies every function to each variable, creating new names for the output columns.
The last of these arguments to across()
is the .names
argument, which we will explore next.
Hit "Run Code".
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
The result of across()
is named according to the specification provided in the .names
argument. The default (NULL) is equivalent to {.col}
for the single function case and {.col}_{.fn}
for the case where a list is used for .fns
.
Add .names = "{.fn}_{.col}"
to the call to across()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x))), .names = ...), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x))), .names = "{.fn}_{.col}"), n = n())
By default, columns are named using a glue specification like {.col}_{.fn}
. .col
is the name of the original column (e.g., a
, b
) and .fn
is the name of the function (e.g., med
).
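To see the two naming schemes side by side, here is a sketch; the names in the comments are what you should expect:

df_miss |>
  summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n()) |>
  names()
#> "a_med" "b_med" "c_med" "d_med" "n"

With .names = "{.fn}_{.col}" instead, the same pipeline would yield "med_a", "med_b", and so on.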
For this practice, pipe df_miss
into mutate()
.
df_miss |> mutate()
df_miss |> mutate()
The .names
argument is particularly important when you use across()
with mutate()
.
Add across(a:d, \(x) coalesce(x, 0))
to the call to mutate()
.
df_miss |> mutate(...(a:d, \(x) ...(x, 0)))
df_miss |> mutate(across(a:d, \(x) coalesce(x, 0)))
Recall that coalesce()
fills in missing values; here, coalesce(x, 0) replaces each NA in x with 0.
Note that by default, the output of across()
is given the same names as the inputs. This means that across()
inside of mutate()
will replace any existing columns with the same names. The new a
is the same as the old a
, but with any NA replaced by a zero.
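For instance, coalesce() keeps the first non-missing value at each position:

coalesce(c(1, NA, 3), 0)
#> [1] 1 0 3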
To create new columns, we need to specify .names
.
Within the call to across()
, replace \(x) coalesce(x, 0)
with \(x) abs(x)
. Also, add .names = "{.col}_abs"
. Make sure to separate the arguments with commas.
df_miss |> mutate( across(a:d, \(x) ...(x), ... = "{.col}_abs"))
df_miss |> mutate( across(a:d, \(x) abs(x), .names = "{.col}_abs"))
Using .names
allows you to keep the original variables and also add new ones. In this case, a_abs
is just the absolute value of a
, and so on for b
, c
, and d
.
Create a new pipe by sending df_miss
into filter()
.
df_miss |> ...()
df_miss |> filter()
across()
is a great match for summarize()
and mutate()
, but it's more awkward to use with filter()
. We will use two variants of across()
called if_any()
and if_all()
with filter()
.
Add if_any()
to the call to filter()
df_miss |> filter(...())
df_miss |> filter(if_any())
if_any()
is the same as using |
between the conditions. Right now, it has no effect because we haven't provided any arguments.
Add a:d
and is.na
, separated by a comma, to the function if_any()
.
df_miss |> filter(if_any(..., is.na))
df_miss |> filter(if_any(a:d, is.na))
The code above is the same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
.
The other alternative, if_all()
, is the same as using &
.
Copy and paste the previous code. Change if_any
to if_all
and see what happens.
df_miss |> filter(if_all(...))
df_miss |> filter(if_all(a:d, is.na))
You will get an empty tibble because none of the rows have all the values for a
through d
equal to NA.
across()
is particularly useful to program with because it allows you to operate on multiple columns.
Hit "Run Code".
expand_dates <- function(df) { df |> mutate( across(where(is.Date), list(year = year, month = month, day = mday)) ) } df_date <- tibble( name = c("Amy", "Bob"), date = ymd(c("2009-08-03", "2010-01-16")) )
expand_dates()
uses lubridate functions to expand all date columns within a tibble into year, month, and day columns.
Pipe df_date
into expand_dates()
.
df_date |> ...()
df_date |> expand_dates()
Instead of having to specify the name of the column ahead of time --- a name which we might not even know --- we use where(is.Date)
to automatically pick out any date columns.
Hit "Run Code" to define the following function.
summarize_means <- function(df, summary_vars = where(is.numeric)) { df |> summarize( across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)), n = n() ) }
across()
makes it easy to supply multiple columns in a single argument. We would not expect you to be able to create a function like this, given the current state of your R knowledge. But you should be able to read and understand this code.
In this case, summary_vars
will be all the numeric variables in df
. The {{ }}
syntax is used for embracing the summary_vars
argument, allowing it to be passed as a variable or expression.
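As a minimal sketch of embracing on its own (the function name my_mean is hypothetical):

my_mean <- function(df, var) {
  df |> summarize(avg = mean({{ var }}, na.rm = TRUE))
}

df |> my_mean(a)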
Pipe diamonds
into group_by(cut)
, then pipe into summarize_means()
.
diamonds |> group_by(...) |> ...()
diamonds |> group_by(cut) |> summarize_means()
We mentioned that where possible, .by
should be used instead of group_by()
. However, this is a situation where it is not possible to use .by
.
Now, add the argument c(carat, x:z)
to the function summarize_means()
.
diamonds |> ...(cut) |> summarize_means(c(...))
diamonds |> group_by(cut) |> summarize_means(c(carat, x:z))
An interesting alternative to across()
is pivot_longer()
, which makes data sets longer by increasing the number of rows and decreasing the number of columns.
Hit "Run Code". Pay attention to the table generated.
df |> summarize(across(a:d, list(median = median, mean = mean)))
What did you notice about this table? It's too wide and hence hard to compare values.
There's a way to compute the same values by pivoting longer and then summarizing.
Hit "Run Code" to create the following tibble.
df_paired <- tibble( a_val = rnorm(10), a_wts = runif(10), b_val = rnorm(10), b_wts = runif(10), c_val = rnorm(10), c_wts = runif(10), d_val = rnorm(10), d_wts = runif(10) )
Type in df_paired
and hit "Run Code".
df_paired
df_paired
Now you should see the tibble you have created!
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
Pipe df_paired
into function pivot_longer()
.
df_paired |> pivot_longer()
# df_paired |> # pivot_longer()
You will receive an error message because pivot_longer
requires a value for the cols
argument.
Add the argument everything()
to the function pivot_longer()
.
df_paired |> pivot_longer(...())
df_paired |> pivot_longer(everything())
Remember that the cols
argument specifies which columns to pivot and everything()
specifies that we want to pivot every column.
Add two more arguments to pivot_longer()
: names_to = c("group", ".value")
and names_sep = "_"
.
df_paired |> pivot_longer(..., names_to = c("group", ".value"), names_sep = "_")
df_paired |> pivot_longer(everything(), names_to = c("group", ".value"), names_sep = "_")
The names_sep
argument tells pivot_longer()
what separates a group
value from a .value
in each of the column names. Here, that's an underscore _
.
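This paired pivot pays off once val and wts sit in their own columns: a weighted mean for each group becomes a short sketch like the following.

df_paired |>
  pivot_longer(everything(), names_to = c("group", ".value"), names_sep = "_") |>
  group_by(group) |>
  summarize(mean = weighted.mean(val, wts))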
Here's a quick summary of the above mentioned functions:
.cols
specifies which columns you want to iterate over
.fns
specifies what to do with each column
.names
is used when you need additional control over the names of output columns, which is particularly important when you use across()
with mutate()
.
if_any()
and if_all()
are two important variations which work with filter()
.
In this section, you'll learn how to use purrr::map()
to do something to every file in a directory. Imagine you have a directory full of Excel spreadsheets you want to read. You could do it with copy and paste:
data2019 <- readxl::read_excel("data/y2019.xlsx")
data2020 <- readxl::read_excel("data/y2020.xlsx")
data2021 <- readxl::read_excel("data/y2021.xlsx")
data2022 <- readxl::read_excel("data/y2022.xlsx")
And then use dplyr::bind_rows()
to combine them all together:
data <- bind_rows(data2019, data2020, data2021, data2022)
You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four.
Hit "Run Code".
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) paths
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) paths
Notice that we used the function list.files()
here, which lists the files in a directory.
This is the code from last exercise: paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
The first argument, path
, is the directory to look in. In this case, it is "data/gapminder"
.
pattern
is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$
or [.]csv$
to find all files with a specified extension.
full.names
determines whether or not the directory name should be included in the output. You almost always want this to be TRUE
.
Based on the 12 paths created from the last exercise, call read_excel()
12 times to get 12 data frames.
Your code should look something like this:
gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
gapminder_1967 <- readxl::read_excel("data/gapminder/1967.xlsx")
gapminder_1972 <- readxl::read_excel("data/gapminder/1972.xlsx")
gapminder_1977 <- readxl::read_excel("data/gapminder/1977.xlsx")
gapminder_1982 <- readxl::read_excel("data/gapminder/1982.xlsx")
gapminder_1987 <- readxl::read_excel("data/gapminder/1987.xlsx")
gapminder_1992 <- readxl::read_excel("data/gapminder/1992.xlsx")
gapminder_1997 <- readxl::read_excel("data/gapminder/1997.xlsx")
gapminder_2002 <- readxl::read_excel("data/gapminder/2002.xlsx")
gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")
Separate Excel files are easier to work with once we put them into a single object. You can use list()
to combine them together, then assign the list to a new variable, files
.
Your code should look like this:
files <- list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)
Type files[[3]]
and hit "Run Code", see what happens. Note that we are using [[]]
here instead of ()
.
...[[3]]
files[[3]]
Remember that after you have data frames in a list, you can use files[[i]]
to extract the ith element in the list.
Type map()
in the code chunk, then hit "Run Code".
map()
The map()
functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.
Add argument paths
and readxl::read_excel
to the call to map()
.
map(paths, readxl::read_excel)
# map(paths, readxl::read_excel)
The map()
function takes the form map(x, f)
; it is a way to make good use of the paths
vector.
Type length()
, then hit "Run Code".
length()
# length()
This will give you an error because we haven't added any arguments yet. Function length()
can get or set the length of vectors (including lists) and factors, and of any other R object for which a method has been defined.
Add files
, the collection of tibbles we just created, to the function length()
.
length(files)
Now you know how many files are in files
.
Run the code below to see what happens.
files[[1]]
This is another data structure that doesn’t display particularly compactly with str()
so you might want to load it into RStudio and inspect it with View()
.
Type list_rbind()
in the space below. Hit "Run Code".
list_rbind()
list_rbind()
combines elements into a data frame by row-binding them together with vctrs::vec_rbind()
.
Add files
to the call to list_rbind()
.
list_rbind(files)
Besides list_rbind()
, there are two more functions that combine list elements into a single data structure.
list_c()
combines elements into a vector by concatenating them together with vctrs::vec_c()
.
list_rbind()
combines elements into a data frame by row-binding them together with vctrs::vec_rbind()
.
list_cbind()
combines elements into a data frame by column-binding them together with vctrs::vec_cbind()
.
Pipe paths
into the function map()
.
paths |>
paths |> map()
Add \(path) readxl::read_excel(path, n_max = 1)
into the function map()
.
paths |> map(\(path) readxl::read_excel(path, n_max = 1))
n_max = 1
allows us to peek at just the first row of each file.
Pipe the code above into function list_rbind()
.
paths |> map(\(path) readxl::read_excel(path, n_max = 1)) |> list_rbind()
This combination of map()
and list_rbind()
allows all the spreadsheets to be read at once and combined into one data frame. As we did above with n_max = 1
, you can also customize how many rows you read.
Another important observation is that there's no "year" column indicating which row of data belongs to which spreadsheet.
Pipe paths
into set_names(basename)
paths |> set_names(basename)
Using basename()
in the set_names()
function, we can extract just the file name from the full path.
Pipe the previous code into map(readxl::read_excel)
paths |> set_names(basename) |> map(readxl::read_excel)
The names created in the last exercise are automatically carried along by all the map()
functions, so the list of data frames will have those same names.
Assign a new variable files
to the previous code.
files <- paths |>
  set_names(basename) |>
  map(readxl::read_excel)
What we've created above is actually shorthand for:
files <- list( "1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"), "1957.xlsx" = readxl::read_excel("data/gapminder/1957.xlsx"), "1962.xlsx" = readxl::read_excel("data/gapminder/1962.xlsx"), ..., "2007.xlsx" = readxl::read_excel("data/gapminder/2007.xlsx"))
Imagine how much time we would save!
Type files[[""]]
, with any year.xlsx
between the ""
files[["1962.xlsx"]]
You should get the file of the respective year you chose!
Copy and paste code from Exercise 14 to the space below. Hit "Run Code".
paths |> set_names(basename) |> map(readxl::read_excel)
Pipe the code above into list_rbind()
, then add argument names_to = "year"
to this function.
paths |> set_names(basename) |> map(readxl::read_excel) |> ...(names_to = "year")
The argument names_to
gives the name of the variable that will be created from the data stored in the column names.
Pipe the previous code into mutate()
to create a new column, with the argument year = parse_number(year)
.
paths |> set_names(basename) |> map(readxl::read_excel) |> list_rbind(names_to = "year") |> mutate(year = parse_number(year))
parse_number
parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number. The grouping mark specified by the locale is ignored inside the number.
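For example, applied to one of our file names:

parse_number("1952.xlsx")
#> [1] 1952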
Overall, we used the names_to
argument to list_rbind()
to tell it to save the names into a new column called year, then used readr::parse_number()
to extract the number from the string.
Assign the previous code to the variable gapminder
gapminder <- paths |> set_names(basename) |> map(readxl::read_excel) |> list_rbind(n...) |> mutate(...)
Type write_csv()
in the empty code chunk below, hit "Run Code".
write_csv()
In brief, write_csv()
is used to write a data frame to a delimited file.
Add arguments gapminder
and "gapminder.csv"
to the call to write_csv()
write_csv(gapminder, "gapminder.csv")
This is how we save the work we have previously done: a single CSV file is now created and ready for future use.
Run the following code.
df <- read_csv(paths)
Pipe the newly created variable df
into filter()
, with argument !is.na(id)
.
df |> filter(!is.na(id))
is.na()
tests for missing values, so filter(!is.na(id)) keeps only the rows where id is present.
Pipe the code above into mutate()
with argument id = tolower(id)
.
df |> filter(!is.na(id)) |> mutate(...)
Then use pivot_longer()
and add arguments jan:dec
and names_to = "month"
to the call. Make sure to add a ,
between the two arguments.
df |> filter(!is.na(id)) |> mutate(id = tolower(id)) |> pivot_longer(...)
Just a recap, pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
Assign the name process_file
to function(path)
, followed by a {}
with the previous code as the body.
process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}
In this step, we write a function that takes a file and does all the modifications (in this case, we read a bunch of files, filter out missing values, pivot, and then combine).
Pipe paths
into map()
.
paths |> map()
Add process_file
as an argument to map()
.
paths |> map(process_file)
Pipe the above into list_rbind()
.
paths |> map(process_file) |> list_rbind()
This is an example of doing one round of iteration with a complex function.
Pipe paths
into map(readxl::read_excel)
paths |> map(readxl::read_excel)
Recall that this code helps to load all files needed.
With homogeneous data, we can go from map()
straight to list_rbind()
. However, when the data frames are heterogeneous, list_rbind()
either fails or yields a data frame that’s not very useful. In this case, we can create a handy df_types
function that returns a tibble with one row for each column.
Type tibble()
and hit "Run Code".
tibble()
Let's add the argument col_name
with value names(df)
to the call to tibble()
.
tibble( col_name = names(df))
Recall that the col_name
column stores the column names retrieved by names(df).
Then, add the second argument, col_type
, then set its value to be map_chr(df, vctrs::vec_ptype_full)
.
tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full))
col_type
records the type of each column in the data frame.
map_chr()
returns a character vector; here, each element is produced by vctrs::vec_ptype_full
.
vec_ptype_full()
displays the full type of the vector.
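A couple of quick illustrations:

vctrs::vec_ptype_full(1:3)
#> [1] "integer"
vctrs::vec_ptype_full(letters)
#> [1] "character"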
Add a new argument n_miss
to the previous code, with value map_int(df, \(x) sum(is.na(x)))
.
tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))
To make the above code into a function, type function(df) {}
in front of tibble()
, remembering to put tibble()
inside {}
.
function(df) { tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))}
The previous chapter gave a detailed breakdown of how a new function is defined.
Assign the new function the name df_types
.
... <- function(df) { tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))}
This assignment will make life easier for us later on!
Type df_types()
and an argument gapminder
, hit "Run Code", see what happens.
df_types(gapminder)
You should have a 6x3 tibble!
There are two additional functions worth mentioning.
map_if()
allows you to selectively modify elements of a list based on their values.
map_at()
allows you to selectively modify elements based on their names.
Pipe paths
into map()
.
paths |> map()
map()
has a downside: it succeeds or fails as a whole. It will either successfully read all of the files in a directory or fail with an error, reading zero files.
Add an argument possibly()
to the call to map()
.
paths |> map(possibly())
possibly()
is a function operator: it takes a function and returns a function with modified behavior.
In particular, possibly()
changes a function from erroring to returning a value that you specify.
Add arguments \(path) readxl::read_excel(path)
and NULL
to the call to possibly()
.
paths |> map(possibly(\(path) readxl::read_excel(path), NULL))
possibly()
consists of two main values: .f
and otherwise
.
.f
is a function to modify.
otherwise
is a default value to use when an error occurs.
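Here is a minimal sketch (the name safe_log is hypothetical):

safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)
#> [1] 2.302585
safe_log("x")
#> [1] NA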
Assign the above code to variable files
.
... <- paths |> map(possibly(\(path) readxl::read_excel(path), NULL))
Pipe files
into list_rbind()
.
files |> list_rbind()
This works particularly well here because list_rbind()
, like many tidyverse functions, automatically ignores NULL
s.
With function map_vec()
, add arguments files
and is.null
to the call.
map_vec(files, is.null)
Assign the result of the previous code to a new variable named failed
.
failed <- map_vec(files, is.null)
Hit "Run Code".
paths[failed]
You should see paths that failed.
Then call the import function again for each failure and figure out what went wrong.
Just a recap, there are three basic steps:
use list.files()
to list all the files in a directory, then use purrr::map()
to read each of them into a list, then use purrr::list_rbind()
to combine them into a single data frame. We also saw how to handle situations of increasing heterogeneity, where you can't do exactly the same thing to every file.
In the last section, you learned about map()
, which is useful for reading multiple files into a single object. In this and the following sections, we'll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We'll explore this challenge using three examples: saving data into a database, saving multiple CSV files, and saving multiple plots.
Sometimes when working with many files, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv)
. One approach to deal with this problem is to load your data into a database. Run the code in the code chunk below.
con <- DBI::dbConnect(duckdb::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
dbDisconnect(con, shutdown = TRUE)
con
is a connection to a database created by the duckdb package. Note that because of limits on learnr tutorials, we will need to run this command --- and any other commands which modify the database --- in each Exercise block for this tutorial. Apologies for the repetition!
We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. Assign template
to readxl::read_excel(paths[[1]])
.
template <- readxl::...(paths[[1]])
template <- readxl::read_excel(paths[[1]])
Other databases might allow us to load Excel files directly. With duckdb, however, we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of CSVs and the database that you’re working with doesn’t have one function that will load them all in.
Take the assignment from the previous question, add a pipe, and then use mutate()
to add a new variable, year
, and assign it a value of 1952.
template <- readxl::read_excel(paths[[1]]) |> mutate(... = 1952)
template <- readxl::read_excel(paths[[1]]) |> mutate(year = 1952)
We are about to add in data from a bunch of different files, each from a different year. So, we need to ensure that there is a year
column in the template in which we can store that year information.
Run this code.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
dbDisconnect(con, shutdown = TRUE)
Note that dbCreateTable()
doesn’t use the data in template, just the variable names and types.
Now, let's test it by piping con
into tbl("gapminder")
. (We provide the creation code. You add the pipe below it.)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con |> tbl("gapminder")
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con |> tbl("gapminder")
dbDisconnect(con, shutdown = TRUE)
The gapminder
table is empty now but it has the variables we need with the types we expect. You can see this explicitly if you add colnames()
to the pipe and re-run the code.
We need a function which will accept a path to a correctly formatted Excel file, read the file into a tibble, add the appropriate year variable, and then append that tibble to the gapminder
table we have created in our duckdb database.
Run this code.
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
Before creating a function like this, we often write the code which accomplishes what we want for a single case. With that script as a guide, we can create a function which is often just a "wrap" around that code.
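For instance, the single-case version of append_file() might start as a sketch like this, handling just the first path:

path <- paths[[1]]
df <- readxl::read_excel(path)
df$year <- parse_number(basename(path))
DBI::dbAppendTable(con, "gapminder", df)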
Below the provided code, pipe paths
into map(append_file)
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> map(...)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

# DK: This causes a failure with R CMD check. Why?
# paths |>
#   map(append_file)

dbDisconnect(con, shutdown = TRUE)
Again, it is annoying that we need to repeat so much code each Exercise.
But we don’t care about the output of append_file()
, so instead of map()
it’s slightly nicer to use walk()
. walk()
does exactly the same thing as map()
but throws the output away. Pipe paths
into walk(append_file)
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> ...(append_file)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

dbDisconnect(con, shutdown = TRUE)
Now we can check if we have all data in our table. Start a new line at the end of this code, piping con
into tbl("gapminder")
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)
...
con |> tbl("gapminder")
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

con |> tbl("gapminder")
dbDisconnect(con, shutdown = TRUE)
We have successfully loaded a dozen Excel files into a database table.
Pipe the code above into count(year)
.
... |> count(year)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

con |>
  tbl("gapminder") |>
  count(year)
dbDisconnect(con, shutdown = TRUE)
The gapminder data is so small that we can just work with it in memory. But, for larger amounts of data, we need to use a database like duckdb.
The same basic principle applies if we want to write multiple CSV files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds
data and save one CSV file for each value of clarity
.
Pipe diamonds
to group_nest()
.
diamonds |> ...()
diamonds |> group_nest()
group_nest()
is used to nest a tibble using a grouping specification. The output is a tibble with as many rows as there are values for the grouping variable. Because we did not specify a grouping variable, the output is a tibble with one row.
Add the value clarity
to the call to group_nest()
.
diamonds |> group_nest(...)
diamonds |> group_nest(clarity)
After the data frame itself, the key input to group_nest()
is the grouping specification (here, clarity
). The resulting tibble has two columns. The first has the same name as the grouping variable and contains its distinct values. The second is, by default, called data
. It is a list column in which each element is a tibble with all the rows from diamonds
which have the specified value of clarity
.
Add a call to mutate()
with argument path = str_glue("diamonds-{clarity}.csv")
to the pipe.
... |> ...(path = str_glue("diamonds-{clarity}.csv"))
diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
str_glue()
is a glue function: it interpolates the value of clarity into each string. You now have an 8 x 3 tibble, with a new column that gives the name of an output file.
Assign the result of this pipe to by_clarity
, a new object.
... <- diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
by_clarity <- diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
Note how we first get some code which does what we want and then, once it is working, we assign the result to a new object. This is the typical way of doing data science because it is so useful, until the very end, for your code to just "spit out" whatever the current result is.
Type by_clarity
and hit "Run Code".
by_clarity
by_clarity
Our new tibble has eight rows and three columns.
Run the following code.
by_clarity$data[[1]]
by_clarity$data[[1]]
The $
pulls out the data
column, which is a list column. The [[1]]
pulls out the first element in that list column, which is a tibble. Remember that clarity
is our grouping variable and data
is a list-column containing one tibble for each unique value of clarity.
If we were going to save these data frames by hand, we might write something like:
write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
...
write_csv(by_clarity$data[[8]], by_clarity$path[[8]])
But that would be annoying, especially if there were hundreds of values for clarity
. Type ?walk2
in the space below, then hit "Run Code".
?walk2
?walk2
This is a little different from our previous uses of map()
because there are two arguments that are changing, not just one.
Call walk2()
with values by_clarity$data
, by_clarity$path
, and write_csv
.
walk2(..., by_clarity$path, ...)
walk2(by_clarity$data, by_clarity$path, write_csv)
We needed a new function, map2()
, which varies both the first and second arguments. Because we again don’t care about the output, we want walk2()
rather than map2()
.
Run list.files()
with the pattern
argument set to "*.csv"
.
list.files(pattern = ...)
list.files(pattern = "*.csv")
This should produce the 8 CSV files which you just created. Make sure that it does not return any files which you care about since, in the next question, we will be deleting all these files.
Always clean up the files which you create, unless you want to keep them. Pipe list.files(pattern = "*.csv")
to file.remove()
.
list.files(pattern = "*.csv") |> ...()
list.files(pattern = "*.csv") |> file.remove()
This removes all the files you created. It should return a logical vector of length 8, with each element equal to TRUE
. This is what file.remove()
returns if the removal is successful.
We can take the same basic approach in order to create many plots.
Below is a new function that draws the plot we want. Hit "Run Code".
carat_histogram <- function(df) { ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1) }
This function will only work if the tibble which you pass in as the value for the df
argument includes a column called carat
. If it doesn't, you will get an error like "object 'carat' not found".
Now we can use map()
to create a list of many plots and their eventual file paths.
Pipe by_clarity
into mutate()
. Add arguments plot = map(data, carat_histogram)
and path = str_glue("clarity-{clarity}.png")
to the function.
by_clarity |> mutate( plot = map(data, carat_histogram), path = str_glue("clarity-{clarity}.png") )
This step is similar to what we did earlier.
Assign the name by_clarity
to the code above.
by_clarity <- by_clarity |> mutate( plot = map(data, carat_histogram), path = str_glue("clarity-{clarity}.png") )
Type walk2()
then add arguments by_clarity$path
and by_clarity$plot
.
walk2( by_clarity$path, by_clarity$plot)
Add \(path, plot)
and ggsave()
to the call to walk2()
.
walk2( by_clarity$path, by_clarity$plot, \(path, plot) ggsave())
Add arguments path, plot, width = 6, height = 6
to ggsave()
.
walk2(by_clarity$path, by_clarity$plot, \(path, plot) ggsave(path, plot, width = 6, height = 6))
Remember that ggsave()
is a convenient function for saving a plot. It defaults to saving the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the extension.
This tutorial covered Chapter 26: Iteration from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We used the purrr package to learn tools for iteration, repeatedly performing the same action on different objects. Key functions included across()
, map()
, and user-created anonymous functions.