library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(readxl)
library(duckdb)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df2 <- tibble(
  grp = sample(1:2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df_miss <- tibble(
  a = rnorm_na(5, 1),
  b = rnorm_na(5, 1),
  c = rnorm_na(5, 2),
  d = rnorm(5)
)

expand_dates <- function(df) {
  df |>
    mutate(across(where(is.Date), list(year = year, month = month, day = mday)))
}

df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |>
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n()
    )
}

long <- df |>
  pivot_longer(a:d) |>
  group_by(name) |>
  summarize(median = median(value), mean = mean(value))

df_paired <- tibble(
  a_val = rnorm(10), a_wts = runif(10),
  b_val = rnorm(10), b_wts = runif(10),
  c_val = rnorm(10), c_wts = runif(10),
  d_val = rnorm(10), d_wts = runif(10)
)

gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
gapminder_1967 <- readxl::read_excel("data/gapminder/1967.xlsx")
gapminder_1972 <- readxl::read_excel("data/gapminder/1972.xlsx")
gapminder_1977 <- readxl::read_excel("data/gapminder/1977.xlsx")
gapminder_1982 <- readxl::read_excel("data/gapminder/1982.xlsx")
gapminder_1987 <- readxl::read_excel("data/gapminder/1987.xlsx")
gapminder_1992 <- readxl::read_excel("data/gapminder/1992.xlsx")
gapminder_1997 <- readxl::read_excel("data/gapminder/1997.xlsx")
gapminder_2002 <- readxl::read_excel("data/gapminder/2002.xlsx")
gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")

paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)

# Saving Multiple Outputs
template <- readxl::read_excel(paths[[1]]) |>
  mutate(year = 1952)

# Not sure why we need to specify purrr. See chapter 27 for discussion.
files <- paths |>
  set_names(basename) |>
  purrr::map(readxl::read_excel)

failed <- map_vec(files, is.null)

# DK: Had to comment out the creation of this object since it was causing a
# weird error during Github checks. Let's look at it next time. Why does this
# code work with Run Document but not with R CMD check?
# gapminder <- paths |>
#   set_names(basename) |>
#   map(readxl::read_excel) |>
#   list_rbind(names_to = "year") |>
#   mutate(year = parse_number(year))

process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}

df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full),
    n_miss = map_int(df, \(x) sum(is.na(x)))
  )
}

by_clarity <- diamonds |>
  group_nest(clarity) |>
  mutate(path = str_glue("diamonds-{clarity}.csv"))

carat_histogram <- function(df) {
  ggplot(df, aes(x = carat)) +
    geom_histogram(binwidth = 0.1)
}
This tutorial covers Chapter 26: Iteration from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We use the purrr package to learn tools for iteration, repeatedly performing the same action on different objects. Key functions include across()
, map()
, and user-created anonymous functions.
In this section, we will learn some more general tools, often called functional programming, because they are built around functions that take other functions as inputs.
Load the tidyverse package with the library()
command.
library(...)
library(tidyverse)
Recall that we have already learnt a number of tools that perform the same action for multiple "things." For example, facet_wrap()
and facet_grid()
draw a plot for each subset of the data.
Hit "Run Code" to create the df
tibble.
df <- tibble( a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) )
Other examples of functions which perform the same action for multiple "things" are group_by()
plus summarize()
, which compute summary statistics for each subset.
Type df
and hit "Run Code".
df
df
Our last examples of functions which perform the same action for multiple "things" are unnest_wider()
and unnest_longer()
which create new rows and columns for each element of a list-column.
Pipe df
to summarize()
.
df |> ...()
df |> summarize()
Note that this creates a tibble with one row and zero columns, because you have not told summarize() to create any new variables.
Add n = n()
to the call to summarize()
.
df |> summarize(n = ...)
df |> summarize(n = n())
Recall that this creates a new column named n
which records the number of rows in df
.
Add the median of a
and median of b
med_a = median(a)
and med_b = median(b)
, to the call to summarize()
. Remember the arguments to summarize must be separated by commas.
df |> summarize(n = n(), ... = median(a), med_b = ...(b))
df |> summarize(n = n(), med_a = median(a), med_b = median(b))
Now you have a tibble with 3 columns: n
, med_a
, and med_b
.
Copy and paste the previous code, then add med_c = median(c)
and med_d = median(d)
to the call to summarize()
.
df |> summarize(n = n(), ... = median(a), med_b = median(b), med_c = ..., med_d = median(...))
df |> summarize(n = n(), med_a = median(a), med_b = median(b), med_c = median(c), med_d = median(d))
You should see a tibble with five columns: n
, med_a
, med_b
, med_c
, and med_d
.
The above process is repetitive --- we simply copy-pasted med_a = median(a)
4 times and then changed the variable name. This violates our rule of thumb to never copy and paste more than twice.
This would get even more tedious if you had tens or even hundreds of columns! The solution? across()
is here to make our lives easier.
Start a new pipe by sending df
to summarize()
. Add n = n()
to summarize()
.
df |> ...(n = n())
df |> summarize(n = n())
The across()
function is the key trick.
Add across(a:d, median)
to the call to summarize()
.
df |> summarize(n = n(), across(..., median))
df |> summarize(n = n(), across(a:d, median))
across()
makes it easy to apply the same transformation to multiple columns, allowing you to use select()
semantics inside "data-masking" functions like summarize()
and mutate()
. Note how you get the variable names a
through d
"for free," without having to specify them as when using mutate()
.
We will discuss 3 important arguments of across()
in the following sections. The first argument is .cols
, used to select columns.
Hit "Run Code".
df2 <- tibble( grp = sample(1:2, 10, replace = TRUE), a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) )
sample(1:2, 10, replace = TRUE)
produces a vector of length 10, containing 1s and 2s at random.
Pipe df2
into summarize()
.
df2 |> summarize()
df2 |> summarize()
Remember that when it comes to grouping, we always prefer .by
instead of group_by()
where possible.
Add .by = grp
to the call to summarize()
.
df2 |> summarize(.by = grp)
df2 |> summarize(.by = grp)
Because .cols
uses the same specifications as select()
, you can use functions like starts_with()
and ends_with()
to select columns based on their name.
Copy the previous code. Add across(everything(), median)
to the call to summarize()
. You should have a comma between arguments to summarize.
... |> summarize(.by = ..., across(everything(), ...))
df2 |> summarize(.by = grp, across(everything(), median))
Grouping columns (grp
here) are not included in across()
, because they're automatically reserved by summarize()
. There are 2 selection techniques that are particularly useful for across()
: everything()
and where()
.
everything()
selects every (non-grouping) column.
where()
is a bit more complicated. It allows you to select columns based on their type:
where(is.numeric)
selects all numeric columns.
where(is.character)
selects all string columns.
where(is.Date)
selects all date columns.
where(is.POSIXct)
selects all date-time columns.
where(is.logical)
selects all logical columns.
You can combine these with Boolean algebra. For example:
!where(is.numeric)
selects all non-numeric columns
starts_with ("a") & where (is.logical)
selects all logical columns whose name starts with "a".
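As a minimal sketch using the df2 tibble from above, these helpers combine directly inside across():

df2 |>
  summarize(across(where(is.numeric) & starts_with("a"), median))

Here where(is.numeric) & starts_with("a") picks out only the numeric columns whose names start with "a".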
We will now explore .fns
, the second argument of across()
. This argument defines how each column will be transformed.
Hit "Run Code".
rnorm_na <- function(n, n_na, mean = 0, sd = 1) { sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na))) } df_miss <- tibble( a = rnorm_na(5, 1), b = rnorm_na(5, 1), c = rnorm_na(5, 2), d = rnorm(5) ) df_miss
We created and then used the rnorm_na()
function in order to more easily create a tibble with variables with lots of missing values.
Pipe df_miss
into summarize()
df_miss |> ...()
df_miss |> summarize()
It's important to note that we're passing the function (via the .fns argument)
to across()
, so across()
can call it; we're not calling it ourselves. That means the function name should never be followed by ()
. If you forget, you'll get an error.
Add across(a:d, median)
and n = n()
to the call to summarize()
.
df_miss |> summarize(across(..., ...), n = n())
df_miss |> summarize(across(a:d, median), n = n())
Several of the results are NA because median()
propagates the missing values in df_miss
. If we want to ignore missing values, then we need to use the na.rm = TRUE
argument to median()
.
Replace median
with function(x) median(x, na.rm = TRUE)
.
... |> summarize(...(a:d, function(x) median(x, na.rm = ...)), n = n())
df_miss |> summarize(across(a:d, function(x) median(x, na.rm = TRUE)), n = n())
Instead of calling median()
directly, we created a new (anonymous) function that calls median()
with the desired arguments.
Use the previous code, but replace function
with \
df_miss |> ...(across(a:d, \(x) median(x, na.rm = TRUE)), n = ...())
df_miss |> summarize(across(a:d, \(x) median(x, na.rm = TRUE)), n = n())
Using the full word function
when creating temporary (anonymous) functions is somewhat verbose, hence the option to replace function
with \
.
Replace \(x) median(x, na.rm = TRUE)
with list()
.
df_miss |> summarize(...(a:d, ...), n = n())
df_miss |> summarize(across(a:d, list()), n = n())
list()
does not do anything yet. Instead, it provides a location in which you can supply multiple functions by using a named list.
Add the argument med = \(x) median(x, na.rm = TRUE)
to the call inside of list()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n())
R applies the function in list()
to every variable in the first argument to across()
. But, because the argument to .fns
--- the second argument to across
--- is a list, R creates new variable names by combining each variable name with the name of the function, which is med
, not median
.
Add the argument n_miss = \(x) sum(is.na(x))
to the call inside of list()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
There is a lot going on here. We have four variables: a
through d
. We have two functions: med
and n_miss
. The magic of across()
is that it applies every function to each variable, creating new names for the output columns.
The last of these arguments to across()
is the .names
argument, which we will explore next.
Hit "Run Code".
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))), n = n())
The result of across()
is named according to the specification provided in the .names
argument. The default (NULL) is equivalent to {.col}
for the single function case and {.col}_{.fn}
for the case where a list is used for .fns
.
Add .names = "{.fn}_{.col}"
to the call to across()
.
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x))), .names = ...), n = n())
df_miss |> summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x))), .names = "{.fn}_{.col}"), n = n())
By default, columns are named using a glue specification like {.col}_{.fn}
. .col
is the name of the original column (e.g., a
, b
) and .fn
is the name of the function (e.g., med
).
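To see the two naming schemes side by side, here is a sketch; the names in the comments are what you should expect:

df_miss |>
  summarize(across(a:d, list(med = \(x) median(x, na.rm = TRUE))), n = n()) |>
  names()
#> "a_med" "b_med" "c_med" "d_med" "n"

With .names = "{.fn}_{.col}" instead, the same pipeline would yield "med_a", "med_b", and so on.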
For this practice, pipe df_miss
into mutate()
.
df_miss |> mutate()
df_miss |> mutate()
The .names
argument is particularly important when you use across()
with mutate()
.
Add across(a:d, \(x) coalesce(x, 0))
to the call to mutate()
.
df_miss |> mutate(...(a:d, \(x) ...(x, 0)))
df_miss |> mutate(across(a:d, \(x) coalesce(x, 0)))
Recall that coalesce()
fills in missing values; here, coalesce(x, 0) replaces each NA in x with 0.
Note that by default, the output of across()
is given the same names as the inputs. This means that across()
inside of mutate()
will replace any existing columns with the same names. The new a
is the same as the old a
, but with any NA replaced by a zero.
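For instance, coalesce() keeps the first non-missing value at each position:

coalesce(c(1, NA, 3), 0)
#> [1] 1 0 3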
To create new columns, we need to specify .names
.
Within the call to across()
, replace \(x) coalesce(x, 0)
with \(x) abs(x)
. Also, add .names = "{.col}_abs"
. Make sure to separate the arguments with commas.
df_miss |> mutate( across(a:d, \(x) ...(x), ... = "{.col}_abs"))
df_miss |> mutate( across(a:d, \(x) abs(x), .names = "{.col}_abs"))
Using .names
allows you to keep the original variables and also add new ones. In this case, a_abs
is just the absolute value of a
, and so on for b
, c
, and d
.
Create a new pipe by sending df_miss
into filter()
.
df_miss |> ...()
df_miss |> filter()
across()
is a great match for summarize()
and mutate()
, but it's more awkward to use with filter()
. We will use two variants of across()
called if_any()
and if_all()
with filter()
.
Add if_any()
to the call to filter()
df_miss |> filter(...())
df_miss |> filter(if_any())
if_any()
is the same as using |
between the conditions. Right now, it has no effect because we haven't provided any arguments.
Add a:d
and is.na
, separated by a comma, to the function if_any()
.
df_miss |> filter(if_any(..., is.na))
df_miss |> filter(if_any(a:d, is.na))
The code above is the same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
.
The other alternative, if_all()
, is the same as using &
.
Copy and paste the previous code. Change if_any
to if_all
and see what happens.
df_miss |> filter(if_all(...))
df_miss |> filter(if_all(a:d, is.na))
You will get an empty tibble because none of the rows have all the values for a
through d
equal to NA.
across()
is particularly useful to program with because it allows you to operate on multiple columns.
Hit "Run Code".
expand_dates <- function(df) { df |> mutate( across(where(is.Date), list(year = year, month = month, day = mday)) ) } df_date <- tibble( name = c("Amy", "Bob"), date = ymd(c("2009-08-03", "2010-01-16")) )
expand_dates()
uses lubridate functions to expand all date columns within a tibble into year, month, and day columns.
Pipe df_date
into expand_dates()
.
df_date |> ...()
df_date |> expand_dates()
Instead of having to specify the name of the column ahead of time --- a name which we might not even know --- we use where(is.Date)
to automatically pick out any date columns.
Hit "Run Code" to define the following function.
summarize_means <- function(df, summary_vars = where(is.numeric)) { df |> summarize( across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)), n = n() ) }
across()
makes it easy to supply multiple columns in a single argument. We would not expect you to be able to create a function like this, given the current state of your R knowledge. But you should be able to read and understand this code.
In this case, summary_vars
will be all the numeric variables in df
. The {{ }}
syntax is used for embracing the summary_vars
argument, allowing it to be passed as a variable or expression.
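As a minimal sketch of embracing on its own (the function name my_mean is hypothetical):

my_mean <- function(df, var) {
  df |> summarize(avg = mean({{ var }}, na.rm = TRUE))
}

df |> my_mean(a)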
Pipe diamonds
into group_by(cut)
, then pipe into summarize_means()
.
diamonds |> group_by(...) |> ...()
diamonds |> group_by(cut) |> summarize_means()
We mentioned that where possible, .by
should be used instead of group_by()
. However, this is a situation where it is not possible to use .by
.
Now, add the argument c(carat, x:z)
to the function summarize_means()
.
diamonds |> ...(cut) |> summarize_means(c(...))
diamonds |> group_by(cut) |> summarize_means(c(carat, x:z))
An interesting alternative to across()
is pivot_longer()
, which makes data sets longer by increasing the number of rows and decreasing the number of columns.
Hit "Run Code". Pay attention to the table generated.
df |> summarize(across(a:d, list(median = median, mean = mean)))
What did you notice about this table? It's too wide and hence hard to compare values.
There's a way to compute the same values by pivoting longer and then summarizing.
Hit "Run Code" to create the following tibble.
df_paired <- tibble( a_val = rnorm(10), a_wts = runif(10), b_val = rnorm(10), b_wts = runif(10), c_val = rnorm(10), c_wts = runif(10), d_val = rnorm(10), d_wts = runif(10) )
Type in df_paired
and hit "Run Code".
df_paired
df_paired
Now you should see the tibble you have created!
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
Pipe df_paired
into function pivot_longer()
.
df_paired |> pivot_longer()
# df_paired |> # pivot_longer()
You will receive an error message because pivot_longer
requires a value for the cols
argument.
Add the argument everything()
to the function pivot_longer()
.
df_paired |> pivot_longer(...())
df_paired |> pivot_longer(everything())
Remember that the cols
argument specifies which columns to pivot and everything()
specifies that we want to pivot every column.
Add two more arguments to pivot_longer()
: names_to = c("group", ".value")
and names_sep = "_"
.
df_paired |> pivot_longer(..., names_to = c("group", ".value"), names_sep = "_")
df_paired |> pivot_longer(everything(), names_to = c("group", ".value"), names_sep = "_")
The names_sep
argument tells pivot_longer()
what separates a group
value from a .value
in each of the column names. Here, that's an underscore _
.
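This paired pivot pays off once val and wts sit in their own columns: a weighted mean for each group becomes a short sketch like the following.

df_paired |>
  pivot_longer(everything(), names_to = c("group", ".value"), names_sep = "_") |>
  group_by(group) |>
  summarize(mean = weighted.mean(val, wts))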
Here's a quick summary of the above mentioned functions:
.cols
specifies which columns you want to iterate over
.fns
specifies what to do with each column
.names
is used when you need additional control over the names of output columns, which is particularly important when you use across()
with mutate()
.
if_any()
and if_all()
are two important variations which work with filter()
.
In this section, you'll learn how to use purrr::map()
to do something to every file in a directory. Imagine you have a directory full of Excel spreadsheets you want to read. You could do it with copy and paste:
data2019 <- readxl::read_excel("data/y2019.xlsx")
data2020 <- readxl::read_excel("data/y2020.xlsx")
data2021 <- readxl::read_excel("data/y2021.xlsx")
data2022 <- readxl::read_excel("data/y2022.xlsx")
And then use dplyr::bind_rows()
to combine them all together:
data <- bind_rows(data2019, data2020, data2021, data2022)
You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four.
Hit "Run Code".
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) paths
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) paths
Notice that we used the function list.files()
here, which lists the files in a directory.
This is the code from last exercise: paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
The first argument, path
, is the directory to look in. In this case, it is "data/gapminder"
.
pattern
is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$
or [.]csv$
to find all files with a specified extension.
full.names
determines whether or not the directory name should be included in the output. You almost always want this to be TRUE
.
Based on the 12 paths created from the last exercise, call read_excel()
12 times to get 12 data frames.
Your code should look something like this:
gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
gapminder_1967 <- readxl::read_excel("data/gapminder/1967.xlsx")
gapminder_1972 <- readxl::read_excel("data/gapminder/1972.xlsx")
gapminder_1977 <- readxl::read_excel("data/gapminder/1977.xlsx")
gapminder_1982 <- readxl::read_excel("data/gapminder/1982.xlsx")
gapminder_1987 <- readxl::read_excel("data/gapminder/1987.xlsx")
gapminder_1992 <- readxl::read_excel("data/gapminder/1992.xlsx")
gapminder_1997 <- readxl::read_excel("data/gapminder/1997.xlsx")
gapminder_2002 <- readxl::read_excel("data/gapminder/2002.xlsx")
gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")
Separate Excel files are easier to work with once we put them into a single object. You can use list()
to combine them together, then assign the list to a new variable, files
.
Your code should look like this:
files <- list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)
Type files[[3]]
and hit "Run Code", see what happens. Note that we are using [[]]
here instead of ()
.
...[[3]]
files[[3]]
Remember that after you have data frames in a list, you can use files[[i]]
to extract the ith element in the list.
Type map()
in the code chunk, then hit "Run Code".
map()
The map()
functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.
Add argument paths
and readxl::read_excel
to the call to map()
.
map(paths, readxl::read_excel)
# map(paths, readxl::read_excel)
The map()
function takes the form map(x, f)
; it is a way to make good use of the paths
vector.
Type length()
, then hit "Run Code".
length()
# length()
This will give you an error because we haven't added any arguments yet. Function length()
can get or set the length of vectors (including lists) and factors, and of any other R object for which a method has been defined.
Add files
, the collection of tibbles we just created, to the function length()
.
length(files)
Now you know how many files are in files
.
Run the code below to see what happens.
files[[1]]
This is another data structure that doesn’t display particularly compactly with str()
so you might want to load it into RStudio and inspect it with View()
.
Type list_rbind()
in the space below. Hit "Run Code".
list_rbind()
list_rbind()
combines elements into a data frame by row-binding them together with vctrs::vec_rbind()
.
Add files
to the call to list_rbind()
.
list_rbind(files)
Besides list_rbind()
, there are two more functions that combine list elements into a single data structure.
list_c()
combines elements into a vector by concatenating them together with vctrs::vec_c()
.
list_rbind()
combines elements into a data frame by row-binding them together with vctrs::vec_rbind()
.
list_cbind()
combines elements into a data frame by column-binding them together with vctrs::vec_cbind()
.
Pipe paths
into the function map()
.
paths |>
paths |> map()
Add \(path) readxl::read_excel(path, n_max = 1)
into the function map()
.
paths |> map(\(path) readxl::read_excel(path, n_max = 1))
n_max = 1
allows us to peek at just the first row of each file.
Pipe the code above into function list_rbind()
.
paths |> map(\(path) readxl::read_excel(path, n_max = 1)) |> list_rbind()
This combination of map()
and list_rbind()
allows all the spreadsheets to be read at once and combined into one data frame. As we did above with n_max = 1
, you can also customize how many rows you read.
Another important observation is that there's no "year" column indicating which row of data belongs to which spreadsheet.
Pipe paths
into set_names(basename)
paths |> set_names(basename)
Using basename()
in the set_names()
function, we can extract just the file name from the full path.
Pipe the previous code into map(readxl::read_excel)
paths |> set_names(basename) |> map(readxl::read_excel)
The names created in the last exercise are automatically carried along by all the map()
functions, so the list of data frames will have those same names.
Assign a new variable files
to the previous code.
files <- paths |>
  set_names(basename) |>
  map(readxl::read_excel)
What we've created above is actually shorthand for:
files <- list( "1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"), "1957.xlsx" = readxl::read_excel("data/gapminder/1957.xlsx"), "1962.xlsx" = readxl::read_excel("data/gapminder/1962.xlsx"), ..., "2007.xlsx" = readxl::read_excel("data/gapminder/2007.xlsx"))
Imagine how much time we would save!
Type files[[""]]
, with any year.xlsx
between the ""
files[["1962.xlsx"]]
You should get the file of the respective year you chose!
Copy and paste code from Exercise 14 to the space below. Hit "Run Code".
paths |> set_names(basename) |> map(readxl::read_excel)
Pipe the code above into list_rbind()
, then add argument names_to = "year"
to this function.
paths |> set_names(basename) |> map(readxl::read_excel) |> ...(names_to = "year")
The argument names_to
gives the name of the variable that will be created from the data stored in the column names.
Pipe the previous code into mutate()
to create a new column, with the argument year = parse_number(year)
.
paths |> set_names(basename) |> map(readxl::read_excel) |> list_rbind(names_to = "year") |> mutate(year = parse_number(year))
parse_number
parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number. The grouping mark specified by the locale is ignored inside the number.
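For example, applied to one of our file names:

parse_number("1952.xlsx")
#> [1] 1952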
Overall, we used the names_to
argument to list_rbind()
to tell it to save the names into a new column called year, then used readr::parse_number()
to extract the number from the string.
Assign the previous code to the variable gapminder
gapminder <- paths |> set_names(basename) |> map(readxl::read_excel) |> list_rbind(n...) |> mutate(...)
Type write_csv()
in the empty code chunk below, hit "Run Code".
write_csv()
In brief, write_csv()
is used to write a data frame to a delimited file.
Add arguments gapminder
and "gapminder.csv"
to the call to write_csv()
write_csv(gapminder, "gapminder.csv")
This is how we save the work we have previously done: a single CSV file is now created and ready for future use.
Run the following code.
df <- read_csv(paths)
Pipe the newly created variable df
into filter()
, with argument !is.na(id)
.
df |> filter(!is.na(id))
is.na()
tests for missing values, so filter(!is.na(id)) keeps only the rows where id is present.
Pipe the code above into mutate()
with argument id = tolower(id)
.
df |> filter(!is.na(id)) |> mutate(...)
Then use pivot_longer()
and add arguments jan:dec
and names_to = "month"
to the call. Make sure to add a ,
between the two arguments.
df |> filter(!is.na(id)) |> mutate(id = tolower(id)) |> pivot_longer(...)
Just a recap, pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
Assign the name process_file
to function(path)
, followed by a {}
with the previous code as the body.
process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}
In this step, we write a function that takes a file and does all the modifications (in this case, we read a bunch of files, filter out missing values, pivot, and then combine).
Pipe paths
into map()
.
paths |> map()
Add process_file
as an argument to map()
.
paths |> map(process_file)
Pipe the above into list_rbind()
.
paths |> map(process_file) |> list_rbind()
This is an example of doing one round of iteration with a complex function.
Pipe paths
into map(readxl::read_excel)
paths |> map(readxl::read_excel)
Recall that this code helps to load all files needed.
With homogeneous data, we can go from map()
straight to list_rbind()
. However, when the data frames are heterogeneous, list_rbind()
either fails or yields a data frame that’s not very useful. In this case, we can create a handy df_types
function that returns a tibble with one row for each column.
Type tibble()
and hit "Run Code".
tibble()
Let's add the argument col_name
with value names(df)
to the call to tibble()
.
tibble( col_name = names(df))
Recall that the col_name
column stores the column names retrieved by names(df).
Then, add the second argument, col_type
, then set its value to be map_chr(df, vctrs::vec_ptype_full)
.
tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full))
col_type
records the type of each column in the data frame.
map_chr()
returns a character vector; here, each element is produced by vctrs::vec_ptype_full
.
vec_ptype_full()
displays the full type of the vector.
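A couple of quick illustrations:

vctrs::vec_ptype_full(1:3)
#> [1] "integer"
vctrs::vec_ptype_full(letters)
#> [1] "character"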
Add a new argument n_miss
to the previous code, with value map_int(df, \(x) sum(is.na(x)))
.
tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))
To make the above code into a function, type function(df) {}
in front of tibble()
, remembering to put tibble()
inside {}
.
function(df) { tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))}
The previous chapter gave a detailed breakdown of how a new function is defined.
Assign the new function the name df_types
.
... <- function(df) { tibble( col_name = names(df), col_type = map_chr(df, vctrs::vec_ptype_full), n_miss = map_int(df, \(x) sum(is.na(x))))}
This assignment will make life easier for us later on!
Type df_types()
and an argument gapminder
, hit "Run Code", see what happens.
df_types(gapminder)
You should have a 6x3 tibble!
There are two additional functions worth mentioning.
map_if()
allows you to selectively modify elements of a list based on their values.
map_at()
allows you to selectively modify elements based on their names.
Pipe paths
into map()
.
paths |> map()
map()
has a downside: it succeeds or fails as a whole. It will either successfully read all of the files in a directory or fail with an error, reading zero files.
Add an argument possibly()
to the call to map()
.
paths |> map(possibly())
possibly()
is a function operator: it takes a function and returns a function with modified behavior.
In particular, possibly()
changes a function from erroring to returning a value that you specify.
Add arguments \(path) readxl::read_excel(path)
and NULL
to the call to possibly()
.
paths |> map(possibly(\(path) readxl::read_excel(path), NULL))
possibly()
consists of two main values: .f
and otherwise
.
.f
is a function to modify.
otherwise
is a default value to use when an error occurs.
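Here is a minimal sketch (the name safe_log is hypothetical):

safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)
#> [1] 2.302585
safe_log("x")
#> [1] NA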
Assign the above code to variable files
.
... <- paths |> map(possibly(\(path) readxl::read_excel(path), NULL))
Pipe files
into list_rbind()
.
files |> list_rbind()
This works particularly well here because list_rbind()
, like many tidyverse functions, automatically ignores NULL
s.
With function map_vec()
, add arguments files
and is.null
to the call.
map_vec(files, is.null)
Assign the result of the previous code to a new variable named failed
.
failed <- map_vec(files, is.null)
Hit "Run Code".
paths[failed]
You should see paths that failed.
Then call the import function again for each failure and figure out what went wrong.
Just a recap, there are three basic steps:
use list.files()
to list all the files in a directory, then use purrr::map()
to read each of them into a list, then use purrr::list_rbind()
to combine them into a single data frame. We also saw how to handle situations of increasing heterogeneity, where you can't do exactly the same thing to every file.
In the last section, you learned about map()
, which is useful for reading multiple files into a single object. In this and the following sections, we'll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We'll explore this challenge using three examples: saving data into a database, saving multiple CSV files, and saving multiple plots.
Sometimes when working with many files, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv)
. One approach to deal with this problem is to load your data into a database. Run the code in the code chunk below.
con <- DBI::dbConnect(duckdb::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
dbDisconnect(con, shutdown = TRUE)
con
is a connection to a database created by the duckdb package. Note that because of limits on learnr tutorials, we will need to run this command --- and any other commands which modify the database --- in each Exercise block for this tutorial. Apologies for the repetition!
We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. Assign template
to readxl::read_excel(paths[[1]])
.
template <- readxl::...(paths[[1]])
template <- readxl::read_excel(paths[[1]])
Other databases might allow us to load Excel files directly. With duckdb, however, we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of CSVs and the database that you’re working with doesn’t have one function that will load them all in.
Take the assignment from the previous question, add a pipe, and then use mutate()
to add a new variable, year
, and assign it a value of 1952.
template <- readxl::read_excel(paths[[1]]) |> mutate(... = 1952)
template <- readxl::read_excel(paths[[1]]) |> mutate(year = 1952)
We are about to add in data from a bunch of different files, each from a different year. So, we need to ensure that there is a year
column in the template in which we can store that year information.
Run this code.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
dbDisconnect(con, shutdown = TRUE)
Note that dbCreateTable()
doesn’t use the data in template, just the variable names and types.
Now, let's test it by piping con
into tbl("gapminder")
. (We provide the creation code. You add the pipe below it.)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con |> tbl("gapminder")
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)
con |> tbl("gapminder")
dbDisconnect(con, shutdown = TRUE)
The gapminder
table is empty now but it has the variables we need with the types we expect. You can see this explicitly if you add colnames()
to the pipe and re-run the code.
We need a function which will accept a path to a correctly formatted Excel file, read the file into a tibble, add the appropriate year variable, and then append that tibble to the gapminder
table we have created in our duckdb database.
Run this code.
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
Before creating a function like this, we often write the code which accomplishes what we want for a single case. With that script as a guide, we can create a function which is often just a "wrap" around that code.
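For instance, the single-case version of append_file() might start as a sketch like this, handling just the first path:

path <- paths[[1]]
df <- readxl::read_excel(path)
df$year <- parse_number(basename(path))
DBI::dbAppendTable(con, "gapminder", df)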
Below the provided code, pipe paths
into map(append_file)
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> map(...)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

# DK: This causes a failure with R CMD check. Why?
# paths |>
#   map(append_file)

dbDisconnect(con, shutdown = TRUE)
Again, it is annoying that we need to repeat so much code each Exercise.
But we don’t care about the output of append_file()
, so instead of map()
it’s slightly nicer to use walk()
. walk()
does exactly the same thing as map()
but throws the output away. Pipe paths
into walk(append_file)
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> ...(append_file)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

dbDisconnect(con, shutdown = TRUE)
Now we can check if we have all data in our table. Start a new line at the end of this code, piping con
into tbl("gapminder")
.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)
...
con |> tbl("gapminder")
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

con |> tbl("gapminder")
dbDisconnect(con, shutdown = TRUE)
We have successfully loaded a dozen Excel files into a database table.
Pipe the code above into count(year)
.
... |> count(year)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  DBI::dbAppendTable(con, "gapminder", df)
}

paths |> walk(append_file)

con |>
  tbl("gapminder") |>
  count(year)
dbDisconnect(con, shutdown = TRUE)
The gapminder data is so small that we can just work with it in memory. But, for larger amounts of data, we need to use a database like duckdb.
The same basic principle applies if we want to write multiple CSV files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds
data and save one CSV file for each value of clarity
.
Pipe diamonds
to group_nest()
.
diamonds |> ...()
diamonds |> group_nest()
group_nest()
is used to nest a tibble using a grouping specification. The output is a tibble with as many rows as there are values for the grouping variable. Because we did not specify a grouping variable, the output is a tibble with one row.
Add the value clarity
to the call to group_nest()
.
diamonds |> group_nest(...)
diamonds |> group_nest(clarity)
After the data frame itself, the key input to group_nest()
is the grouping specification (here, clarity
). The resulting tibble has two columns. The first has the same name as the grouping variable and contains its distinct values. The second is, by default, called data
. It is a list column in which each element is a tibble with all the rows from diamonds
which have the specified value of clarity
.
Add a call to mutate()
with argument path = str_glue("diamonds-{clarity}.csv")
to the pipe.
... |> ...(path = str_glue("diamonds-{clarity}.csv"))
diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
str_glue()
is a glue function: it interpolates the value of clarity into each string. You now have an 8 x 3 tibble, with a new column that gives the name of an output file.
Assign the result of this pipe to by_clarity
, a new object.
... <- diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
by_clarity <- diamonds |> group_nest(clarity) |> mutate(path = str_glue("diamonds-{clarity}.csv"))
Note how we first get some code which does what we want and then, once it is working, we assign the result to a new object. This is the typical way of doing data science because it is so useful, until the very end, for your code to just "spit out" whatever the current result is.
Type by_clarity
and hit "Run Code".
by_clarity
by_clarity
Our new tibble has eight rows and three columns.
Run the following code.
by_clarity$data[[1]]
by_clarity$data[[1]]
The $
pulls out the data
column, which is a list column. The [[1]]
pulls out the first element in that list column, which is a tibble. Remember that clarity
is our grouping variable and data
is a list-column containing one tibble for each unique value of clarity.
If we were going to save these data frames by hand, we might write something like:
write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
...
write_csv(by_clarity$data[[8]], by_clarity$path[[8]])
But that would be annoying, especially if there were hundreds of values for clarity
. Type ?walk2
in the space below, then hit "Run Code".
?walk2
?walk2
This is a little different from our previous uses of map()
because there are two arguments that are changing, not just one.
Call walk2()
with values by_clarity$data
, by_clarity$path
, and write_csv
.
walk2(..., by_clarity$path, ...)
walk2(by_clarity$data, by_clarity$path, write_csv)
We needed a new function, map2()
, which varies both the first and second arguments. Because we again don’t care about the output, we want walk2()
rather than map2()
.
Run list.files()
with the pattern
argument set to "*.csv"
.
list.files(pattern = ...)
list.files(pattern = "*.csv")
This should produce the 8 CSV files which you just created. Make sure that it does not return any files which you care about since, in the next question, we will be deleting all these files.
Always clean up the files which you create, unless you want to keep them. Pipe list.files(pattern = "*.csv")
to file.remove()
.
list.files(pattern = "*.csv") |> ...()
list.files(pattern = "*.csv") |> file.remove()
This removes all the files you created. It should return a logical vector of length 8, with each element equal to TRUE
. This is what file.remove()
returns if the removal is successful.
We can take the same basic approach in order to create many plots.
Below is a new function that draws the plot we want. Hit "Run Code".
carat_histogram <- function(df) { ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1) }
This function will only work if the tibble which you pass in as the value for the df
argument includes a column called carat
. If it doesn't, you will get an error like "object 'carat' not found".
Now we can use map()
to create a list of many plots and their eventual file paths.
Pipe by_clarity
into mutate()
. Add arguments plot = map(data, carat_histogram)
and path = str_glue("clarity-{clarity}.png")
to the function.
by_clarity |> mutate( plot = map(data, carat_histogram), path = str_glue("clarity-{clarity}.png") )
This step is similar to what we did earlier.
Assign the name by_clarity
to the code above.
by_clarity <- by_clarity |> mutate( plot = map(data, carat_histogram), path = str_glue("clarity-{clarity}.png") )
Type walk2()
then add arguments by_clarity$path
and by_clarity$plot
.
walk2( by_clarity$path, by_clarity$plot)
Add \(path, plot)
and ggsave()
to the call to walk2()
.
walk2( by_clarity$path, by_clarity$plot, \(path, plot) ggsave())
Add arguments path, plot, width = 6, height = 6
to ggsave()
.
walk2(by_clarity$path, by_clarity$plot, \(path, plot) ggsave(path, plot, width = 6, height = 6))
Remember that ggsave()
is a convenient function for saving a plot. It defaults to saving the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the extension.
This tutorial covered Chapter 26: Iteration from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We used the purrr package to learn tools for iteration, repeatedly performing the same action on different objects. Key functions included across()
, map()
, and user-created anonymous functions.