Arrow

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(arrow)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 2400, 
        tutorial.storage = "local") 

# Despite much experimentation, I could not get the major download to work when
# executed inside of an Exercise block. So, that means we need to do at least
# that outside of the tutorial itself. 

# We have the student download the data (and then manipulate it) in the temp
# directory associated with her current R session. This is somewhat convenient
# in that, when she restarts R, all the files are cleaned up. That is good since
# we don't want a bunch of 9 gig files filling up her hard drive. But it is bad
# in that she can't restart her R session while doing this tutorial! Would it be
# better to store the data on her Desktop? Or in a "data" directory, which we
# create within the current working directory?

# Key question: Should the student do most/all of her work in the Console,
# CP/CR'ing along the way or should she do most/all of the work in the Exercise
# code blocks. The former allows for using the real data. The latter could only
# work with our test data. The cool approach would be to provide both! Ideally,
# students would do stuff on their own, but, if they had computer problems or
# were working on Posit Cloud, they could still use our test data. Leave that
# for the next version of the tutorial. For now, we assign everything in the
# Console. But all our test code chunks use our test data.

csv_path <- paste0(tempdir(), "/seattle-library-checkouts.csv")
pq_path <- paste0(tempdir(), "/seattle-library-checkouts")

# Key problem now, for the parquet format section, is that seattle_csv does not
# exist. So, we used this code to create a tiny version of it. Of course, this
# assumes that you have already downloaded the data and that you are currently
# located in the r4ds.tutorial project directory.

# For some reason, you need to collect() before you can sample. Of course, this
# code is only run by hand, when we need to fix the test data. It is annoying
# that we only keep 100 rows, but R CMD check forces us to keep this package as
# small as possible.

# open_dataset(sources = csv_path,
#              col_types = schema(ISBN = string()),
#              format = "csv") |>
#   collect() |>
#   slice_sample(n = 100) |>
#   write_csv(file = "inst/tutorials/22-arrow/data/test.csv")

# We can now create seattle_csv to use in our test code chunks. We can also
# write out, and then work with, parquet files.

seattle_csv <- open_dataset(
  sources = "data/test.csv", 
  col_types = schema(ISBN = string()),
  format = "csv"
)

seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path,
                format = "parquet")

seattle_pq <- open_dataset(pq_path)

query <- seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)


Introduction

This tutorial covers Chapter 22: Arrow from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will learn about the arrow package for working with Apache Arrow, a cross-language development platform for in-memory and larger-than-memory data. Key functions include open_dataset() and write_dataset().

Note that the data used in this tutorial is about 9 gigabytes. This might be too much for your computer. Most online tools, like Posit Cloud, will not allow you to download so much data. If this causes you problems, just read the R4DS chapter: Chapter 22: Arrow.

Instead of writing all your answers inside Exercise code blocks, you will do a fair amount of your work in your R session itself and then copy/paste the command and result (CP/CR) as your answer to the question.

Getting the data

CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this tutorial, you’ll learn about a powerful alternative: the parquet format, an open standards-based format widely used by big data systems.

Exercise 1

Load the tidyverse library.


library(...)
library(tidyverse)

We’ll pair parquet files with Apache Arrow, a multi-language toolbox designed for efficient analysis and transport of large datasets. We’ll use Apache Arrow via the arrow package, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.

Exercise 2

Load the arrow library.


library(...)
library(arrow)

As an additional benefit, arrow is extremely fast: you’ll see some examples later in this tutorial.

Exercise 3

At the Console, run arrow_info(). CP/CR the first 10 lines or so of the response.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)

There is some chance that the answer will look like this:

> arrow_info()
Arrow package version: 15.0.1

Capabilities:

acero      TRUE
dataset   FALSE
substrait FALSE
parquet   FALSE
...

If so, we have a problem. We need the capabilities for at least "dataset" and "parquet" to be TRUE.

If you have this problem, you should try to install the arrow package from source with:

install.packages("arrow", type = "source", INSTALL_opts = "--no-lock")

You want arrow_info() to look like this:

> arrow_info()
Arrow package version: 15.0.1

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
...

The package version does not matter, as long as it is later than 15.

Exercise 4

Load the duckdb library.


library(...)
library(duckdb)

Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.

Exercise 5

Load the dbplyr library.


library(...)
library(dbplyr)

In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both a database and parquet, and then pick the one that works the best for you.

Exercise 6

Type tempdir() and hit "Run Code" to take a look.


tempdir()
tempdir()

tempdir() returns the same directory each time you run it, at least until you restart your R session. To learn more about temporary files and directories, check out ?tempdir.

Exercise 7

Type paste0(tempdir(), "/seattle-library-checkouts.csv") and hit "Run Code" to examine it. This creates the full path to a file called "seattle-library-checkouts.csv" located in the temporary directory for this R session.


paste0(tempdir(), "/seattle-library-checkouts.csv")
paste0(tempdir(), "/seattle-library-checkouts.csv")

We will download a dataset of item checkouts from Seattle public libraries, available online, and save it under this name and location. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.

Exercise 8

In the R Console (not in an Exercise code block), run:

curl::multi_download(
  urls = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfiles = paste0(tempdir(), "/seattle-library-checkouts.csv"),
  resume = TRUE)

This will download a 9 GB CSV file which will take about 5 minutes. So go check your email (or read more R for Data Science!) and come back in 5 minutes.

CP/CR. (That is, copy/paste the command and result (CP/CR) as your answer to this Exercise.)

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)

We recommend using curl::multi_download() to get very large files as it’s built for exactly this purpose: it gives you a progress bar and it can resume the download if it’s interrupted. (Sadly, the tutorial blocks our view of the progress bar.)

Exercise 9

For the rest of this tutorial, you will type many (but not all) of the commands into your own Console, and then copy/paste the command/response (CP/CR) as your answer. Do not quit the tutorial before you are finished. Doing so will delete your data and you will need to download it again, although your answers will be saved, as usual.

In the Console, run list.files(tempdir()). CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

In addition to the "seattle-library-checkouts.csv" file, there should be some other weird files in the temporary directory. Don't worry about them. (We don't really understand them ourselves, but they probably involve R's attempts to manage this directory.)

Exercise 10

In the Console, create a new variable, csv_path, and assign it to paste0(tempdir(), "/seattle-library-checkouts.csv"). CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
csv_path <- paste0(tempdir(), "/seattle-library-checkouts.csv")

We just want a more convenient way to store the location/name of the CSV file.
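
By the way, file.path() is an equivalent alternative to paste0() here (an aside, not something the tutorial requires): it inserts the path separator for you.

# Aside (equivalent alternative, not required): file.path() adds the "/"
# between the directory and the file name for you.
csv_path <- file.path(tempdir(), "seattle-library-checkouts.csv")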

Opening a dataset

Let’s start by taking a look at the data. At 9 GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB. This means we want to avoid read_csv() and instead use arrow::open_dataset().

Exercise 1

Run this code in the Console. CP/CR.

seattle_csv <- open_dataset(
  sources = csv_path, 
  col_types = schema(ISBN = string()),
  format = "csv"
)
question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

What happens when this code is run? open_dataset() will scan a few thousand rows to figure out the structure of the dataset. The ISBN column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure. Once the data has been scanned by open_dataset(), it records what it’s found and stops; it will only read further rows as you specifically request them.
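
If more than one column ever needed help, schema() accepts several columns at once. Here is a sketch of the syntax; only ISBN actually needs this for our data, and Publisher is included purely for illustration.

# Sketch: schema() can pin down the types of several columns at once.
# Only ISBN needs this for the Seattle data; Publisher is just illustrative.
open_dataset(
  sources = csv_path,
  col_types = schema(ISBN = string(), Publisher = string()),
  format = "csv"
)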

Exercise 2

For this and later Exercises which will be completed in the Console, you may find it useful to open a temporary R script (File -> New File -> R Script). You can write your answer there and then run it in the Console with Cmd/Ctrl + Enter. This makes it easy to edit your code, often by adding more lines to the pipe, and then submit the answer to the next question.

Run seattle_csv in the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv 

The first line in the output tells you that seattle_csv is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has imputed for each column.

Exercise 3

In the Console, pipe seattle_csv into glimpse(). CP/CR.

This file is big, so computations such as glimpse() will take a while to run.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv |>
  glimpse()

We can see what’s actually in it with glimpse(). This reveals that there are ~41 million rows and 12 columns, and shows us a few values.

Exercise 4

Let's determine the total number of checkouts. In the Console, pipe seattle_csv to summarise(). Within summarise(), add the argument Checkouts = sum(Checkouts).

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)
seattle_csv |>
  summarise(Checkouts = sum(Checkouts))

This is quick, because no computations have been performed. R has just confirmed that this code should work. To run the calculation, you need to use collect().

Exercise 5

We can start to use this dataset with dplyr verbs, using collect() to force arrow to perform the computation and return some data. In the Console, add collect() to the end of the previous pipe. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv |>
  summarise(Checkouts = sum(Checkouts)) |>
  collect()

There were more than 144 million checkouts.

Exercise 6

To determine total checkouts by year, add .by = CheckoutYear to the call to summarise() in the Console. Keep collect() at the end and then continue the pipe to ggplot(), mapping CheckoutYear to the x-axis and Checkouts to the y-axis. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv |>
  summarise(Checkouts = sum(Checkouts), .by = CheckoutYear) |>
  collect() |>
  ggplot(aes(x = CheckoutYear, y = Checkouts))

Thanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow, taking several seconds to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format.

Exercise 7

Add geom_point() and geom_line() layers. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
seattle_csv |>
  summarise(Checkouts = sum(Checkouts), .by = CheckoutYear) |>
  collect() |>
  ggplot(aes(x = CheckoutYear, y = Checkouts)) +
  geom_point() +
  geom_line()

You should now have a nice-looking graph displaying total checkouts by year.

Exercise 8

Add labels to your graph.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
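One possible answer (a sketch; the exact labels are up to you, so the labs() arguments below are just placeholders):

seattle_csv |>
  summarise(Checkouts = sum(Checkouts), .by = CheckoutYear) |>
  collect() |>
  ggplot(aes(x = CheckoutYear, y = Checkouts)) +
  geom_point() +
  geom_line() +
  labs(title = "Seattle library checkouts by year",
       x = "Checkout year",
       y = "Total checkouts")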

Next we will be dealing with the parquet format.

The parquet format

To make this data easier to work with, let's switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.

Exercise 1

In the Console, set pq_path to paste0(tempdir(), "/seattle-library-checkouts"). CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
pq_path <- paste0(tempdir(), "/seattle-library-checkouts")

Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data.

Exercise 2

In the Console, start a pipe with seattle_csv. CP/CR.

Again, instead of just typing everything directly into the Console, you might open a temporary R script and work from there, submitting to the Console with Cmd/Ctrl + Enter.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv

Parquet files are usually smaller than the equivalent CSV file. Parquet relies on efficient encodings to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory.
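
To see the size difference for yourself on data small enough not to matter, here is a sketch (using ggplot2's built-in diamonds data, not the library data) that writes the same tibble to both formats and compares file sizes:

# Sketch: write the same small dataset as CSV and as parquet, compare sizes.
# Uses ggplot2's diamonds data, not the Seattle data.
csv_tmp <- tempfile(fileext = ".csv")
pq_tmp  <- tempfile(fileext = ".parquet")

write_csv(ggplot2::diamonds, csv_tmp)
write_parquet(ggplot2::diamonds, pq_tmp)

file.size(csv_tmp) / file.size(pq_tmp)  # parquet is noticeably smaller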

Exercise 3

In the Console, continue the pipe with group_by(). Within group_by(), add the argument CheckoutYear.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_csv |>
  group_by(CheckoutYear)

Parquet files have a rich type system. A CSV file does not provide any information about column types. For example, a CSV reader has to guess whether "08-10-2022" should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.
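
Here is a small sketch of what "recording the type along with the data" means in practice, using a hypothetical toy tibble rather than the library data:

# Hypothetical toy data: a date column survives a parquet round trip as a date,
# whereas a CSV reader would have to re-guess its type on the way back in.
df <- tibble(when = as.Date("2022-10-08"), n = 1L)

pq_file <- tempfile(fileext = ".parquet")
write_parquet(df, pq_file)
read_parquet(pq_file)  # `when` comes back as a date, `n` as an integer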

Exercise 4

In the Console, continue the pipe with write_dataset(). Within write_dataset() add two arguments: path = pq_path and format = "parquet". This will take a minute but it pays off as it makes future operations much faster.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path,
                format = "parquet")

Parquet files are “column-oriented”. This means that they’re organized column-by-column, much like R’s data frame. This typically leads to better performance for data analysis tasks compared to CSV files, which are organized row-by-row.

Parquet files are “chunked”, which makes it possible to work on different parts of the file at the same time and, if you’re lucky, to skip some chunks altogether.

Exercise 5

Let’s take a look at what we just produced. Run this code in the Console. CP/CR.

tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)

Our single 9 GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the Apache Hive project. Hive-style partitions name folders with a “key = value” convention, so as you might guess, the CheckoutYear = 2005 directory contains all the data where CheckoutYear is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, less than half the size of the original CSV file. This is as we expect since parquet is a much more efficient format.
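
Note that the group_by(CheckoutYear) step before write_dataset() is what produced those Hive-style folders. A sketch of an equivalent way to say the same thing is write_dataset()'s partitioning argument:

# Sketch: partition by CheckoutYear directly, instead of grouping first.
seattle_csv |>
  write_dataset(path = pq_path,
                format = "parquet",
                partitioning = "CheckoutYear")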

Using dplyr with arrow

Exercise 1

Now that we’ve created these parquet files, we’ll need to read them in again. We use open_dataset() again, but this time we give it a directory.

In the Console, create seattle_pq by calling open_dataset() with pq_path as its argument. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
seattle_pq <- open_dataset(pq_path)

As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.

Exercise 2

Run seattle_pq at the Console. Again, since we are building a pipe, you probably want to write seattle_pq in a temporary R script and then send it to the Console with Cmd/Ctrl + Enter. However you do it, CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_pq

There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation.

Exercise 3

Continue the pipe with filter(). Within filter(), add two arguments: CheckoutYear >= 2018 and MaterialType == "BOOK". Make sure to separate them with a comma. Send to the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") 

As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.

Exercise 4

Continue the pipe with group_by(). Within group_by(), add the arguments CheckoutYear, CheckoutMonth.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth)

Writing dplyr code for arrow data is conceptually similar to dbplyr: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call collect().
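
If you are curious what that query looks like before execution, recent versions of arrow can print the plan they have built (an aside, not needed for the exercises):

# Aside (assumes a reasonably recent arrow version): show_exec_plan() prints
# the execution plan arrow has built, before any data is collected.
seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  show_exec_plan()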

Exercise 5

Continue the pipe with summarize(). Within summarize(), add TotalCheckouts = sum(Checkouts). Send to the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts))

This will create a new column called TotalCheckouts that is the sum of checkouts for that year/month.

Exercise 6

Continue the pipe with arrange(). Within arrange(), add the arguments CheckoutYear and CheckoutMonth. Send to the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)

This will arrange the data so that the earliest checkout year/month is at the top of the tibble.

Exercise 7

Assign the whole pipe to query. That is, in your temporary R script, you should have:

query <- seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)

Send this to the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)
query <- seattle_pq |> 
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)

If we print out the query object we can see a little information about what we expect Arrow to return when the execution takes place.

Exercise 8

In the Console, run query. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)
query

Exercise 9

Pipe query to collect() in the Console. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 10)
query |>
  collect()

Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in ?acero.
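
If you do run into an expression that arrow cannot translate, a common workaround, sketched below, is to shrink the data in arrow, collect() the result, and finish with ordinary dplyr on the in-memory tibble:

# General pattern (sketch): reduce the data in arrow, then collect() and use
# any R function on the resulting in-memory tibble.
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  select(Title, Checkouts) |>
  collect() |>
  slice_max(Checkouts, n = 10)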

Exercise 10

Let’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large CSV. Run this code in your Console.

seattle_csv |> 
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |> 
  system.time()

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)
seattle_csv |> 
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |> 
  system.time()

This is slower than we would like. But switching to parquet files will help: partitioning improves performance because this query filters on CheckoutYear == 2021, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.

Exercise 11

Now let’s use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files. Run this code in your Console.

seattle_pq |> 
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |> 
  system.time()

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)
seattle_pq |> 
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |> 
  system.time()

The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata mean that arrow only needs to read the four columns actually used in the query (CheckoutYear, MaterialType, CheckoutMonth, and Checkouts).

This massive difference in performance is why it pays off to convert large CSVs to parquet!
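
If you want a more careful comparison than a single system.time() run, the bench package (an extra dependency, assumed to be installed) can run both versions several times. A sketch:

# Sketch: time the CSV-backed and parquet-backed versions of the same query.
# Assumes the bench package is installed.
bench::mark(
  csv = seattle_csv |>
    filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    group_by(CheckoutMonth) |>
    summarize(TotalCheckouts = sum(Checkouts)) |>
    collect(),
  parquet = seattle_pq |>
    filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    group_by(CheckoutMonth) |>
    summarize(TotalCheckouts = sum(Checkouts)) |>
    collect(),
  check = FALSE  # row order may differ between the two backends
)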

Exercise 12

There’s one last advantage of parquet and arrow — it’s very easy to turn an arrow dataset into a DuckDB database by calling arrow::to_duckdb(). Run this code in your Console.

seattle_pq |> 
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)
seattle_pq |> 
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()

The neat thing about to_duckdb() is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.
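
As an aside, and assuming a recent arrow version, to_arrow() goes the other direction, handing a duckdb table back to arrow, again without copying the data. A sketch:

# Sketch: hop from arrow to duckdb and back again without copying the data.
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  to_arrow() |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  collect()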

Cleaning up

This step is probably unnecessary since R removes the temporary directory it created for the current session every time it exits. But cleaning things up ourselves can't hurt.

Exercise 1

Now we need to remove the 9 GB dataset we downloaded, along with all the other files and directories we created. Press "Run Code".

unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)

Summary

This tutorial covered Chapter 22: Arrow from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We learned about the arrow package for working with Apache Arrow, a cross-language development platform for in-memory and larger-than-memory data. Key functions included open_dataset() and write_dataset().

Read this arrow package vignette for another example of how to work with a large dataset.



