knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(dplyr)

As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you'd want to use dplyr with a database is that either your data is already in a database (and you don't want to work with static csv files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (SQLite, MySQL and PostgreSQL), as well as Google's BigQuery.

Since R almost exclusively works with in-memory data, if you do have a lot of data in a database, you can't just dump it into R. Instead, you'll have to work with subsets or aggregates. dplyr aims to make this task as easy as possible. If you're working with large data, it's also likely that you'll need support to get the data into the database and to ensure you have the right indices for good performance. While dplyr provides some simple tools to help with these tasks, they are no substitute for a local expert.

The motivation for supporting databases in dplyr is that you never pull down the right subset or aggregate from the database on your first try. Usually you have to iterate between R and SQL many times before you get the perfect dataset. But because switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), dplyr helps you by allowing you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function; that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst.

To get the most out of this chapter, you'll need to be familiar with querying SQL databases using the SELECT statement. If you have some familiarity with SQL and you'd like to learn more, I found "How indexes work in SQLite" and "10 easy steps to a complete understanding of SQL" to be particularly helpful.

Getting started

The easiest way to experiment with databases using dplyr is to use SQLite. This is because everything you need is already included in the R package: you won't need to install anything, and you won't need to deal with the hassle of setting up a database server. Creating an SQLite database is really easy: just give the path and the ok to create it.

my_db <- src_sqlite("my_db.sqlite3", create = T)

The main new concept here is the src, which is a collection of database tables. Use src_sqlite(), src_mysql(), src_postgres() and src_bigquery() to connect to the specific types of database supported by dplyr.

my_db currently has no data in it, so we'll import the flights data using the convenient copy_to() function. This is a quick and dirty way of getting data into a database, but because all the data has to flow through R, it is not suitable for very large datasets.

library(nycflights13)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

As you can see, the copy_to() operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, by carrier and by plane. copy_to() also executes the SQL ANALYZE command: this ensures that the database has up-to-date table statistics, so its query planner can make appropriate optimisations.
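If you want to double check which indexes were created, one option is to query SQLite's sqlite_master table through the src's DBI connection. This is a minimal sketch that assumes the connection is exposed as my_db$con, as it is for src_sqlite:

# List the indexes that copy_to() created on the flights table
DBI::dbGetQuery(my_db$con,
  "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'flights'")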

For this particular dataset, there's a built-in src that will cache flights in a standard location:

flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
flights_sqlite

You can also use arbitrary SQL:

tbl(my_db, sql("SELECT * FROM flights"))

Basic verbs

Whether you're dealing with remote or local data sources, you use the same five verbs:

select(flights_sqlite, year:day, dep_delay, arr_delay)
filter(flights_sqlite, dep_delay > 240)
arrange(flights_sqlite, year, month, day)
mutate(flights_sqlite, speed = air_time / distance)
summarise(flights_sqlite, delay = mean(dep_time))

The most important difference is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database. While the translations for the most common operations are almost perfect there are some limitations, which you'll learn about later.

Laziness

When working with databases, dplyr tries to be as lazy as possible: it never pulls data into R unless you explicitly ask for it, and it delays doing any work until the last possible moment, collecting together everything you want to do and then sending it to the database in one step.

For example, take the following code:

c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)

Surprisingly, this sequence of operations never actually touches the database. It's not until you ask for the data (e.g. by printing c4) that dplyr generates the SQL and requests the results from the database. Even then it only pulls down 10 rows.

c4

To pull down all the results use collect(), which returns a tbl_df():

collect(c4)

You can see the query dplyr has generated by looking at the query component of the object:

c4$query

You can also ask the database how it plans to execute the query with explain(). The output for SQLite is described in more detail on the SQLite website. It's helpful if you're trying to figure out which indexes are being used.

explain(c4)

Forcing computation

There are three ways to force the computation of a query:

collect() executes the query and returns the results to R; compute() executes the query and stores the results in a temporary table in the database; collapse() turns the query into a table expression (a subquery) that later operations build on. collect() is the function you'll use most: once you reach the set of operations you want, use collect() to pull the data into a local tbl_df(). If you know SQL, you can use compute() and collapse() to optimise performance.
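As a minimal sketch, using the c4 query built above:

flights_local <- collect(c4)   # executes the query and returns the results to R as a tbl_df
flights_tmp   <- compute(c4)   # executes the query and stores the results in a temporary database table
flights_sub   <- collapse(c4)  # wraps the current query in a subquery, without executing it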

Performance considerations

dplyr tries to prevent you from accidentally performing expensive query operations. For example, because there's generally no way to determine how many rows a query will return without actually running it, nrow() is always NA; and because you can't find the last few rows without executing the whole query, you can't use tail().
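A quick illustration of both restrictions, using the flights_sqlite table from above:

nrow(flights_sqlite)    # NA: the row count isn't known until the query is executed
head(flights_sqlite, 5) # fine: translates to a query with a LIMIT clause
# tail(flights_sqlite)  # errors: the last rows can't be found without running the whole query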

SQL translation

When performing the simple mathematical operations used for filtering, mutating and summarising, translating R code to SQL (or indeed to any programming language) is relatively straightforward.

To experiment with this, use translate_sql(). The following examples work through some of the basic differences between R and SQL.

# In SQLite variable names are escaped by double quotes:
translate_sql(x)
# And strings are escaped by single quotes
translate_sql("x")

# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))
translate_sql(x ^ 2 < 10)
translate_sql(x %% 2 == 10)

# R and SQL have different defaults for integers and reals.
# In R, 1 is a real, and 1L is an integer
# In SQL, 1 is an integer, and 1.0 is a real
translate_sql(1)
translate_sql(1L)

dplyr knows how to convert many common R functions to SQL, including basic mathematical operators (+, -, *, /, %%, ^), mathematical functions (such as abs, log, exp and sqrt), logical comparisons (<, <=, !=, >=, >, ==, %in%), boolean operations (&, &&, |, ||, !, xor) and basic aggregations (mean, sum, min, max, sd, var).
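A few illustrative translations (the exact SQL varies by database backend):

translate_sql(mean(x))        # translated using AVG()
translate_sql(x %in% c(1, 2)) # translated to an IN clause
translate_sql(is.na(x))       # translated to an IS NULL test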

The basic techniques that underlie the implementation of translate_sql() are described in the Advanced R book. translate_sql() is built on top of R's parsing engine and has been carefully designed to generate correct SQL. It also protects you against SQL injection attacks by correctly escaping the strings and variable names needed by the database that you're connecting to.

Perfect translation is not possible because databases don't have all the functions that R does. The goal of dplyr is to provide a semantic rather than a literal translation: what you mean rather than what is done. In fact, even for functions that exist both in databases and R, you shouldn't expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, mean() loops through the data twice. R's mean() also provides a trim option for computing trimmed means; this is something that databases do not provide. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. This means the essence of simple calls like mean(x) will be translated accurately, but more complicated calls like mean(x, trim = 0.5, na.rm = TRUE) will raise an error:

translate_sql(mean(x, trim = T))
# Error: Invalid number of args to SQL AVG. Expecting 1

Any function that dplyr doesn't know how to convert is left as is. This means that database functions that are not covered by dplyr can be used directly via translate_sql(). Here are a couple of examples that will work with SQLite:

translate_sql(glob(x, y))
translate_sql(x %like% "ab*")

Grouping

SQLite lacks the window functions that are needed for grouped mutation and filtering. This means that the only really useful operations for grouped SQLite tables are found in summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don't supply na.rm = TRUE.

by_tailnum <- group_by(flights_sqlite, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance),
  delay = mean(arr_delay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)

Other databases do support window functions. You can learn about them in the corresponding vignette. It's sometimes possible to simulate grouped filtering and mutation using self joins, which join the original table with a summarised version, but that topic is beyond the scope of this introduction.
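As a minimal sketch of that idea with SQLite, you can join the original table against a summarised version of itself; for example, to keep only the flights from planes that appear more than 100 times:

# Planes with more than 100 flights, computed from the grouped table above
frequent <- filter(summarise(by_tailnum, n = n()), n > 100)
# Keep only the flights from those planes
flights_frequent <- semi_join(flights_sqlite, frequent, by = "tailnum")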

Other databases

Aside from SQLite, the overall workflow is essentially the same regardless of the database you're connecting to. The following sections go into more detail about the peculiarities of each database engine. All of these databases follow a client-server model: one computer (the client) connects to the database and another computer (the server) runs it; the two may be one and the same, but usually aren't. Getting one of these databases up and running is beyond the scope of this article, but there are plenty of tutorials available on the web.

PostgreSQL

src_postgres() has five arguments: dbname, host, port, user and password. If you are running a local PostgreSQL database with the default settings you only need dbname, but in most cases you'll need all five. dplyr uses the RPostgreSQL package to connect to Postgres databases, which means that you can't currently connect to remote databases that require an SSL connection (e.g. Heroku).

For example, the following code allows me to connect to a local PostgreSQL database that contains a copy of the flights data:

flights_postgres <- tbl(src_postgres("nycflights13"), "flights")
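For a remote server you would typically supply all five arguments; the values below are placeholders rather than a real server:

flights_postgres_remote <- tbl(
  src_postgres(dbname = "nycflights13", host = "db.example.com", port = 5432,
    user = "analyst", password = "secret"),
  "flights")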

PostgreSQL is a considerably more powerful database than SQLite: it has a much wider range of built-in functions, and it supports the window functions needed for grouped mutates and filters.

The following examples show how we can perform grouped filter and mutate operations with PostgreSQL. Because you can't filter on window functions directly, they have to go in a subquery, so the SQL generated from the grouped filter is quite complex.

daily <- group_by(flights_postgres, year, month, day)

# Find the most and least delayed flight each day
bestworst <- daily %>% 
  select(flight, arr_delay) %>% 
  filter(arr_delay == min(arr_delay) || arr_delay == max(arr_delay))
bestworst$query

# Rank each flight within each day
ranked <- daily %>% 
  select(arr_delay) %>% 
  mutate(rank = rank(desc(arr_delay)))
ranked$query

MySQL and MariaDB

You can connect to MySQL and MariaDB (a recent fork of MySQL) using src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll need to provide a dbname, username, password, host, and port.
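For example, a connection might look like this (the host, user and password below are placeholders):

flights_mysql <- tbl(
  src_mysql("nycflights13", host = "db.example.com", port = 3306,
    user = "analyst", password = "secret"),
  "flights")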

In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. It provides a wider range of built-in functions, but it does not support window functions (so you can't do grouped mutates and filters).

BigQuery

BigQuery is a hosted database server provided by Google. To connect, you need to provide your project, dataset and, optionally, a project for billing (if billing isn't enabled for your main project). After you create the src, your web browser will open and ask you to authenticate. Your credentials are stored in a local cache, so you should only need to do this once.
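A connection might look something like this (the project and dataset names below are placeholders):

flights_bq <- tbl(
  src_bigquery(project = "my-gcp-project", dataset = "nycflights13"),
  "flights")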

BigQuery supports only one SQL statement: SELECT. Fortunately this is all you need for data analysis. Within SELECT, BigQuery provides comprehensive coverage at a similar level to PostgreSQL.

Picking a database

If you don't already have a database, here's some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. Don't bother with MySQL/MariaDB: it's a pain to set up and the documentation is subpar. Google BigQuery might be a good fit if you have very large data, or if you're willing to pay (a small amount of) money to someone who'll look after your database.


