library(learnr)
library(tutorial.helpers)
library(DBI)
library(dbplyr)
library(tidyverse)
library(duckdb)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

my_sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"

# We use diamonds_db in a couple of places, so we need a permanent version of it.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")

# Copying all of nycflights13 takes a long time, so we only want to do it once,
# here, and then use that connection in the SQL basics section. We need a way to
# warn students that, when the tutorial is knitting and gets stuck at 1% for a
# minute or two, there is nothing to worry about.
con_nyc <- DBI::dbConnect(duckdb::duckdb())
dbplyr::copy_nycflights13(con_nyc)
flights <- tbl(con_nyc, "flights")
planes <- tbl(con_nyc, "planes")

# Might consider a way to delete these connections once the tutorial is over.
# Could have a question with filled-in code, which the student could run, that
# deletes the connections. Is that useful?
This tutorial covers Chapter 21: Databases from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to use the DBI and dbplyr packages to connect to databases and execute SQL.
In this tutorial, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you the dbplyr package, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL.
Run library()
for the DBI package.
library(...)
library(DBI)
At the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a data frame, a database table is a collection of named columns, where every value in the column is the same type.
Run library()
for the dbplyr package.
library(...)
library(dbplyr)
There are three high level differences between data frames and database tables:
Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).
Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables (from the data.table package) do, which is one of the reasons that they’re so fast.
Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.
Run library()
for the tidyverse package.
...(tidyverse)
library(tidyverse)
Databases are run by database management systems (DBMS’s for short), which come in three basic forms:
Client-server DBMS’s run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organization.
Cloud DBMS’s are similar to client-server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
In-process DBMS’s run entirely on your computer. They’re great for working with large datasets where you’re the primary user.
Run library()
for the duckdb package.
...(duckdb)
library(duckdb)
Setting up a client-server or cloud DBMS would be a pain for this tutorial, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb.
Run ?duckdb
in the console after loading the package. Copy/paste the description.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.
Run library()
for the nycflights13 package.
...(nycflights13)
library(nycflights13)
We will use data from the nycflights13 package.
Run ?flights
in the console after loading the nycflights13 package. Copy/paste the description.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
In this tutorial, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries and then executes them with DBI.
To connect to the database from R, you’ll use a pair of packages:
You’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.
You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for PostgreSQL and RMariaDB for MySQL.
You create a database connection using dbConnect()
. The first argument selects the DBMS, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
Run dbConnect()
with the argument duckdb()
.
dbConnect(...())
# Need to close any connections you create, and it seems like the only way to
# do that (?) is to explicitly assign the connection to an object.
con <- dbConnect(duckdb())
dbDisconnect(con, shutdown = TRUE)
If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS's. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.
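For example, a minimal sketch of an odbc connection, assuming you have already installed a driver and configured a data source (the name "my_datasource" here is a hypothetical placeholder):

# "my_datasource" is a hypothetical data source name defined in your ODBC
# configuration; replace it with whatever your setup provides.
con <- DBI::dbConnect(odbc::odbc(), dsn = "my_datasource")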
Since we loaded the DBI and duckdb packages, we don't need to specify them using the ::
notation when we call dbConnect
(which makes the connection) and duckdb()
(which creates a temporary database). But, for safety, most of the time we do include the package name in these calls. So, copy the code from the previous question and replace dbConnect
with DBI::dbConnect
and duckdb()
with duckdb::duckdb()
.
DBI::dbConnect(...::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
dbDisconnect(con, shutdown = TRUE)
This might create a warning message. If so, ignore it.
The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (database administrator). The initial setup will often take a little fiddling (and maybe some googling) to get it right, but you’ll generally only need to do it once.
Assign the result of the call to DBI::dbConnect()
to an object named con
. (There is nothing special about the name con
, which stands for connection. But using con
to identify the connection object is fairly standard.)
... <- DBI::dbConnect(duckdb::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
dbDisconnect(con, shutdown = TRUE)
Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R.
But it's also a problem for a tutorial like this one because each Exercise is evaluated in isolation from everything else in the tutorial. That means we need to establish the connection in each Exercise. No worries though! We will provide the code for doing so, as needed.
If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb()
where to save it. Assuming you’re using an R project, it’s reasonable to store it in a duckdb directory for the current project.
To do so, add dbdir = "duckdb"
as the second argument to your call to dbConnect()
.
con <- DBI::dbConnect(duckdb::duckdb(), ... = "duckdb")
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
dbDisconnect(con, shutdown = TRUE)
invisible(file.remove(c("duckdb", "duckdb.wal")))
We won't use a permanent database for the rest of this tutorial.
duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed.
Let's write the mpg
data to the database we just created. Use dbWriteTable()
and within it put the arguments con
, "mpg"
, and ggplot2::mpg
.
con <- DBI::dbConnect(duckdb::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(..., "mpg", ...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)
dbDisconnect(con, shutdown = TRUE)
When we use dbWriteTable()
we need to supply three arguments: a database connection, the name of the table to create in the database, and a data frame of data.
Let's write the diamonds
data to the database we just created. Use dbWriteTable()
and within it put the arguments con
, "diamonds"
, and ggplot2::diamonds
.
con <- DBI::dbConnect(duckdb::duckdb())
con <- DBI::dbConnect(duckdb::duckdb())
...(con, "...", ggplot2::diamonds)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbDisconnect(con, shutdown = TRUE)
We use a package tailored for the DBMS we’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres
for PostgreSQL and RMariaDB
for MySQL.
You can check that the data is loaded correctly by using the dbListTables()
function that lists all tables in the database. Run dbListTables()
on con
.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
...
dbListTables(...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbListTables(con)
dbDisconnect(con, shutdown = TRUE)
If you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv()
and duckdb_register_arrow()
. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
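For example, here is a minimal sketch of duckdb_read_csv(); the file path is a hypothetical placeholder:

con <- DBI::dbConnect(duckdb::duckdb())
# Load a CSV straight into a new "flights_raw" table without first reading it
# into R. "path/to/flights.csv" is a hypothetical placeholder path.
duckdb::duckdb_read_csv(con, "flights_raw", "path/to/flights.csv")
dbDisconnect(con, shutdown = TRUE)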
Note that we will need to recreate the connection, and load any required data, for each new Exercise. But don't worry! We will supply the necessary code.
Let's use another DBI function: dbReadTable()
. Start a pipe with con
and use dbReadTable()
with "mpg"
as the only argument.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)
...
con |> dbReadTable(...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)
con |> dbReadTable("mpg")
dbDisconnect(con, shutdown = TRUE)
dbReadTable()
returns a data frame, not a tibble.
Start a pipe with con
and use dbReadTable()
with "diamonds"
as the only argument. Continue the pipe with as_tibble()
, thereby returning the data in a nice tibble format.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
...
con |>
  dbReadTable("diamonds") |>
  ...()
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
con |>
  dbReadTable("diamonds") |>
  as_tibble()
dbDisconnect(con, shutdown = TRUE)
Connection requests often look more like these examples:
con <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  username = "foo"
)

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  hostname = "databases.mycompany.com",
  port = 1234
)
You sometimes also need to provide a login and password. In general, the owner of the database will provide necessary details.
Now let's use SQL to select specific variables from the database. Run this code.
my_sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"
my_sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"
This SQL code selects 5 specific variables from the diamonds
dataset, and then filters the rows to only keep those with price > 15000
.
SQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organizations.
dbGetQuery()
is a function which takes a connection and a SQL query (as a text string) as its inputs. It returns the result of that query as a data frame.
Type dbGetQuery()
with the arguments con
and my_sql.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
...
dbGetQuery(..., ...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbGetQuery(con, my_sql)
dbDisconnect(con, shutdown = TRUE)
If you’ve never seen SQL before, don’t worry! You’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000.
The last Exercise returned a data frame, which can be annoying, especially when there are a lot of rows, as in this case. Reuse the code, but pipe the result from dbGetQuery()
into as_tibble()
.
...
dbGetQuery(con, my_sql) |>
  ...()
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbGetQuery(con, my_sql) |>
  as_tibble()
dbDisconnect(con, shutdown = TRUE)
Practical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
Now that we’ve connected to a database and loaded up some data, we can start to learn about dbplyr. dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates your code to SQL; other backends include dtplyr, which translates to data.table, and multidplyr, which executes your code on multiple cores.
Run tbl()
on the two arguments: con
and "diamonds"
.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
...
tbl(..., ...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
tbl(con, "diamonds")
dbDisconnect(con, shutdown = TRUE)
tbl()
is a function from the dplyr package which creates a table object. Note that a table is not the same thing as a tibble. There may also be a message at the top:
# Source:   table<diamonds> [?? x 10]
# Database: DuckDB v0.10.0 [root@Darwin 23.4.0:R 4.3.2/:memory:]
Your "Database" information will probably be different.
Assign the result of your call to tbl()
to an object called diamonds_db
.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
...
... <- tbl(con, "diamonds")
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")
dbDisconnect(con, shutdown = TRUE)
There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organized. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:
diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
Other times you might want to use your own SQL query as a starting point:
diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
This object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, on a new line, pipe diamonds_db
to filter(price > 15000)
and select(carat:clarity, price)
.
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")
...
diamonds_db |>
  ...(price > 15000) |>
  select(carat:clarity, ...)
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price)
dbDisconnect(con, shutdown = TRUE)
The DBMS name may be printed at the top. If so, it tells you the number of columns, but it typically won't report the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.
Using the code from the previous Exercise, extend the pipe by adding show_query()
.
...
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price) |>
  ...
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price) |>
  show_query()
dbDisconnect(con, shutdown = TRUE)
You can see the SQL code generated by the dplyr function show_query()
. If you know dplyr, this is a great way to learn SQL! Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.
To get all the data back into R, you call collect()
. Behind the scenes, this generates the SQL, calls dbGetQuery()
to get the data, then turns the result into a tibble.
Using the code from the previous Exercise, extend the pipe by adding collect()
.
...
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price) |>
  show_query() |>
  ...
con <- DBI::dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db <- tbl(con, "diamonds")
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price) |>
  show_query() |>
  collect()
dbDisconnect(con, shutdown = TRUE)
Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect()
the data to get an in-memory tibble, and continue your work with pure R code.
The rest of this tutorial will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL, but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.
In the background, we have already run this code:
con_nyc <- DBI::dbConnect(duckdb::duckdb())
dbplyr::copy_nycflights13(con_nyc)
Type con_nyc
. Hit "Run Code".
con_nyc
con_nyc
This code is why the tutorial took so long to load. con_nyc
is a connection, just like the ones we have created before. copy_nycflights13()
is a utility function which copies some of the tibbles from the nycflights13 package into tables in the duckdb database we have just created.
We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes. Run dbListTables()
on con_nyc
.
dbListTables(...)
dbListTables(con_nyc)
Assign the result of tbl() to flights. Within tbl(), use con_nyc and "flights" as the two arguments. On the next line, assign the result of tbl() to planes. Within tbl(), use con_nyc and "planes" as the two arguments.
flights <- tbl(..., ...)
... <- ...(con_nyc, "planes")
flights <- tbl(con_nyc, "flights")
planes <- tbl(con_nyc, "planes")
Again, these objects are not regular tibbles. They are tables (R class tbl
), connections to the specified table in our duckdb database.
The top-level components of SQL are called statements. Common statements include CREATE
for defining new tables, INSERT
for adding data, and SELECT
for retrieving data. We will focus on SELECT
statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.
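For a rough sense of what each statement type looks like, here is a hedged sketch run through DBI (the dests table and its columns are invented for illustration):

con <- DBI::dbConnect(duckdb::duckdb())
# CREATE: define a new table
dbExecute(con, "CREATE TABLE dests (dest TEXT, name TEXT)")
# INSERT: add data
dbExecute(con, "INSERT INTO dests VALUES ('IAH', 'Houston Intercontinental')")
# SELECT: retrieve data
dbGetQuery(con, "SELECT * FROM dests")
dbDisconnect(con, shutdown = TRUE)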
Pipe flights
to show_query()
.
flights |> ...()
flights |> show_query()
A query is made up of clauses. There are five important clauses: SELECT
, FROM
, WHERE
, ORDER BY
, and GROUP BY
. Every query must have the SELECT
and FROM
clauses and the simplest query is SELECT * FROM table
, which selects all columns from the specified table.
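For instance, a minimal sketch of running the simplest query against the flights table we loaded earlier (note that it pulls every row into R, so you wouldn't normally do this with a big table):

dbGetQuery(con_nyc, "SELECT * FROM flights") |>
  as_tibble()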
WHERE
and ORDER BY
control which rows are included and how they are ordered. Insert two new lines in your current (simple) pipe: filter(dest == "IAH")
and arrange(dep_delay)
. Don't forget |>
between the steps of the pipe.
flights |>
  ...(dest == "IAH") |>
  arrange(...) |>
  show_query()
flights |>
  filter(dest == "IAH") |>
  arrange(dep_delay) |>
  show_query()
show_query()
allows us to see how the dplyr code is translated into SQL.
SQL is either pronounced “s”-“q”-“l” or “sequel.”
GROUP BY
converts the query to a summary, causing aggregation to happen. Remove the middle two lines from the previous pipe, replacing them with group_by(dest)
and summarize(dep_delay = mean(dep_delay, na.rm = TRUE))
.
flights |>
  group_by(...) |>
  ...(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  show_query()
flights |>
  group_by(dest) |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  show_query()
Study the SQL code returned by show_query()
. Compare it to the original dplyr code.
In SQL, case doesn’t matter: you can write select
, SELECT
, or even SeLeCt
. In this tutorial we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.
The SELECT
clause is the workhorse of queries and performs the same job as select()
, mutate()
, rename()
, relocate()
, and, as you’ll learn in a later Exercise, summarize()
.
Start a new pipe with planes
. Second line is select(tailnum, type, manufacturer, model, year)
. Finish the pipe with show_query()
.
planes |>
  ...(tailnum, type, manufacturer, model, year) |>
  show_query()
planes |>
  select(tailnum, type, manufacturer, model, year) |>
  show_query()
In SQL, order matters: you must always write the clauses in the order SELECT
, FROM
, WHERE
, GROUP BY
, ORDER BY
. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first FROM
, then WHERE
, GROUP BY
, SELECT
, and ORDER BY
.
select()
, rename()
, and relocate()
have very direct translations to SELECT
as they just affect where a column appears (if at all) along with its name. Add rename(year_built = year)
to the previous pipeline, right before show_query()
.
planes |>
  select(tailnum, type, manufacturer, model, year) |>
  ...(year_built = ...) |>
  show_query()
planes |>
  select(tailnum, type, manufacturer, model, year) |>
  rename(year_built = year) |>
  show_query()
In the examples above, note that "year" and "type" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
When working with other databases you’re likely to see every variable name in quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.
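Against such a DBMS, the select() query from the earlier Exercise might come back fully quoted, something like this sketch:

SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"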
In the code from the previous Exercise, replace rename(year_built = year)
with relocate(manufacturer, model, .before = type)
.
planes |>
  select(tailnum, type, manufacturer, model, year) |>
  ...(manufacturer, model, ... = type) |>
  show_query()
planes |>
  select(tailnum, type, manufacturer, model, year) |>
  relocate(manufacturer, model, .before = type) |>
  show_query()
Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this tutorial are very similar between DBMS’s, there are many minor variations.
The translations for mutate()
are similarly straightforward: each variable becomes a new expression in SELECT
. Pipe flights
to mutate(speed = distance / (air_time / 60))
and then finish with show_query()
.
flights |>
  mutate(speed = distance / (air_time / 60)) |>
  show_query()
flights |>
  mutate(speed = distance / (air_time / 60)) |>
  show_query()
This example also shows you how SQL does renaming. In SQL terminology, renaming is called aliasing and is done with AS
. Note that unlike mutate()
, the old name is on the left and the new name is on the right.
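For instance, the mutate() above translates to SQL roughly like this sketch (dbplyr's actual output also quotes the identifiers):

SELECT *, distance / (air_time / 60.0) AS speed
FROM flights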
Let's start working with diamonds_db
again. group_by()
is translated to the GROUP BY
clause and summarize()
is translated to the SELECT
clause. Run this code:
diamonds_db |>
  group_by(cut) |>
  summarize(
    n = n(),
    avg_price = mean(price, na.rm = TRUE)
  ) |>
  show_query()
diamonds_db |>
  group_by(cut) |>
  summarize(
    n = n(),
    avg_price = mean(price, na.rm = TRUE)
  ) |>
  show_query()
Note the use of SQL functions like COUNT()
and AVG()
.
The FROM
clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.
filter()
is translated to the WHERE
clause. Pipe flights
to filter()
with the argument set to dest == "IAH" | dest == "HOU"
. Finish the pipe with show_query()
.
... |>
  filter(dest == "..." | ... == "HOU") |>
  ...()
flights |>
  filter(dest == "IAH" | dest == "HOU") |>
  show_query()
Note the following differences, illustrated in the sketch after this list:
|
becomes OR
and &
becomes AND
.
SQL uses =
for comparison, not ==
. SQL doesn’t have assignment, so there’s no potential for confusion there.
SQL uses only ''
for strings, not ""
. In SQL, ""
is used to identify variables, like R’s ``
.
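Putting these conventions together, the WHERE clause generated for the dest filter above looks along these lines (a sketch of the generated SQL):

WHERE dest = 'IAH' OR dest = 'HOU'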
Another useful SQL operator is IN
, which is very close to R’s %in%
. Use the code from the previous Exercise, but replace dest == "IAH" | dest == "HOU"
with dest %in% c("IAH", "HOU")
.
flights |>
  filter(... %in% c("IAH", "...")) |>
  show_query()
flights |>
  filter(dest %in% c("IAH", "HOU")) |>
  show_query()
Study the correspondence between the dplyr code and the SQL code produced by show_query()
if you want to start learning about SQL.
SQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it. Run this code:
flights |>
  group_by(dest) |>
  summarize(delay = mean(arr_delay))
flights |>
  group_by(dest) |>
  summarize(delay = mean(arr_delay))
If you want to learn more about how NULLs work, you might enjoy “The Three-Valued Logic of SQL” by Markus Winand.
In general, you can work with NULLs using the functions you’d use for NAs in R. Pipe flights
to filter(!is.na(dep_delay))
and finish the pipe with show_query()
.
flights |>
  ...(!is.na(...)) |>
  show_query()
flights |>
  filter(!is.na(dep_delay)) |>
  show_query()
This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:
WHERE "dep_delay" IS NOT NULL
Note that if you filter()
a variable that you created using summarize(), dbplyr will generate a HAVING
clause, rather than a WHERE
clause. Run this code.
diamonds_db |>
  group_by(cut) |>
  summarize(n = n()) |>
  filter(n > 100) |>
  show_query()
diamonds_db |>
  group_by(cut) |>
  summarize(n = n()) |>
  filter(n > 100) |>
  show_query()
This is one of the idiosyncrasies of SQL: WHERE
is evaluated before SELECT
and GROUP BY
, so SQL needs another clause that’s evaluated afterwards.
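The generated query therefore has roughly this shape (a sketch; dbplyr's actual output quotes the identifiers):

SELECT cut, COUNT(*) AS n
FROM diamonds
GROUP BY cut
HAVING (COUNT(*) > 100.0)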
Ordering rows involves a straightforward translation from arrange()
to the ORDER BY
clause. Pipe flights
to arrange(year, month, day, desc(dep_delay))
and then complete the pipe with show_query()
.
... |>
  arrange(year, ..., day, ...(dep_delay)) |>
  show_query()
flights |>
  arrange(year, month, day, desc(dep_delay)) |>
  show_query()
Notice how desc() is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.
Sometimes it’s not possible to translate a dplyr pipeline into a single SELECT
statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM
clause, instead of the usual table.
Using the previous code, replace arrange(year, month, day, desc(dep_delay))
with
mutate(
  year1 = year + 1,
  year2 = year1 + 1
)
flights |>
  mutate(
    year1 = ... + 1,
    ... = year1 + 1
  ) |>
  show_query()
flights |>
  mutate(
    year1 = year + 1,
    year2 = year1 + 1
  ) |>
  show_query()
dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT
clause can’t refer to columns that were just created. That is why this simple dplyr code needs two SQL steps: the first (inner) query computes year1
and then the second (outer) query can compute year2
.
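The result has roughly this shape (a sketch; dbplyr names the inner query something like q01 and quotes the identifiers):

SELECT q01.*, year1 + 1.0 AS year2
FROM (
  SELECT flights.*, year + 1.0 AS year1
  FROM flights
) q01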
You’ll also see this if you attempt to filter()
a variable that you just created. Remember, even though WHERE
is written after SELECT
, it’s evaluated before it, so we need a subquery in this (silly) example. Run this code.
flights |>
  mutate(year1 = year + 1) |>
  filter(year1 == 2014) |>
  show_query()
flights |>
  mutate(year1 = year + 1) |>
  filter(year1 == 2014) |>
  show_query()
Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.
If you’re familiar with dplyr’s joins, SQL joins are very similar. Pipe flights
to left_join(planes |> rename(year_built = year), by = "tailnum")
and then complete the pipe with show_query()
.
flights |>
  ...(planes |> rename(... = year), by = "...") |>
  show_query()
flights |>
  left_join(planes |> rename(year_built = year), by = "tailnum") |>
  show_query()
The main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM
clause to bring in additional tables, using ON
to define how the tables are related.
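The core of the generated query has roughly this shape (a sketch; the real output lists every column explicitly and handles the year_built rename in a subquery):

SELECT *
FROM flights
LEFT JOIN planes
  ON (flights.tailnum = planes.tailnum)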
dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join()
, right_join()
, and full_join()
.
Go to the console, install and load the dm package. CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys.
Run ?dm
in the console. Copy/paste the description.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
If you hit this scenario, the dm package is a lifesaver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.
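As a hedged sketch of the workflow (dm_from_con() learns keys from constraint metadata; our toy duckdb database has no constraints, so this is purely illustrative):

library(dm)
# Learn the tables, and any key constraints, from an existing connection...
flights_dm <- dm_from_con(con_nyc)
# ...then draw a diagram of how the tables connect.
dm_draw(flights_dm)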
Load the dbplyr package and run ?dbplyr
in the console. Copy/paste the description once again.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
dbplyr also translates other verbs like distinct()
, slice_*()
, and intersect()
, and a growing selection of tidyr functions like pivot_longer()
and pivot_wider()
. The easiest way to see the full set of what’s currently available is to visit the dbplyr website.
This section of the chapter is too difficult to capture within the structure of a tutorial. Please just read it.
This tutorial covered Chapter 21: Databases from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to use the DBI and dbplyr packages to connect to databases and execute SQL.
For further reading consider Writing SQL with dbplyr, Practical SQL: A Beginner's Guide to Storytelling with Data, and SQL For Data Scientists.