knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = RPresto::presto_has_default() ) options(pillar.max_dec_width=20)
The Common Table Expressions or CTE’s for short are used within SQL databases to simplify complex joins and subqueries. You can think of them as named subqueries that can be referenced in other parts of the query, including other CTEs (i.e. recursive CTEs).
In Presto, CTEs take the form of WITH clauses. There can be multiple
named CTE within the single WITH clause. They need to be defined before the
main SELECT query.
At the time of this writing (late 2022), DBI APIs don't have an official way
of implementing CTEs yet. dbplyr just begins to introduce CTEs into its APIs
as an experimental feature. So our implementation of CTE support in RPresto is
very much avant-garde and should be used with discretion.
We attach CTEs to the PrestoConnection so that they are available across the
queries executed via the connection.
library(RPresto) library(DBI) library(dplyr) library(dbplyr)
You can check your RPresto version by running the packageVersion() function.
You need version 1.4.0 or later to use the CTE feature.
packageVersion("RPresto")
PrestoConnectionYou can define and attach CTEs while creating a PrestoConnection. Here we
assume that the user already have a Presto server with a memory connector set
up. If you don't have such a server set up, refer to the
Presto documentation for instructions if you want to follow along.
con <- DBI::dbConnect( drv = RPresto::Presto(), host = "http://localhost", port = 8080, user = Sys.getenv("USER"), catalog = "memory", schema = "default", # Define a testing CTE using dummy VALUES ctes = list( "dummy_values" = "SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c') ) AS t (id, name)" ) )
Now dummy_values is not an existing permanent table available in the
PrestoConnection. It only exists as a temporary feature for the connection.
db_has_table(con, "dummy_values")
We can read the content of the CTE.
dbReadTable(con, "dummy_values")
We can also execute arbitrary SELECT queries on top of the CTE.
dbGetQuery(con, "SELECT id * 2 AS id_2, name FROM dummy_values")
dplyr backendAnother way of leveraging CTEs in your workflow is to incorporate them into the
dplyr workflow.
# We first copy mtcars to Presto and create a remote table on it tbl.mtcars <- copy_to(con, mtcars, "test_mtcars", overwrite = TRUE) tbl.mtcars %>% colnames()
We call a few dplyr verbs on the remote table to mimic a typical analysis
work flow.
tbl.mtcars.transform <- tbl.mtcars %>% mutate(hp2 = pow(hp, 2)) %>% group_by(cyl) %>% mutate(mean_mpg_by_cyl = mean(mpg, na.rm = TRUE))
We can see the underlying SQL query generated so far.
tbl.mtcars.transform %>% show_query()
For illustration, let's say we filter the same transformed table twice on the
cyl field and UNION ALL them together in the next step.
tbl.mtcars.union <- union( filter(tbl.mtcars.transform, cyl == 4L), filter(tbl.mtcars.transform, cyl == 8L), all = TRUE ) tbl.mtcars.union %>% show_query()
The underlying SQL query of the transformed table (i.e. tbl.mtcars.transform)
has to be replicated twice in this step and thus makes the resulting query long
and repetitive. It offers a prime opportunity to simplify using CTEs.
We can "save" the underlying SQL query of the transformed table into a CTE and
use that in the union step by calling the compute() function with
cte = TRUE.
tbl.mtcars.transform <- tbl.mtcars.transform %>% compute(name = "mtcars_transform", cte = TRUE) tbl.mtcars.transform %>% show_query()
Here the content of tbl.mtcars.transform hasn't changed at all and we can
use the remote table as it is just like before. The only change underneath is
that the underlying logic is now captured and stored in a CTE. You can almost
think of it as saving tbl.mtcars.transform as a temporary table named
mtcars_transform and pointing the new remote table on that temporary table.
The difference is that no query has actually been executed yet.
Now we've leveraged CTE, the query for the union step looks more clean and readable.
tbl.mtcars.union <- union( filter(tbl.mtcars.transform, cyl == 4L), filter(tbl.mtcars.transform, cyl == 8L), all = TRUE ) tbl.mtcars.union %>% show_query()
We can even create nested CTEs that depend on other CTEs (Presto calls it
chained CTEs). Below we call compute() on tbl.mtcars.union which already
utilizies the mtcars_transform CTE.
tbl.mtcars.union <- tbl.mtcars.union %>% compute(name = "mtcars_union", cte = TRUE) tbl.mtcars.union %>% show_query()
Now the underlying query of the previous tbl.mtcars.union is saved into the
mtcars_union CTE which in turn depends on the mtcars_transform CTE.
# Clean up dbRemoveTable(con, "test_mtcars")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.