Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python). Let's see what rquery is good at, and what new features are making rquery better.
The idea is: the query is a first-class citizen that we can use to design and optimize queries prior to translating them into a data transform action via data.table, SQL, Pandas, or other realizations.
For quite a while rquery has had query-narrowing. Columns that are not used in the calculation are eliminated early. Here is an example.
library(rquery)
ops <- mk_td(
"d",
c("col1", "col2", "col3")) %.>%
extend(., sum23 := col2 + col3) %.>%
select_columns(., 'sum23')
cat(format(ops))
## mk_td("d", c(
## "col1",
## "col2",
## "col3")) %.>%
## extend(.,
## sum23 := col2 + col3) %.>%
## select_columns(.,
## c('sum23'))
The above query (or operator DAG) represents working with a table that has columns col1, col2, col3. The example specifies adding a new derived column named sum23 and then limiting down to only this new column. We’ve tried to use operator names that evoke the operator names used by Codd.
An important point is: the query is bound to a description of a data frame (or a schema), not bound to any one data frame. Thus we can re-use the query on new data.
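To sketch this re-use (assuming the rqdatatable package is available as an executor), the same ops object can be applied to any data frame that supplies the declared columns:

```r
library(rquery)
library(rqdatatable)  # registers a data.table-based executor for rquery pipelines

# the query from above: bound only to a table description, not to data
ops <- mk_td("d", c("col1", "col2", "col3")) %.>%
  extend(., sum23 := col2 + col3) %.>%
  select_columns(., 'sum23')

# re-use the same query on two different data frames
data.frame(col1 = 1, col2 = 2, col3 = 3) %.>% ops    # sum23 = 5
data.frame(col1 = 0, col2 = 10, col3 = 20) %.>% ops  # sum23 = 30
```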
The record-keeping in the query knows that only columns col2 and col3 are used.
columns_used(ops)
## $d
## [1] "col2" "col3"
This allows “query narrowing”, where the unused columns are not specified in intermediate queries. This is easiest to see if we convert the query to SQL.
ops %.>%
to_sql(
.,
rquery::rquery_default_db_info()) %.>%
cat(.)
## SELECT
## "sum23"
## FROM (
## SELECT
## "col2" + "col3" AS "sum23"
## FROM (
## SELECT
## "col2",
## "col3"
## FROM
## "d"
## ) tsql_76973382323412881950_0000000000
## ) tsql_76973382323412881950_0000000001
Notice col1 is never referred to. This can be handy when working with tables with hundreds of columns.
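As a small sketch of the wide-table case (using a hypothetical 100-column table), columns_used() reports only the two columns the calculation actually touches, no matter how wide the declared table is:

```r
library(rquery)

# declare a hypothetical wide table with columns col1 ... col100
wide_ops <- mk_td("wide_table", paste0("col", 1:100)) %.>%
  extend(., sum23 := col2 + col3) %.>%
  select_columns(., 'sum23')

# only the columns the calculation needs are carried forward
columns_used(wide_ops)
## $wide_table
## [1] "col2" "col3"
```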
And, using rqdatatable, we can use data.table as another data action implementation.
library(rqdatatable)
data.frame(col1 = 1, col2 = 2, col3 = 3) %.>%
ops %.>%
knitr::kable(.)
| sum23 |
| ----: |
|     5 |
rquery now also has query-shortening. Some dead values can be eliminated during query construction, before any calculations are attempted.
ops <- mk_td(
"example_table",
c("col1", "col2", "col3")) %.>%
extend(., sum23 := col2 + col3) %.>%
extend(., x := 1) %.>%
extend(., x := 2) %.>%
extend(., x := 3) %.>%
extend(., x := 4) %.>%
extend(., x := 5) %.>%
select_columns(., c('x', 'sum23'))
cat(format(ops))
## mk_td("example_table", c(
## "col1",
## "col2",
## "col3")) %.>%
## extend(.,
## sum23 := col2 + col3,
## x := 5) %.>%
## select_columns(.,
## c('x', 'sum23'))
Obviously nobody would construct such a bad query, but it is nice that some of the “ick” is optimized automatically.
Both of the above optimizations are deliberately conservative. They are implemented to be correct (not give incorrect results), but are not necessarily super aggressive in eliminating all redundancy.
Both of these optimizations are easy due to the use of category-theoretic ideas in the design of the rquery and data_algebra packages (I am working on some notes on this here; they are a bit long and technical).
The short form is: the rquery/data_algebra operators have an interpretation in a nice category over table schemas. The schema objects give us pre-condition and post-condition record keeping, which enforces correct query composition and query narrowing. The generality of arrow composition gives us the freedom to place optimizations in the composition step. This gives us more options than systems that are restricted to list-concatenation or function composition/abstraction as their notion of composition. It also lets us enforce and check conditions early.
rquery performs most of its checking during query construction. This can catch errors early and save a lot of development time.
ops_bad <- mk_td(
"example_table",
c("col1", "col2", "col3")) %.>%
extend(., sum23 := col2_MISSPELLED + col3)
## Error in check_have_cols(src_columns, required_cols, "rquery::extend"): rquery::extend unknown columns col2_MISSPELLED
Notice an error was raised during query construction. We didn’t have to wait to supply data or translate to SQL.
Let’s take a look at the SQL translation of our final example query.
ops %.>%
to_sql(
.,
rquery::rquery_default_db_info()) %.>%
cat(.)
## SELECT
## "x",
## "sum23"
## FROM (
## SELECT
## "col2" + "col3" AS "sum23",
## 5 AS "x"
## FROM (
## SELECT
## "col2",
## "col3"
## FROM
## "example_table"
## ) tsql_28722584463189084716_0000000000
## ) tsql_28722584463189084716_0000000001
There are some more things we would wish optimized away, such as both the inner and outer select. But the SQL is reasonably short, due to the intermediate stages that were optimized out of the original query. Later versions of the system will pick these up, and likely these are also easy for downstream SQL optimizers to eliminate.
An important point: optimizations performed during query construction are shared among all back-ends: data.table, SQL, and Pandas.
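As a sketch of this sharing: the shortened operator tree from the query-shortening example above can be executed directly by rqdatatable (the data.table back-end) or translated to SQL, and the redundant extend steps are already gone in both cases, because the elimination happened at construction time:

```r
library(rquery)
library(rqdatatable)  # data.table-based executor

# re-build the query-shortening example; the dead x-assignments
# are collapsed during construction, before any back-end is chosen
ops <- mk_td("example_table", c("col1", "col2", "col3")) %.>%
  extend(., sum23 := col2 + col3) %.>%
  extend(., x := 1) %.>%
  extend(., x := 5) %.>%
  select_columns(., c('x', 'sum23'))

# the data.table back-end executes the already-shortened query
data.frame(col1 = 1, col2 = 2, col3 = 3) %.>% ops

# the SQL back-end generates from the same shortened operator tree
cat(to_sql(ops, rquery::rquery_default_db_info()))
```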
Please consider giving rquery a try.
We often get asked “why bother with rquery, given dplyr was first.” I’d say: if you are happy with dplyr, don’t worry about rquery. Though I would add: you really owe it to yourself to check out data.table; it is by far the best data manipulation system in R.
However, let’s take a look at how dbplyr generates a similar SQL query.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
packageVersion("dplyr")
## [1] '0.8.3'
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
packageVersion("dbplyr")
## [1] '1.4.2'
con <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
copy_to(
con,
data.frame(col1 = 1, col2 = 2, col3 = 3),
name = 'd')
tbl(con, 'd') %>%
mutate(sum23 := col2 + col3) %>%
mutate(x := 1) %>%
mutate(x := 2) %>%
mutate(x := 3) %>%
mutate(x := 4) %>%
mutate(x := 5) %>%
select(x, sum23) %>%
show_query()
## <SQL>
## SELECT 5.0 AS `x`, `sum23`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23`
## FROM `d`)))))
The dplyr SQL query appears to have neither query narrowing nor query shortening. Again, a downstream SQL optimizer may be able to eliminate these steps (or it may not). However, it would also be desirable to have these sorts of eliminations available when using data.table through dtplyr.
Also, dbplyr does not seem to catch errors until compute() or print() are called.
tbl(con, 'd') %>%
mutate(sum23 := col2_MISSPELLED + col3) %>%
show_query()
## <SQL>
## SELECT `col1`, `col2`, `col3`, `col2_MISSPELLED` + `col3` AS `sum23`
## FROM `d`
The above SQL refers to a non-existent column col2_MISSPELLED. The query construction and SQL generation steps did not signal any error. Depending on how many queries and steps come before this, this could delay finding the mistake by quite a while (especially when using a high-latency SQL engine such as Apache Spark).
DBI::dbDisconnect(con)