rquery
is a
query generator for R
. It is based on Edgar F. Codd's relational
algebra plus experience using SQL
and dplyr
at big data
scale. The design represents an attempt to make SQL
more teachable by
denoting composition by a sequential pipeline notation instead of nested
queries or functions. The implementation delivers reliable high
performance data processing on large data systems such as Spark
and databases.
Package features include: data processing trees
or pipelines as observable objects (able to report both columns
produced and columns used), optimized SQL
generation as an explicit
user visible modeling step, convenience methods for applying query
trees to in-memory data.frame
s, and low direct package dependencies.
SQL
SQL
is a very powerful data processing (or data engineering) grammar.
Data scientists are well advised to learn to work with SQL
.
An inessential difficulty in using SQL
is SQL
represents
composition of operations by nesting, which can rapidly become
confusing and illegible. This can be overcome by using a
query composer such as rquery
(some more query composers
are listed here).
Let's set up our environment so we can work with examples.
run_vignette <- requireNamespace("DBI", quietly = TRUE) && requireNamespace("RSQLite", quietly = TRUE)
library("rquery") library("wrapr") # example database connection db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") RSQLite::initExtension(db) dbopts <- rq_connection_tests(db) print(dbopts) options(dbopts) # copy in example data rq_copy_to( db, 'd', data.frame(v = c(1, -5, 3)), temporary = FALSE, overwrite = TRUE) # produce a hande to existing table d <- db_td(db, "d")
d
is a "table description" which is just
the name of a table and the names of expected columns.
d
does not store data or a database reference (making it
safe to serialize/de-serialize). All rquery
operation trees or
pipelines must start either with a table description or a data.frame
.
We will discuss table
descriptions later.
Note: in examples we use rq_copy_to()
to create data. This is only for the purpose of having
easy portable examples. With big data the data is usually already in the remote database or
Spark system. The task is almost always to connect and work with this pre-existing remote data
and the method to do this is db_td()
,
which builds a reference to a remote table given the table name. The suggested pattern for working with
remote tables is to get inputs via db_td()
and land remote results with materialze()
.
To work with local data one can copy data from memory to the database with rq_copy_to()
and bring back results with execute()
(though be aware
operation on remote non-memory data is rquery
's primary intent).
For our first example we will introduce a new column and perform
a calculation using this column. This is achieved
in SQL
by writing code in one of two styles: defining the
first new column
twice (once to land the value and once to use), or sequencing
two queries by nesting. We will demonstrate both methods.
The define the column twice solution looks like the following.
DBI::dbGetQuery(db, " SELECT *, ABS(v) AS absv, ABS(v) - v AS delta FROM d ")
In SQL
the column absv
is not available for calculation in the
same query that it is produced.
The nested method looks like the following, we produce the column
absv
in one query and then wrap that in another query
to later use the column. For expressions longer than ABS(v)
this
is the preferred solution (until one moves to something like common
table expressions).
DBI::dbGetQuery(db, " SELECT *, absv - v AS delta FROM ( SELECT *, ABS(v) AS absv FROM d ) subtab ")
sql_node()
Using rquery
we can write the SQL
composition using pipe notation
(where composition is written as x %.>% f %.>% g
instead of
g(f(x))
). We are going to use wrapr
dot-pipe instead of
the magrittr
pipe to pick up
a neat feature we will use later (all other examples
will work with the magrittr
pipe). The "%.>%
" glyph can
be bound to a keyboard shortcut for convenience.
The rquery
realization of the above calculation is as follows:
op_tree <- d %.>% sql_node(., "absv" := "ABS(v)") %.>% sql_node(., "delta" := "absv - v") execute(db, op_tree)
The above is what we call "piped SQL
" and represents a major convenience for
users as the details of how to compose the statements are left to the package.
The sql_node()
is a very powerful node. We will use it in our first
few examples and move onto more convenient higher level relational nodes.
We can view the SQL
translation of the operations tree as follows:
cat(to_sql(op_tree, db))
Notice the above translations did not add identifier quotes to our
use of "v
" in "ABS(v)
". This is because the SQL
expression
is not parsed in R
. If we want to identify terms as variables we
can wrap them with as.name()
or quote()
to get the quoting (and other variable
oriented features). The extra SELECT
step to pull data from the inner
table is used by rquery
for important column narrowing steps, and
can actually improve query performance.
op_tree <- d %.>% sql_node(., "absv" := list(list("ABS(", quote(v), ")"))) %.>% sql_node(., "delta" := list(list(quote(absv),"-", quote(v)))) cat(to_sql(op_tree, db))
The list(list())
notation is how we say in R
that we have a
single element list (i.e. one expression) that is built up as a list
of terms. The marking notation is cumbersome, but is not needed when
we move on to relation nodes, which are parsed in R
and can spot
identifiers without additional help.
op_tree
itself is a an object with its own presentation format:
cat(format(op_tree))
The op_tree
supplies an number of important summaries about the proposed query:
column_names(op_tree) tables_used(op_tree) columns_used(op_tree)
We can add nodes to an op_tree
to build larger operator trees (or pipelines).
op_tree2 <- op_tree %.>% sql_node(., "prod" := "absv * delta") cat(format(op_tree2))
However one does not have to use the string notation, wrapr
supplies the helper functions qe()
(quote expression), qae()
(quote assignment expressions), and qc()
(quoting concatenate).
op_tree3 <- op_tree %.>% sql_node(., qae(prod = absv * delta)) cat(format(op_tree3))
And, the op_tree
record keeping can be used to catch potential errors
early in pipeline construction. For example if we try to refer to a non-existent
variable when adding an operator we get an thrown exception (note: a sql_node()
being added must have its variables marked as above for pre-checking to occur,
relational nodes will get this checking automatically).
op_tree4 <- op_tree %.>% sql_node(., "z" := list(list("1 + ", quote(z))))
However, early error checking is not currently available with the qae()
notation, which parts of the
expression are values (versus operators or function names) is not marked in the input.
op_tree4 <- op_tree %.>% sql_node(., qae(z = 1 + z))
`rquery
We can express non-trivial operations in sql_node()
s. For example
we can build a node the calculates for each row how many columns
contain NA
/NULL
as is demonstrated here.
# load up example data d2 <- rq_copy_to( db, 'd2', data.frame(v1 = c(1, 2, NA, 3), v2 = c(NA, "b", NA, "c"), v3 = c(NA, NA, 7, 8), stringsAsFactors = FALSE)) # look at table execute(db, d2) # get list of columns vars <- column_names(d2) print(vars) # build a NA/NULLs per-row counting expression. # names are "quoted" by wrapping them with as.name(). # constants can be quoted by an additional list wrapping. expr <- lapply(vars, function(vi) { list("+ (CASE WHEN (", as.name(vi), "IS NULL ) THEN 1.0 ELSE 0.0 END)") }) expr <- unlist(expr, recursive = FALSE) expr <- c(list(0.0), expr) cat(paste(unlist(expr), collapse = " ")) # instantiate the operator node op_tree_count_null <- d2 %.>% sql_node(., "num_missing" := list(expr)) cat(format(op_tree_count_null)) # examine produced SQL sql <- to_sql(op_tree_count_null, db) cat(sql) # execute execute(db, op_tree_count_null)
And, as this is an important capability, this exact functionality
is wrapped in count_null_cols()
.
# whole process wrapped in convenience node d2 %.>% count_null_cols(., vars, "nnull") %.>% execute(db, .)
There are some helper methods to apply a parameterized SQL
expression to a set
of columns.
# vector of columns we want to work on colset <- qc(v1, v2, v3) # build new names we want as results colterms <- paste0(colset, "_isNA") := colset map_to_char(colterms) # build an apply expression to set of columns query s_tree <- d2 %.>% sql_expr_set(., colterms, "CASE WHEN . IS NULL THEN 1 ELSE 0 END") cat(to_sql(s_tree, db)) execute(db, s_tree)
SQL
firstrquery
is a "SQL
first" system. It is designed to create SQL
queries and dispatch them to remote systems (SQLite
, Spark
, PostgreSQL
,
Redshift
, and other databases) for execution. The execute()
method
can be used with big data by adding a table_name
argument (or also by using the
materialize()
method) to land results in a remote table instead of pulling
them back to R
.
The mantra of SQL
-first is data starts in the database, and stays in the database (i.e., it is too large to
depend on round-tripping through R
). Another important SQL
-first package is cdata
which provides pure SQL
based implementations of operators that generalize pivot/un-pivot, cast/melt, or spread/gather.
The better the database implementation the better rquery
will be, both in terms
of performance and in terms of function (such as the availability of SQL
window functions).
As a convenience rquery
can work with in-memory data.frame
s by sending them to
the SQL
service provider. This provider defaults to RSQlite
or can be
set by setting the global option rquery.rquery_db_executor
. We demonstrate
this below.
old_o <- options(list("rquery.rquery_db_executor" = list(db = db))) data.frame(v = -2:2) %.>% execute(., op_tree)
When using the wrapr
dot pipe the above can be abbreviated as:
data.frame(v = -2:2) %.>% op_tree
The above calculation is managed by wrapr
dot pipe
S3
wrapr_function
extensions.
rquery
operators can be used directly (without any table description nodes)
when working with in-memory data.frame
s.
data.frame(x = 5) %.>% sql_node(., "z" := "sqrt(x)")
The above calculation is triggered by S3
override
of any of print()
, as.data.frame()
and head()
. Remote tables need
an execute()
or materialize()
step to specify the database connection.
rquery
table descriptions are simple objects that store only
the name of a table and expected columns. Any local data or database
table that has at least the set of columns named in the table description
can be used in a given rquery
pipeline.
The table description "d
" we have been using in examples was produced
as a result of moving data to a database by
rq_copy_to()
.
However we can also create a description of an existing database table with
db_td()
or
even build a description by hand with
mk_td()
.
Also one can build descriptions of local or in-memory data.frame
s with
local_td()
.
Using wrapr::qae()
, wrapr::qe()
, the bquote()-.()
notation, and a new rquery-.[]
notation
can make working with [sql_node()](https://winvector.github.io/rquery/reference/sql_node.html)
s much easier (though for many applications I prefer to work
with the relational nodes such as extend()
, project()
, and so on).
The ideas include:
qe()
and qae()
are quoting operators, they capture the text written in them..()
is R
's bquote()
notation for substitution (turning off quoting)..[]
is a new notation meaning: "the thing inside the .[]
is supposed to be a column name" (this is the role quote()
or as.name()
/as.symbol()
were serving in the earlier list based expression examples).This allows for the following.
library("rquery") date_cutoff <- '2017-04-02' td <- mk_td("df", c("cust", "trans_date", "sales")) # misspelling not caught (argh!) tryCatch({ ops <- td %.>% select_rows_se( ., qe(trans_date <= str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>% sql_node( ., qae(max_date = max(trans_datez)), # trans_date misspelled mods = "GROUP BY cust", orig_columns = F) }, error = function(e) { print(e) }) # misspelling caught tryCatch({ ops <- td %.>% select_rows_se( ., qe(trans_date <= str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>% sql_node( ., qae(max_date = max(.[trans_datez])), # trans_date misspelled mods = "GROUP BY cust", orig_columns = F) }, error = function(e) { print(e) }) ops <- td %.>% select_rows_se( ., qe(trans_date <= str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>% sql_node( ., qae(max_date = max(.[trans_date])), mods = "GROUP BY cust", orig_columns = F) cat(to_sql(ops, rquery::rquery_default_db_info()))
The .[]
-notation is just signalling to rquery
which symbols are column names (without
requiring rquery
to fully parse the SQL
fragments). The rquery
relational nodes
get this sort of checking without any additional notation as they do fully parse the R
expressions prior to any SQL
translation.
We can combine .()
and .[]
for even more powerful expressions such as the following.
library("rquery") date_cutoff <- '2017-04-02' td <- mk_td("df", c("cust", "trans_date", "sales")) COL_TO_MAX = as.name("trans_date") NEW_COL = paste0("max_", COL_TO_MAX) GROUP_COL = "cust" ops <- td %.>% select_rows_se( ., qe(trans_date <= str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>% sql_node( ., qae(.(NEW_COL) := max(.[.(COL_TO_MAX)])), mods = paste("GROUP BY", GROUP_COL), orig_columns = F) cat(to_sql(ops, rquery::rquery_default_db_info()))
rquery
is new package, but it is already proving to be correct (avoiding known data processing issues) and performant.
For working with R
at a big data scale (say using PostgreSQL
or Spark
)
rquery
is the right specialized tool for specifying data manipulation.
For deeper dives into specific topics, please see also:
rquery README
data.table based
implementationoptions(old_o) DBI::dbDisconnect(db)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.