old_opts <- options(width = 90)
Query database tables and views over a DBI connection using data.table
's
[i, j, by]
syntax, attach database schemas to the search path, and
programmatically load database catalogs.
This vignette assumes that you are already fluent with data.table
's syntax and
that you know how to open a database connection using the DBI
package.
If you haven't alreay done so, install the dbi.table
package from CRAN.
install.packages("dbi.table")
This section uses the Chinook Database (included in the package) to demonstrate how to
dbi.table
using the dbi.table
function,dbi.table
using data.table
's [i, j, by]
syntax,dbi.attach
function, anddbi.catalog
function.The function chinook.duckdb
returns an open duckdb
(DBI) connection to
the sample Chinook Database. This connection is a typical DBI connection as
returned by DBI::dbConnect
that can be used as the conn
argument in DBI
package functions. Let's get started by loading the package and opening the
connection.
library(data.table) #needed for as.data.table library(dbi.table) chinook <- chinook.duckdb()
dbi.table
The dbi.table
function takes 2 arguments: a DBI connection, and an Id
indentifying a database table or view.
my_album <- dbi.table(chinook, DBI::Id("Album"))
The object my_album
is a dbi.table
, a data structure that represents an SQL
query (which we refer to as the dbi.table
's underlying SQL query). The
print
method displays a preview of the underlying SQL query.
#print(my_album)
my_album
The preview has a format similar to a data.table
with two notable exceptions.
The row numbers are omitted. SQL queries do not necessarily return the
result set in a reliable order (even on subsequent evaluations of the same
query), and dbi.table
does not make any extra effort to order the rows by
default.
Only the first 5 rows of the dbi.table
are displayed (data.table
displays
the first 5 and the last 5). Again, since the result set does not have a
reliable order, it is not possible to say which rows are the first and which
are the last. The rows displayed are the first 5 returned by the RDBMS.
The function as.data.table
executes the dbi.table
's underlying SQL query and
retrieves the result set as a data.table
. Pro tip: calling the extract method
([]
) with no arguments is a shortcut for as.data.table
.
#as.data.table(my_album)
my_album[]
Since the result set is instantiated locally as a data.table
, the row numbers
and the last 5 rows are displayed.
Note: by default, as.data.table
(and its []
shortcut) fetches a
maximum of 10,000 rows. To override this limit, either set the option
dbitable.max.fetch
or call as.data.table
and provide the n
argument
(e.g., n = -1
to fetch the entire result set).
Lastly, the csql
function displays the dbi.table
's underlying SQL query.
csql(my_album)
The underlying SQL query of a newly created dbi.table
selects all the columns
from the database table.
dbi.table
using data.table
SyntaxThis extract from data.table
's Introduction to data.table vignette pretty much
sums up what dbi.table
does.
DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by
In general, dbi.table
should be able to handle basic data.table
syntax. SQL
translation is done by dbplyr::translate_sql_
which works with a variety
of R functions. However, complicated expressions (e.g., custom functions in j
,
nested aggregation functions, most special symbols) do not work.
Best practice is to use dbi.table
to subset and wrangle on the database, then
data.table
to fine tune locally.
When i
is a logical expression of the variables in the dbi.table
then it
becomes the WHERE clause in the dbi.table
's underlying SQL query.
csql(my_album[AlbumId == ArtistId + 1])
my_album[AlbumId == ArtistId + 1]
When i
is a call to order
(or chorder
), it becomes the ORDER BY clause
in the dbi.table
's underlying SQL query.
csql(my_album[order(nchar(Title), -AlbumId)])
my_album[order(nchar(Title), -AlbumId)]
When j
is a list of expressions of the variables in the dbi.table
, then j
becomes the SELECT clause in the dbi.table
's underlying SQL query.
my_album[, .(AlbumId, Title)]
When by
is a list of expressions of the variables in the dbi.table
, then
by
becomes the GROUP BY clause in the dbi.table
's underlying SQL query.
csql(my_album[, .("# of Albums" = .N), .(ArtistId)])
my_album[, .("# of Albums" = .N), .(ArtistId)]
The dbi.attach
function attaches a DBI connection to the search path. This
means that dbi.attach
creates a dbi.table
for each table and view in the
schema associated with the DBI connection, then assigns these dbi.table
s to an
environment on the search path.
dbi.attach(chinook)
A quick look at the search path shows the database attached in position 2.
head(search(), 3)
The tables and views in the database schema are queriable as dbi.table
s in the
attached environment r search()[[2L]]
.
ls("duckdb:chinook_duckdb")
Note: Attaching a DBI connection is intended for an interactive exploratory analysis of a database (schema). For programatic use cases, see the Load a Database Catalog section.
dbi.table
sTwo dbi.table
s that share the same connection (for example, all the
dbi.table
s in an attached schema share the same connection) can be merged.
Merging two dbi.table
s results in an SQL join that describes the same result
set as the associated data.table
merge. That is,
merge(as.data.table(Album), as.data.table(Artist), by = "ArtistId")
and
as.data.table(merge(Album, Artist, by = "ArtistId"))
are the same data.table
up to row order.
csql(merge(Album, Artist, by = "ArtistId"))
If dbi.table
can determine the foreign key constraints between x
and y
,
and if there is only one foreign key (either x
referring to y
, or y
referring to x
), then this foreign key is used as the default by
when
merging. Otherwise, dbi.table
uses the same algorithm to determine the default
by
columns as data.table
.
csql(merge(Customer, Employee))
In this case, the Customer
table has a foreign key SupportRepId
that refers
to the Employee
table's primary key EmployeeId
.
When the y
argument is omitted, dbi.table
's merge
uses the foreign key
constraints that x
references to determine the y
(or y
s) to merge with.
When y
is missing, merge.dbi.table
merges with the tables referenced by
x
's foreign keys (one merge for each foreign key).
csql(merge(Track))
When the optional recursive
argument is TRUE
, merge.dbi.table
calls
merge
with recuresive = TRUE
on each referenced table and merges x
with the result. In this example, Track
has a foreign key that references
Album
and Album
has a foreign key that references Artist
. The columns
from Artist
are included in the result set when merge
is called with
recuresive = TRUE
.
csql(merge(Track, recursive = TRUE))
As a best practice for programatic use, it is better to load the catalog in order to avoid modifying the search path.
catalog <- dbi.catalog(chinook)
Printing the catalog lists its schemas.
catalog
Individual tables can be accessed using catalog$schema$table
syntax.
catalog$main$Album
This section provides a brief explanation of what the dbi.table
package is
trying to do.
Suppose that x
is a dbi.table
and that e
is an expression involving x
that returns either a dbi.table
or a data.table
.
x <- dbi.table(chinook, DBI::Id("Album")) e <- quote(x[, .("# of Albums" = .N), .(ArtistId)])
Since dbi.table
's syntax is a subset of data.table
's syntax, if e
can be
evaluated successfully (i.e., eval(e)
does not throw an error), then e
should also be able to be successfully evaluated when x
is a data.table
.
There are thus 2 paths to the final data.table
result:
evaluate e
then coerce the result using as.data.table
, or
coerce x
to a data.table
then evaluate e
.
Path 2 is referred to as the reference implementation and describes the
correct answer: the reference result set. The design goal of dbi.table
is
to get the same result set as the reference result set, up to row order.
result_set <- as.data.table(eval(e)) x <- as.data.table(x) reference_result_set <- eval(e) all.equal(reference_result_set, result_set, ignore.row.order = TRUE)
The dbi.table
package includes the function reference.test
that compares the
result set to the reference result set in the more general case where expr
(the function's first argument) is an expression involving 1 or more
dbi.table
s.
x <- dbi.table(chinook, DBI::Id("Album")) reference.test({ x[, .("# of Albums" = .N), .(ArtistId)] })
This function is used extensively in dbi.table
's unit/regression tests.
We used the chinook.duckdb
function to open a DBI connection at the beginning
of this vignette and now it is up to us to close it.
DBI::dbDisconnect(chinook)
However, this leaves our R session in a wonky state. The environment
"duckdb:chinook_duckdb" is still attached and there are several dbi.table
s in
the global environment - all of these dbi.table
s are associated with an
invalid DBI connection.
#A dbi.table in the duckdb:chinook_duckdb environment
Genre
The R objects associated with our now-closed DBI connection need to be cleaned up manually (or you could just restart R).
detach("duckdb:chinook_duckdb") rm(catalog, my_album, x)
Alternatively, when using either dbi.attach
or dbi.catalog
, the first
arguement can be a zero-argument function that returns an open DBI connection.
When dbi.table
uses a function to open the DBI connection, then that
connection belongs to dbi.table
and dbi.table
will take care of closing it
when it is no longer needed.
dbi.attach(chinook.duckdb)
When dbi.table
is managing the connection, then all the user has to do is
detach (or delete if a catalog). The DBI connection will be closed when the
object is garbage collected.
detach("duckdb:chinook_duckdb")
Further, when dbi.table
owns the connection, it is able to reconnect in the
event that the connection unexpectedly drops.
options(old_opts)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.