knitr::opts_chunk$set(collapse=TRUE, comment="#>") options(dplyr.print_min=5L, dplyr.print_max=5L) cc <- rxSetComputeContext("local") db <- DBI::dbConnect(odbc::odbc(), .connection_string=connStr) if(DBI::dbExistsTable(db, "flights")) DBI::dbRemoveTable(db, "flights") if(DBI::dbExistsTable(db, "flightsQry")) DBI::dbRemoveTable(db, "flightsQry")
One of the key strengths of dplyr is its ability to interoperate with SQL databases. Given a database table as input, dplyr can translate the verbs in a pipeline into a SQL query which is then execute in the database. For large tables, this can often be much more efficient than importing the data and running them locally.
As of version 1.0, dplyrXdf has the ability to leverage this ability with ODBC data sources. If the input to a pipeline is an RxOdbcData
, RxTeradata
or RxSqlServerData
source, dplyrXdf will convert it to a dplyr tbl. The subsequent pipeline will then be run in the database, rather than the data being imported to the local R session. To demonstrate how this works, we'll take the flights dataset and turn it into a table in a SQL Server database, then run a simple pipeline against it.
The verb to copy an RxSqlServerData
source into a table is copy_to
. RevoScaleR doesn't have an exact analogue to the "src" concept in dplyr, so copy_to
works slightly differently in dplyrXdf. To copy a dataset to a SQL database, the first argument to the function should be a data source that references the destination table:
library(dplyrXdf) # also loads dplyr library(nycflights13) # copy the flights dataset to SQL Server flightsSql <- RxSqlServerData("flights", connectionString=connStr) flightsHd <- copy_to(flightsSql, flights)
Once this is done, the rest of the pipeline looks the same as any other dplyr pipeline:
flightsQry <- flightsSql %>% filter(month > 6) %>% group_by(carrier) %>% summarise(avg_delay=mean(arr_delay)) flightsQry
The header in the output tell us that the source is Microsoft SQL Server 2016, with the name of the server and database appended. This is an indication that, rather than the pipeline running inside the R session on the client, it has been converted into a SQL query to be executed on the server. No data is transferred back to the client, except for a small sample used in the printout.
To execute the pipeline and save the output in the database, use the compute
verb. By default, this will create a temporary table (with a random name prefixed with ##
). To create a persistent table, set the temporary
argument to FALSE
.
compute(flightsQry) compute(flightsQry, name="flightsQry", temporary=FALSE)
Going the other way is also possible: starting from a database table, dplyrXdf can import the data to Xdf format. This is done via compute
if you want to create a temporary Xdf file (which will be deleted after you quit R), or as_xdf
if you want a permanent one.
# import a table to the local machine flightsXdf <- as_xdf(flightsSql) flightsXdf %>% filter(month > 6) %>% group_by(carrier) %>% summarise(avg_delay=mean(arr_delay)) %>% head()
While dplyrXdf will do this with any RxOdbcData
, RxTeradata
or RxSqlServerData
source, the ability to run a pipeline in-database depends on the availability of a suitable dplyr backend for the DBMS in question. There are backends for many popular commercial and open-source DBMSes, including SQL Server, PostgreSQL and Apache Hive. A Teradata backend is not yet available, but is in development at the time of writing (September 2017).
It should also be noted that an in-database pipeline is handled by dplyr, not by dplyrXdf. This means that RevoScaleR-specific features, such as the use of a transformFunc
to execute arbitrary R code in the database, will not be available.
For more information on how dplyr executes pipelines against database sources, see the database vignette on the Tidyverse website. Using this functionality does require you to install a few additional packages, namely odbc and dbplyr (and their dependencies).
delete_xdf(flightsXdf) DBI::dbRemoveTable(db, "flightsQry") DBI::dbRemoveTable(db, "flights") DBI::dbDisconnect(db) rxSetComputeContext(cc)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.