knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
RPresto is a DBI-based adapter for the open source distributed SQL query engine Presto for running interactive analytic queries.
RPresto is both on CRAN and github.
For the CRAN version, you can use
install.packages("RPresto")
You can install the development version of RPresto from GitHub with:
# install.packages("devtools") devtools::install_github("prestodb/RPresto")
The following examples assume that you have a in-memory Presto server set up locally. It's the simplest server which stores all data and metadata in RAM on workers and both are discarded when Presto restarts. If you don't have one set up, please refer to the memory connector documentation.
# Load libaries and connect to Presto library(RPresto) library(DBI) con <- DBI::dbConnect( drv = RPresto::Presto(), host = "http://localhost", port = 8080, user = Sys.getenv("USER"), catalog = "memory", schema = "default" )
There are two levels of APIs: DBI
and dplyr
.
DBI
APIsThe easiest and most flexible way of executing a SELECT
query is using a
dbGetQuery()
call. It returns the query result in a tibble
.
DBI::dbGetQuery(con, "SELECT CAST(3.14 AS DOUBLE) AS pi")
dbWriteTable()
can be used to write a small data frame into a Presto
table.
if (DBI::dbExistsTable(con, "iris")) { DBI::dbRemoveTable(con, "iris") }
# Writing iris data frame into Presto DBI::dbWriteTable(con, "iris", iris)
dbExistsTable()
checks if a table exists.
DBI::dbExistsTable(con, "iris")
dbReadTable()
reads the entire table into R. It's essentially a SELECT *
query on the table.
DBI::dbReadTable(con, "iris")
dbRemoveTable()
drops the table from Presto.
DBI::dbRemoveTable(con, "iris")
You can execute a statement and returns the number of rows affected using
dbExecute()
.
if (DBI::dbExistsTable(con, "testing_table")) { DBI::dbRemoveTable(con, "testing_table") }
# Create an empty table using CREATE TABLE DBI::dbExecute( con, "CREATE TABLE testing_table (field1 BIGINT, field2 VARCHAR)" )
dbExecute()
returns the number of rows affected by the statement. Since a
CREATE TABLE
statement creates an empty table, it returns 0.
DBI::dbExecute( con, "INSERT INTO testing_table VALUES (1, 'abc'), (2, 'xyz')" )
Since 2 rows are inserted into the table, it returns 2.
# Check the previous INSERT statment works DBI::dbReadTable(con, "testing_table")
dplyr
APIsWe also include dplyr
database backend integration (which is mainly
implemented using the dbplyr
package).
# Load packages library(dplyr) library(dbplyr)
We can use dplyr::copy_to()
to write a local data frame to a
PrestoConnection
and immediately create a remote table on it.
# Add mtcars to Presto if (DBI::dbExistsTable(con, "mtcars")) { DBI::dbRemoveTable(con, "mtcars") } tbl.mtcars <- dplyr::copy_to(dest = con, df = mtcars, name = "mtcars") # colnames() gives the column names tbl.mtcars %>% colnames()
dplyr::tbl()
also work directly on the PrestoConnection
.
# Treat "iris" in Presto as a remote data source that dplyr can now manipulate if (!DBI::dbExistsTable(con, "iris")) { DBI::dbWriteTable(con, "iris", iris) } tbl.iris <- dplyr::tbl(con, "iris") tbl.iris %>% colnames() # dplyr verbs can be applied onto the remote data source tbl.iris %>% group_by(species) %>% summarize( mean_sepal_length = mean(sepal.length, na.rm = TRUE) ) %>% arrange(species) %>% collect()
BIGINT
handlingRPresto's handling of BIGINT
(i.e. 64-bit integers) is similar to other DBI
packages (e.g. bigrquery
, RPostgres
). We provide a bigint
argument that users can use in multiple interfaces to specify how they want
BIGINT
typed data to be translated into R.
The bigint
argument takes one of the following 4 possible values.
bigint = "integer"
is the default setting. It translates BIGINT
to R's
native integer type (i.e. 32-bit integer). The range of 32-bit integer is
[-2,147,483,648, 2,147,483,647]
which should cover most integer use cases.
In case that you need to represent integer values outside of the 32-bit
integer range, you have 2 options: bigint = "numeric"
which translates the
number into a double
floating-point type in R; and bigint = "integer64"
which packages the number using the bit64::integer64
class. Note that both of
those two approaches actually the same precision-preservation range:
+/-(2^53-1) = +/-9,007,199,254,740,991
, due to the fact that the Presto REST
API uses JSON to encode the number and JSON has a limit at 53 bits (rather than
64 bits).
bigint = "character"
casts the number into a string. This is most useful
when BIGINT
is used to represent an ID rather than a real arithmetic number.
bigint
The DBI interface function dbGetQuery()
is the most fundamental interface
whereby bigint
can be specified. All other interfaces are either built on top
of dbGetQuery()
or only take effect when used with dbGetQuery()
.
# BIGINT within the 32-bit integer range is simply translated into integer DBI::dbGetQuery(con, "SELECT CAST(1 AS BIGINT) AS small_bigint") # BIGINT outside of the 32-bit integer range generates a warning and returns NA # when bigint is not specified DBI::dbGetQuery(con, "SELECT CAST(POW(2, 31) AS BIGINT) AS overflow_bigint") # Using bigint to specify numeric or integer64 translations DBI::dbGetQuery( con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_numeric", bigint = "numeric" ) DBI::dbGetQuery( con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64", bigint = "integer64" )
When used with the dplyr
interface, bigint
can be specified in two places.
bigint
argument to dbConnect()
when creating the
PrestoConnection
. All queries that use the connection later will use the
specified bigint
setting.con.bigint <- DBI::dbConnect( drv = RPresto::Presto(), host = "http://localhost", port = 8080, user = Sys.getenv("USER"), catalog = "memory", schema = "default", # bigint can be specified in dbConnect bigint = "integer64" ) # BIGINT outside of the 32-bit integer range is automatically translated to # integer64, per the connection setting earlier DBI::dbGetQuery( con.bigint, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64" )
bigint
for a particular query when using the
dplyr
interface without affecting other queries, you can pass bigint
to the
collect()
call.tbl.bigint <- dplyr::tbl( con, sql("SELECT CAST(POW(2, 31) AS BIGINT) AS bigint") ) # Default collect() generates a warning and returns NA dplyr::collect(tbl.bigint) # Passing bigint to collect() specifies BIGINT treatment dplyr::collect(tbl.bigint, bigint = "integer64")
To connect to Trino you must set the use.trino.headers
parameter so RPresto
knows to send the correct headers to the server. Otherwise all the same
functionality is supported.
con.trino <- DBI::dbConnect( RPresto::Presto(), use.trino.headers = TRUE, host = "http://localhost", port = 8080, user = Sys.getenv("USER"), schema = "<schema>", catalog = "<catalog>", source = "<source>" )
To pass extraCredentials that gets added to the X-Presto-Extra-Credential
header use the extra.credentials
parameter so RPresto
will add that to the
header while creating the PrestoConnection
.
Set use.trino.headers
if you want to pass extraCredentials through the
X-Trino-Extra-Credential
header.
con <- DBI::dbConnect( RPresto::Presto(), host = "http://localhost", port = 7777, user = Sys.getenv("USER"), schema = "<schema>", catalog = "<catalog>", source = "<source>", extra.credentials = "test.token.foo=bar", )
Assuming that you have Kerberos already set up with the Presto coordinator, you
can use RPresto
with Kerberos by passing a Kerberos config header to the
request.config
argument of dbConnect()
. We provide a convenient wrapper
kerberos_configs()
to further simplify the workflow.
con <- DBI::dbConnect( RPresto::Presto(), host = "http://localhost", port = 7777, user = Sys.getenv("USER"), schema = "<schema>", catalog = "<catalog>", source = "<source>", request.config = kerberos_configs() )
Presto exposes its interface via a REST based API[^1]. We utilize the
httr package to make the API calls and
use jsonlite to reshape the
data into a tibble
.
RPresto has been tested on Presto 0.100.
RPresto is BSD-licensed.
[^1]: See https://github.com/prestodb/presto/wiki/HTTP-Protocol for a description of the API.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.