In this note we will explain rqdatatable
immediate mode. We will also explain why one should still consider building explicit operator trees as a best practice in rqdatatable
/rquery
.
We are going to show some timings on small data. It has often been pointed out to not tune things on sub-second timings (as there is little point optimizing what is already fast and small examples may not show the same bottle necks as large examples). However, we have confirmed the rqdatatabl
in-memory issue scales with data (so is an actual problem) and the remaining timings are consistent with what we have already shared on larger (more expensive) examples.
library("dplyr") library("dbplyr") library("data.table") library("dtplyr") library("microbenchmark") library("WVPlots") library("rqdatatable") con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), host = 'localhost', port = 5432, user = 'johnmount', password = '') dbopts <- rq_connection_tests(con) db_info <- rquery_db_info(connection = con, is_dbi = TRUE, connection_options = dbopts)
flights <- nycflights13::flights flights <- do.call(rbind, rep(list(flights), 10)) str(flights) flights_dt <- data.table::as.data.table(flights) rq_copy_to(db_info, "flights", flights) flights_db <- tbl(con, "flights")
Example adapted from https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html.
dplyr
example.
tailnum_delay_dplyr <- flights %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) head(tailnum_delay_dplyr)
dtplyr
example.
class(flights_dt) tailnum_delay_dtplyr <- flights_dt %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) class(tailnum_delay_dtplyr) head(tailnum_delay_dtplyr)
rqdatatable
example.
ops <- flights %.>% local_td %.>% select_rows_nse(., !is.na(arr_delay)) %.>% extend_nse(., one = 1) %.>% project_nse(., groupby = "tailnum", delay = mean(arr_delay), n = sum(one)) %.>% orderby(., cols = "delay", reverse = "delay") %.>% select_rows_nse(., n > 100) tailnum_delay_rqdatatable <- flights %.>% ops head(tailnum_delay_rqdatatable) cat(format(ops))
We will compare four ways of processing the flights data.
dplyr
in-memory pipeline.rqdatatable_precompiled
using new data with an existing pipeline definition (the recommended way to use rquery
/rqdatatable
).rqdatatable_ops
building and then using an rqdatatable
pipeline (also recommended).rqdatatable_immediate
a convenience method for using rquery
operators directly on in-memory data, without taking the time to pre-define the operator pipeline. We will call this mode "immediate mode". The point is it is a user convenience- but it has some overhead.timings <- microbenchmark( dplyr = nrow( flights %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) ), dtplyr = nrow( flights_dt %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) ), rqdatatable_precompiled = nrow(flights %.>% ops), rqdatatable_ops = nrow( { ops1 <- flights %.>% local_td %.>% select_rows_nse(., !is.na(arr_delay)) %.>% extend_nse(., one = 1) %.>% project_nse(., groupby = "tailnum", delay = mean(arr_delay), n = sum(one)) %.>% orderby(., cols = "delay", reverse = "delay") %.>% select_rows_nse(., n > 100) nrow(flights %.>% ops1) } ), rqdatatable_immediate = nrow( flights %.>% select_rows_nse(., !is.na(arr_delay)) %.>% extend_nse(., one = 1) %.>% project_nse(., groupby = "tailnum", delay = mean(arr_delay), n = sum(one)) %.>% orderby(., cols = "delay", reverse = "delay") %.>% select_rows_nse(., n > 100) ) ) print(timings) timings <- as.data.frame(timings) timings$seconds <- timings$time/10^9 timings$method <- factor(timings$expr) timings$method <- reorder(timings$method, timings$seconds) WVPlots::ScatterBoxPlotH(timings, "seconds", "method", "task time by method")
The issue with rqdatatable_immediate
is that we are paying a extra overhead copying (possibly wide) intermediate tables to naively convert data.table
reference semantics to more R
-like value semantics. This copying is repeated at each stage and is without the traditional rquery
column liveness optimizations. This is because standard user/package defined operators (such as %.>%
) are left to right associative- so pipelines are executed left to right, so rquery
/rqdatatable
is operating in a fairly blind or degraded mode in this situation. Roughly: rqdatatable
in immediate mode is myopic (only can see one stage at a time) and fighting to bridge the difference between data.table
and expected R
semantics, so there are costs.
Our advice is: use rqdatatable
immediate only for convenience. Please get in the habit of building operator trees or pipelines with working with rqdatatable
/rquery
. rquery
is designed assuming the operator tree or pipeline is an finished object before it is given any data. This is a good choice given rquery
's "database first" design principles. In normal rquery
all operation sequences start with a "data description" and not with data (local_td
builds a data description from local data). This is core to how rquery
optimizes queries and minimizes copying and translation overhead. Immediate-mode is for ad-hoc work and is solely to save the user the small trouble of saving a pipeline definition (as in demonstrated in rqdatatable_ops
).
dplyr
, on the other hand, is an in-memory first design. We see it is optimized for in-memory operation. As is often discussed dplyr
gets "query as whole" effects by lazy evaluation, however we feel rquery
's more explicit user facing management of operator trees is in fact the better choice for database work.
dtplyr
currently has the same problem as immediate-mode rqdatatable
(to a lesser extent, though in dtplyr
's case there is no way to pre-build the operator tree to avoid the overhead). This is seen both in the runtime above and in the following note quoted from https://github.com/hadley/dtplyr/blob/master/README.md:
dtplyr
will always be a bit slower thandata.table
, because it creates copies of objects rather than mutating in place (that's thedplyr
philosophy). Currently,dtplyr
is quite a lot slower than baredata.table
because the methods aren't quite smart enough.
We emphasize the "smart enough" is likely meaning "tracking more state" (such as tracking object visibility to avoid copying) and probably not a pejorative.
tailnum_delay_dbplyr <- flights_db %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay, na.rm = TRUE), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) head(tailnum_delay_dbplyr)
db_info %.>% ops %.>% head(.)
timingsdb <- microbenchmark( dbplyr = flights_db %>% filter(!is.na(arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay, na.rm = TRUE), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) %>% collect(), rquery = db_info %.>% ops ) print(timingsdb) timingsdb <- as.data.frame(timingsdb) timingsdb$seconds <- timingsdb$time/10^9 timingsdb$method <- factor(timingsdb$expr) timingsdb$method <- reorder(timingsdb$method, timingsdb$seconds) WVPlots::ScatterBoxPlotH(timingsdb, "seconds", "method", "database task time by method")
DBI::dbDisconnect(con)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.