In WinVector/rqdatatable: 'rquery' for 'data.table'

In this note we will explain rqdatatable immediate mode. We will also explain why one should still consider building explicit operator trees as a best practice in rqdatatable/rquery.

We are going to show some timings on small data. It has often been pointed out to not tune things on sub-second timings (as there is little point optimizing what is already fast and small examples may not show the same bottle necks as large examples). However, we have confirmed the rqdatatabl in-memory issue scales with data (so is an actual problem) and the remaining timings are consistent with what we have already shared on larger (more expensive) examples.

library("dplyr")
library("dbplyr")
library("data.table")
library("dtplyr")
library("microbenchmark")
library("WVPlots")
library("rqdatatable")


con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                      host = 'localhost',
                      port = 5432,
                      user = 'johnmount',
                      password = '')

dbopts <- rq_connection_tests(con)
db_info <- rquery_db_info(connection = con, 
                          is_dbi = TRUE,
                          connection_options = dbopts)

flights <- nycflights13::flights
flights <- do.call(rbind, rep(list(flights), 10))
str(flights)

flights_dt <- data.table::as.data.table(flights)

rq_copy_to(db_info, "flights", flights)

flights_db <- tbl(con, "flights")

In-memory examples

Example adapted from https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html.

dplyr example.

tailnum_delay_dplyr <- flights %>% 
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  ) %>% 
  arrange(desc(delay)) %>%
  filter(n > 100) 

head(tailnum_delay_dplyr)

dtplyr example.

class(flights_dt)

tailnum_delay_dtplyr <- flights_dt %>% 
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  ) %>% 
  arrange(desc(delay)) %>%
  filter(n > 100) 

class(tailnum_delay_dtplyr)
head(tailnum_delay_dtplyr)

rqdatatable example.

ops <- flights %.>%
  local_td %.>%
  select_rows_nse(., !is.na(arr_delay)) %.>%
  extend_nse(.,
             one = 1) %.>%
  project_nse(., 
              groupby = "tailnum",
              delay = mean(arr_delay),
              n = sum(one)) %.>%
  orderby(., cols = "delay", reverse = "delay") %.>%
  select_rows_nse(., n > 100)

tailnum_delay_rqdatatable <- flights %.>% ops

head(tailnum_delay_rqdatatable)

cat(format(ops))

We will compare four ways of processing the flights data.

dplyr in-memory pipeline.
rqdatatable_precompiled using new data with an existing pipeline definition (the recommended way to use rquery/rqdatatable).
rqdatatable_ops building and then using an rqdatatable pipeline (also recommended).
rqdatatable_immediate a convenience method for using rquery operators directly on in-memory data, without taking the time to pre-define the operator pipeline. We will call this mode "immediate mode". The point is it is a user convenience- but it has some overhead.

timings <- microbenchmark(
  dplyr = nrow(
    flights %>% 
      filter(!is.na(arr_delay)) %>%
      group_by(tailnum) %>%
      summarise(
        delay = mean(arr_delay),
        n = n()
      ) %>% 
      arrange(desc(delay)) %>%
      filter(n > 100) 
  ),
  dtplyr = nrow(
    flights_dt %>% 
      filter(!is.na(arr_delay)) %>%
      group_by(tailnum) %>%
      summarise(
        delay = mean(arr_delay),
        n = n()
      ) %>% 
      arrange(desc(delay)) %>%
      filter(n > 100) 
  ),
  rqdatatable_precompiled = nrow(flights %.>% ops),
  rqdatatable_ops = nrow(
    {
      ops1 <- flights %.>%
        local_td %.>%
        select_rows_nse(., !is.na(arr_delay)) %.>%
        extend_nse(.,
                   one = 1) %.>%
        project_nse(., 
                    groupby = "tailnum",
                    delay = mean(arr_delay),
                    n = sum(one)) %.>%
        orderby(., cols = "delay", reverse = "delay") %.>%
        select_rows_nse(., n > 100)
      nrow(flights %.>% ops1)
    }
  ),
  rqdatatable_immediate = nrow(
    flights %.>%
      select_rows_nse(., !is.na(arr_delay)) %.>%
      extend_nse(.,
                 one = 1) %.>%
      project_nse(., 
                  groupby = "tailnum",
                  delay = mean(arr_delay),
                  n = sum(one)) %.>%
      orderby(., cols = "delay", reverse = "delay") %.>%
      select_rows_nse(., n > 100)
  )
)

print(timings)

timings <- as.data.frame(timings)
timings$seconds <- timings$time/10^9 
timings$method <- factor(timings$expr)
timings$method <- reorder(timings$method, timings$seconds)
WVPlots::ScatterBoxPlotH(timings,  "seconds", "method", "task time by method")

The issue with rqdatatable_immediate is that we are paying a extra overhead copying (possibly wide) intermediate tables to naively convert data.table reference semantics to more R-like value semantics. This copying is repeated at each stage and is without the traditional rquery column liveness optimizations. This is because standard user/package defined operators (such as %.>%) are left to right associative- so pipelines are executed left to right, so rquery/rqdatatable is operating in a fairly blind or degraded mode in this situation. Roughly: rqdatatable in immediate mode is myopic (only can see one stage at a time) and fighting to bridge the difference between data.table and expected R semantics, so there are costs.

Our advice is: use rqdatatable immediate only for convenience. Please get in the habit of building operator trees or pipelines with working with rqdatatable/rquery. rquery is designed assuming the operator tree or pipeline is an finished object before it is given any data. This is a good choice given rquery's "database first" design principles. In normal rquery all operation sequences start with a "data description" and not with data (local_td builds a data description from local data). This is core to how rquery optimizes queries and minimizes copying and translation overhead. Immediate-mode is for ad-hoc work and is solely to save the user the small trouble of saving a pipeline definition (as in demonstrated in rqdatatable_ops).

dplyr, on the other hand, is an in-memory first design. We see it is optimized for in-memory operation. As is often discussed dplyr gets "query as whole" effects by lazy evaluation, however we feel rquery's more explicit user facing management of operator trees is in fact the better choice for database work.

dtplyr currently has the same problem as immediate-mode rqdatatable (to a lesser extent, though in dtplyr's case there is no way to pre-build the operator tree to avoid the overhead). This is seen both in the runtime above and in the following note quoted from https://github.com/hadley/dtplyr/blob/master/README.md:

dtplyr will always be a bit slower than data.table, because it creates copies of objects rather than mutating in place (that's the dplyr philosophy). Currently, dtplyr is quite a lot slower than bare data.table because the methods aren't quite smart enough.

We emphasize the "smart enough" is likely meaning "tracking more state" (such as tracking object visibility to avoid copying) and probably not a pejorative.

Database examples

tailnum_delay_dbplyr <- flights_db %>% 
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>% 
  arrange(desc(delay)) %>%
  filter(n > 100) 

head(tailnum_delay_dbplyr)

db_info %.>% 
  ops %.>%
  head(.)

timingsdb <- microbenchmark(
  dbplyr = flights_db %>%
    filter(!is.na(arr_delay)) %>%
    group_by(tailnum) %>%
    summarise(
      delay = mean(arr_delay, na.rm = TRUE),
      n = n()
    ) %>%
    arrange(desc(delay)) %>%
    filter(n > 100) %>%
    collect(),
  rquery = db_info %.>% 
    ops 
)

print(timingsdb)

timingsdb <- as.data.frame(timingsdb)
timingsdb$seconds <- timingsdb$time/10^9 
timingsdb$method <- factor(timingsdb$expr)
timingsdb$method <- reorder(timingsdb$method, timingsdb$seconds)
WVPlots::ScatterBoxPlotH(timingsdb,  "seconds", "method", "database task time by method")