In WinVector/rqdatatable: 'rquery' for 'data.table'

(In-memory variation of the Collecting Expressions example.)

For in-memory operations sequential operations appear to not be a problem, as they do not contribute to query complexity as in our earlier database examples. This emphasizes that in-memory intuition must be confirmed when working with remote systems.

First set up our packages, database connection, and remote table.

library("dplyr")
library("rquery")
library("microbenchmark")
library("ggplot2")
library("WVPlots")
library("rqdatatable")
library("cdata")
library("data.table")


nrow <- 1000000
d <- data.frame(x = seq_len(nrow))

td <- local_td(d)

tbl <- as.tbl(d)

ncol <- 100

rqdatatable/rquery torture function: add r sprintf("%d", ncol) columns to a r sprintf("%d", nrow) row table. This is implemented using data.table in a batch mode.

rquery_fn_batch <- function(d, ncol) {
  expressions <- paste0("x + ", seq_len(ncol))
  names(expressions) <- paste0("x_", seq_len(ncol))
  ops <- local_td(d) %.>%
    extend_se(., expressions) %.>%
    select_rows_nse(., x == 3)
  d %.>% ops
}

rquery_fn_batch(d, 5)[]

The row-selection step is to cut down on the in-memory cost of bringing the result back to R. Obviously we could optimize the example away by pivoting the filter to earlier in the example pipeline. We ask the reader to take this example as a stand-in for a more complicated (though nasty) real-world example where such optimizations are not available.

To break out how much time we are losing in planning/parsing (something we intend to improve) we can more a pre-compiled version of the rquery batch query.

rquery_fn_batch_compiled <- function(d, ncol) {
  expressions <- paste0("x + ", seq_len(ncol))
  names(expressions) <- paste0("x_", seq_len(ncol))
  local_td(d) %.>%
    extend_se(., expressions) %.>%
    select_rows_nse(., x == 3)
}

ops <- rquery_fn_batch_compiled(d, 5)

(d %.>% ops)[]

We can also try a sequentinal version.

rquery_fn_seq <- function(d, ncol) {
  ops <- local_td(d) 
  for(i in seq_len(ncol)) {
    ops <- extend_se(ops, paste0("x_", i) %:=% paste0("x + ", i))
  }
  ops <- select_rows_nse(ops, x == 3)
  d %.>% ops
}

rquery_fn_seq(d, 5)[]

And a pre-compiled rquery sequential pipeline.

rquery_fn_seq_comp <- function(d, ncol) {
  ops <- local_td(d) 
  for(i in seq_len(ncol)) {
    ops <- extend_se(ops, paste0("x_", i) %:=% paste0("x + ", i))
  }
  select_rows_nse(ops, x == 3)
}

ops <- rquery_fn_seq_comp(d, 5)
(d %.>% ops)[]

Same torture for dplyr.

dplyr_fn <- function(tbl, ncol) {
  pipeline <- tbl
  xvar <- rlang::sym("x")
  for(i in seq_len(ncol)) {
    res_i <- rlang::sym(paste0("x_", i))
    pipeline <- pipeline %>%
      mutate(., !!res_i := !!xvar + i)
  }
  pipeline <- pipeline %>%
    filter(., x == 3)
  pipeline %>% compute(.)
}

dplyr_fn(tbl, 5)

We can also collect expressions efficiently using seplyr (seplyr is a thin wrapper over dplyr, so seplyr's method mutate_se() is essentially instructions how to do the same thing in batch using dplyr/rlang).

seplyr_fn <- function(tbl, ncol) {
  expressions <- paste0("x + ", seq_len(ncol))
  names(expressions) <- paste0("x_", seq_len(ncol))
  pipeline <- tbl %>%
    seplyr::mutate_se(., expressions) %>%
    filter(., x == 3)
  pipeline %>% compute(.)
}

seplyr_fn(tbl, 5)

And we can also run with data.table either sequentially (as below) or in batch (which was the rqdatatable result).

data_table_sequential_fn <- function(d, ncol) {
  # make sure we have a clean copy
  dt <- data.table::copy(as.data.table(d))
  for(i in seq_len(ncol)) {
    dt[, paste0("x_", i) := eval(parse(text=paste0("x + ", i)))]
  }
  dt[x==3, ]
}

data_table_sequential_fn(tbl, 5)

Time the functions.

opsbc <- rquery_fn_batch_compiled(d, ncol)
opssc <- rquery_fn_seq_comp(d, ncol)

timings <- microbenchmark(
  rqdatatable_batch = rquery_fn_batch(d, ncol),
  rqdatatable_batch_compiled = { d %.>% opsbc },
  rqdatatable_sequential = rquery_fn_seq(d, ncol),
  rqdatatable_sequential_compiled = { d %.>% opssc },
  dplyr = dplyr_fn(tbl, ncol),
  seplyr = seplyr_fn(tbl, ncol),
  data_table_sequential = data_table_sequential_fn(d, ncol),
  times = 100L)

saveRDS(timings, "CollectExprs_memory_timings.RDS")
print(timings)
#autoplot(timings)

Present the results.

timings <- as.data.frame(timings)
timings$seconds <- timings$time/10^9
timings$method <- factor(timings$expr)
timings$method <- reorder(timings$method, timings$seconds)
WVPlots::ScatterBoxPlotH(timings, "seconds", "method", "task time by method")

tratio <- timings %.>%
  project_nse(., 
              groupby = "method", 
              mean_seconds = mean(seconds)) %.>%
  pivot_to_rowrecs(., 
                   columnToTakeKeysFrom = "method", 
                   columnToTakeValuesFrom = "mean_seconds", 
                   rowKeyColumns = NULL) %.>%
  extend_nse(.,
             ratio = dplyr/rqdatatable_batch)

tratio[]

ratio_str <- sprintf("%.2g", 1/tratio$ratio)

rqdatatable in batch mode is about r ratio_str times slower than dplyr (and the other sequential implementations and even batch implementations) for this task at this scale for this data implementation and configuration. Likely this is due to copying and re-parsing overhead from rqdatatable itself (unlikely to be a data.table issue).

WinVector/rqdatatable documentation built on Aug. 22, 2023, 3:25 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com