(In-memory variation of the Collecting Expressions example.)
For in-memory operations sequential operations appear to not be a problem, as they do not contribute to query complexity as in our earlier database examples. This emphasizes that in-memory intuition must be confirmed when working with remote systems.
First set up our packages, database connection, and remote table.
library("dplyr") library("rquery") library("microbenchmark") library("ggplot2") library("WVPlots") library("rqdatatable") library("cdata") library("data.table") nrow <- 1000000 d <- data.frame(x = seq_len(nrow)) td <- local_td(d) tbl <- as.tbl(d) ncol <- 100
rqdatatable
/rquery
torture function: add r sprintf("%d", ncol)
columns to a r sprintf("%d", nrow)
row table. This is implemented using data.table
in a batch mode.
rquery_fn_batch <- function(d, ncol) { expressions <- paste0("x + ", seq_len(ncol)) names(expressions) <- paste0("x_", seq_len(ncol)) ops <- local_td(d) %.>% extend_se(., expressions) %.>% select_rows_nse(., x == 3) d %.>% ops } rquery_fn_batch(d, 5)[]
The row-selection step is to cut down on the in-memory cost of bringing the result back to R
. Obviously we could optimize the example away by pivoting the filter to earlier in the example pipeline. We ask the reader to take this example as a stand-in for a more complicated (though nasty) real-world example where such optimizations are not available.
To break out how much time we are losing in planning/parsing (something we intend to improve) we can more a pre-compiled version of the rquery
batch query.
rquery_fn_batch_compiled <- function(d, ncol) { expressions <- paste0("x + ", seq_len(ncol)) names(expressions) <- paste0("x_", seq_len(ncol)) local_td(d) %.>% extend_se(., expressions) %.>% select_rows_nse(., x == 3) } ops <- rquery_fn_batch_compiled(d, 5) (d %.>% ops)[]
We can also try a sequentinal version.
rquery_fn_seq <- function(d, ncol) { ops <- local_td(d) for(i in seq_len(ncol)) { ops <- extend_se(ops, paste0("x_", i) %:=% paste0("x + ", i)) } ops <- select_rows_nse(ops, x == 3) d %.>% ops } rquery_fn_seq(d, 5)[]
And a pre-compiled rquery
sequential pipeline.
rquery_fn_seq_comp <- function(d, ncol) { ops <- local_td(d) for(i in seq_len(ncol)) { ops <- extend_se(ops, paste0("x_", i) %:=% paste0("x + ", i)) } select_rows_nse(ops, x == 3) } ops <- rquery_fn_seq_comp(d, 5) (d %.>% ops)[]
Same torture for dplyr
.
dplyr_fn <- function(tbl, ncol) { pipeline <- tbl xvar <- rlang::sym("x") for(i in seq_len(ncol)) { res_i <- rlang::sym(paste0("x_", i)) pipeline <- pipeline %>% mutate(., !!res_i := !!xvar + i) } pipeline <- pipeline %>% filter(., x == 3) pipeline %>% compute(.) } dplyr_fn(tbl, 5)
We can also collect expressions efficiently using seplyr
(seplyr
is a thin wrapper over dplyr
, so seplyr
's method mutate_se()
is essentially instructions how to do the same thing in batch using dplyr
/rlang
).
seplyr_fn <- function(tbl, ncol) { expressions <- paste0("x + ", seq_len(ncol)) names(expressions) <- paste0("x_", seq_len(ncol)) pipeline <- tbl %>% seplyr::mutate_se(., expressions) %>% filter(., x == 3) pipeline %>% compute(.) } seplyr_fn(tbl, 5)
And we can also run with data.table
either sequentially (as below) or in batch (which was the rqdatatable
result).
data_table_sequential_fn <- function(d, ncol) { # make sure we have a clean copy dt <- data.table::copy(as.data.table(d)) for(i in seq_len(ncol)) { dt[, paste0("x_", i) := eval(parse(text=paste0("x + ", i)))] } dt[x==3, ] } data_table_sequential_fn(tbl, 5)
Time the functions.
opsbc <- rquery_fn_batch_compiled(d, ncol) opssc <- rquery_fn_seq_comp(d, ncol) timings <- microbenchmark( rqdatatable_batch = rquery_fn_batch(d, ncol), rqdatatable_batch_compiled = { d %.>% opsbc }, rqdatatable_sequential = rquery_fn_seq(d, ncol), rqdatatable_sequential_compiled = { d %.>% opssc }, dplyr = dplyr_fn(tbl, ncol), seplyr = seplyr_fn(tbl, ncol), data_table_sequential = data_table_sequential_fn(d, ncol), times = 100L) saveRDS(timings, "CollectExprs_memory_timings.RDS") print(timings) #autoplot(timings)
Present the results.
timings <- as.data.frame(timings) timings$seconds <- timings$time/10^9 timings$method <- factor(timings$expr) timings$method <- reorder(timings$method, timings$seconds) WVPlots::ScatterBoxPlotH(timings, "seconds", "method", "task time by method") tratio <- timings %.>% project_nse(., groupby = "method", mean_seconds = mean(seconds)) %.>% pivot_to_rowrecs(., columnToTakeKeysFrom = "method", columnToTakeValuesFrom = "mean_seconds", rowKeyColumns = NULL) %.>% extend_nse(., ratio = dplyr/rqdatatable_batch) tratio[] ratio_str <- sprintf("%.2g", 1/tratio$ratio)
rqdatatable
in batch mode is about r ratio_str
times slower than dplyr
(and the other sequential implementations and even batch implementations) for this task at this scale for this data implementation and configuration. Likely this is due to copying and re-parsing overhead from rqdatatable
itself (unlikely to be a data.table
issue).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.